Yeah, we've had several issues like this opened over the years. We've done everything that we can to speed up kube-router, but in the end, 99% of the processing time is spent in syscalls and exec calls out to ipset / iptables, which don't have APIs.
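To make that concrete, the sketch below (illustrative Go only, not kube-router's actual code path, with a placeholder ruleset) shows the kind of fork/exec round trip that every policy sync ends up paying when rules have to go through iptables-restore instead of an in-process API:

```go
// Rough illustration (not kube-router's actual code): because iptables and
// ipset expose no stable programmatic API, user-space controllers end up
// shelling out for every sync, and that exec + full table reload is where
// most of the latency goes.
package main

import (
	"bytes"
	"fmt"
	"os/exec"
)

// restoreRules feeds a complete ruleset to iptables-restore in one exec.
// Each policy change still costs at least one fork/exec plus a kernel-side
// rule replace, no matter how fast the controller itself is.
func restoreRules(ruleset string) error {
	cmd := exec.Command("iptables-restore", "--noflush")
	cmd.Stdin = bytes.NewBufferString(ruleset)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("iptables-restore failed: %v: %s", err, out)
	}
	return nil
}

func main() {
	// Placeholder ruleset; real policy chains are far larger.
	rules := "*filter\n:EXAMPLE-CHAIN - [0:0]\n-A EXAMPLE-CHAIN -j RETURN\nCOMMIT\n"
	if err := restoreRules(rules); err != nil {
		fmt.Println(err)
	}
}
```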
It would be nice if the kubelet gave network providers a way to indicate that the pod sandbox is fully set up before it starts the container, but unfortunately, that has never been made part of the spec.
I currently run kube-router in several largish clusters (100+ nodes with thousands of pods) and network policy sync reliably completes in less than 2 seconds on average, but for containers that need network access immediately on startup we still recommend that users add a delay, or include application-level failure handling on startup (see the sketch below).
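For the application-level handling, something along these lines (a minimal Go sketch with a hypothetical target address and timings, not code from this issue) is usually enough to ride out the short window before policy is programmed on the node:

```go
// Minimal sketch (not kube-router code): retry an initial outbound
// connection with backoff so a freshly started pod tolerates the few
// seconds before network policy rules have been applied.
package main

import (
	"fmt"
	"net"
	"time"
)

// dialWithRetry keeps trying to reach addr for up to maxWait, sleeping
// with exponential backoff, instead of failing on the first
// "connection reset by peer".
func dialWithRetry(addr string, maxWait time.Duration) (net.Conn, error) {
	deadline := time.Now().Add(maxWait)
	backoff := 250 * time.Millisecond
	for {
		conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
		if err == nil {
			return conn, nil
		}
		if time.Now().After(deadline) {
			return nil, fmt.Errorf("giving up after %s: %w", maxWait, err)
		}
		time.Sleep(backoff)
		if backoff < 2*time.Second {
			backoff *= 2
		}
	}
}

func main() {
	// "db.example.svc:5432" is a placeholder, not an address from the issue.
	conn, err := dialWithRetry("db.example.svc:5432", 10*time.Second)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	fmt.Println("connected")
}
```

An initContainer that simply sleeps for a few seconds achieves a similar effect without touching application code, at the cost of slowing down every pod start.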
As a slight side note, if the policy sync times you see are really far out of bounds for your cluster size (e.g. more than 5-10 seconds), it's worth keeping an eye out for kernel issues.
This performance regression hit a cluster I was working on a while back: https://lore.kernel.org/lkml/b333bc85-83ea-8869-ccf7-374c9456d93c@blackhole.kfki.hu/T/
And that's definitely not the first time kube-router has run into upstream issues with either the netfilter code in the kernel or the netfilter userspace tools.
What happened?
I began to get "connection reset by peer"-like messages from my jobs/cronjobs after turning on network policies.

What did you expect to happen?
Policies to work.
How can we reproduce the behavior you experienced?
Steps to reproduce the behavior:
Screenshots / Architecture Diagrams / Network Topologies
Not working variant: (screenshot in the original issue)
Working variant: (screenshot in the original issue)
Additional context
As you can see, kube-router is not fast enough at handling fresh pods. To be honest, I think kube-router is not the one to blame here.