Open · skmatti opened this issue 1 year ago
@skmatti Thank you for the writeup, this is indeed a catastrophic scenario. It's also sort-of expected for things to break when the policy map gets full. It needs to be monitored and properly sized depending on the number of identities you expect to maintain in the cluster.
Is there something specific you'd like to point out, or is this more of a question? We're working on improving map reconciliation logic for 1.15/1.16, but a full map is a full map. To recover from this, I guess you could unpin the policy map from /sys/fs/bpf on all nodes and restart all agents. This should at least unblock the policy update loop if that's what's causing the disruption.
My guess is there might've been some policy updates happening in the background while you were adding nodes/Pods, leading to an inconsistent view of the world. When resource limits are hit, it's hard to predict what'll happen.
Thanks @ti-mo for the response.
> Is there something specific you'd like to point out, or is this more of a question?
I think it would be good to have some ordering (or priority) when populating the policy maps. Initially, the policy map has the remote-node entry to allow traffic from other control plane nodes. New endpoints that cause the map to overflow should not replace the existing map entries (such as remote-node). Currently, this is not the case.
Heads up: there is also some very relevant discussion in https://github.com/cilium/cilium/issues/27866 regarding how to formulate policies to minimize the likelihood of this scenario.
Note that the remote-node entity is critical for cluster operation, as the etcd instances on all admin nodes must communicate with each other and establish quorum for the API server to be healthy.
I am wondering whether a toNodes selector in policy (by node labels) would provide a more reliable mechanism for such a network policy, i.e., the feature proposed in https://github.com/cilium/cilium/issues/19121.
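A hypothetical sketch of what such a policy could look like, assuming the node-label selector proposed in #19121. The `fromNodes` field and the labels here are illustrative only, not an API that exists in the version under discussion:

```yaml
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: allow-etcd-peers
spec:
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/control-plane: ""
  ingress:
    - fromNodes:                 # illustrative field, per the proposal in #19121
        - matchLabels:
            node-role.kubernetes.io/control-plane: ""
      toPorts:
        - ports:
            - port: "2379"       # etcd client traffic
              protocol: TCP
            - port: "2380"       # etcd peer traffic
              protocol: TCP
```

Selecting peers by node labels would not depend on the remote-node entry surviving in an overflowing policy map, which is the reliability argument being made here.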
One other idea that could assist with this: a way to express priority for the resolution of policy entries, so that in this particular scenario control plane traffic could be prioritized. There is no such notion in Cilium's policy engine today. I'm not sure whether this should simply be a hardcoded priority ordering like "remote-node is more important than others", along the lines of the suggestion above, or something more explicit.
(Swapping the agent/datapath labels: although the issue is triggered by a datapath map being full, I think a big part of it is caused by the agent's policy calculation pieces. I don't anticipate any BPF or even userspace Go datapath-management changes in order to improve Cilium's behaviour in this scenario. The solutions are more likely at the higher levels of policy calculation, perhaps even touching on APIs.)
Is there an existing issue for this?
What happened?
We have the host firewall enabled for all admin nodes of a cluster. We configured our host policies to allow traffic from the entities and a few CIDRs.
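A minimal host policy along these lines might look as follows (illustrative only; the policy name, node labels, and CIDR are placeholders, not taken from the affected cluster):

```yaml
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: host-fw-admin-nodes
spec:
  nodeSelector:                  # selects the host endpoints this applies to
    matchLabels:
      node-role.kubernetes.io/control-plane: ""
  ingress:
    - fromEntities:
        - remote-node            # etcd peers on the other admin nodes
        - health
    - fromCIDR:
        - 10.0.0.0/8             # placeholder for the allowed CIDRs
```

Allowing remote-node here is what keeps etcd peer traffic flowing, which is why a missing remote-node entry in the BPF policy map is so damaging.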
Note that the remote-node entity is critical for cluster operation, as the etcd instances of all admin nodes must communicate with each other and establish quorum for the API server to be healthy.

If the number of identities in the cluster exceeds 16k, the policy map for the host endpoint is full and the endpoint is unhealthy. Our cluster ended up in a situation where etcd connectivity between all the admin nodes was broken (possibly due to the remote-node entry missing from the BPF policy map). API server calls from various components, including the cilium agent and operator, fail due to this issue. The host endpoint regeneration kept failing. At this point, the cluster went into a permanent failure state due to the following issues:
What caused the traffic between the etcd instances to drop when the policy map is full? Is it possible that the remote-node entry is missing from the map? Why is the remote node IP resolved to world and the local node IP resolved to unknown?

Cilium Version
1.12
Kernel Version
Linux 5.4.0-162-generic
Kubernetes Version
1.27
Sysdump
No response
Relevant log output
No response
Anything else?
No response
Code of Conduct