Open · skmatti opened this issue 1 year ago
@skmatti Thank you for the writeup, this is indeed a catastrophic scenario. It's also sort-of expected for things to break when the policy map gets full. It needs to be monitored and properly sized depending on the number of identities you expect to maintain in the cluster.
Is there something specific you'd like to point out, or is this more of a question? We're working on improving map reconciliation logic for 1.15/1.16, but a full map is a full map. To recover from this, I guess you could unpin the policy map from /sys/fs/bpf on all nodes and restart all agents. This should at least unblock the policy update loop if that's what's causing the disruption.
My guess is there might've been some policy updates happening in the background while you were adding nodes/Pods, leading to an inconsistent view of the world. When resource limits are hit, it's hard to predict what'll happen.
Thanks @ti-mo for the response.
> Is there something specific you'd like to point out, or is this more of a question?
I think it would be good to have some ordering (or priority) when populating the policy maps. Initially, the policy map has the remote-node entry to allow traffic from other control plane nodes. New endpoints that cause the map to overflow should not replace the existing map entries (such as remote-node). Currently, this is not the case.
Heads up: there is also some very relevant discussion in https://github.com/cilium/cilium/issues/27866 regarding how to formulate policies to minimize the likelihood of this scenario.
Note that the remote-node entity is critical for cluster operation, as the etcd instances on all admin nodes must communicate with each other and establish quorum for the API server to be healthy.
I am wondering whether a toNodes selector in policy (by node labels) would provide a more reliable mechanism for such a network policy, i.e., the feature proposed in https://github.com/cilium/cilium/issues/19121.
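A hypothetical sketch of what such a policy could look like, assuming the node-label selector proposed in #19121. The `fromNodes` field and the labels here are illustrative only, not an API that exists in the version under discussion:

```yaml
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: allow-etcd-peers
spec:
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/control-plane: ""
  ingress:
    - fromNodes:                 # illustrative field, per the proposal in #19121
        - matchLabels:
            node-role.kubernetes.io/control-plane: ""
      toPorts:
        - ports:
            - port: "2379"       # etcd client traffic
              protocol: TCP
            - port: "2380"       # etcd peer traffic
              protocol: TCP
```

Selecting peers by node labels would not depend on the remote-node entry surviving in an overflowing policy map, which is the reliability argument being made here.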
One other idea that could assist with this: a way to express priority for the resolution of policy entries, so that in this particular scenario control plane traffic could be prioritized. There is no such notion in Cilium's policy engine today. I'm not sure whether this should simply be a hardcoded priority ordering like "remote-node is more important than others", along the lines of the suggestion above, or something more explicit.
(Swapping the agent/datapath labels: although the issue is triggered by a datapath map being full, I think a big part of it is caused by the agent's policy calculation pieces. I don't anticipate any BPF or even userspace Go datapath-management changes in order to improve Cilium's behaviour in this scenario. The solutions are more likely at the higher levels of policy calculation, perhaps even touching on APIs.)
Is there an existing issue for this?
What happened?
We have the host firewall enabled for all admin nodes of a cluster. We configured our host policies to allow traffic from the entities and a few CIDRs.
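A minimal host policy along these lines might look as follows (illustrative only; the policy name, node labels, and CIDR are placeholders, not taken from the affected cluster):

```yaml
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: host-fw-admin-nodes
spec:
  nodeSelector:                  # selects the host endpoints this applies to
    matchLabels:
      node-role.kubernetes.io/control-plane: ""
  ingress:
    - fromEntities:
        - remote-node            # etcd peers on the other admin nodes
        - health
    - fromCIDR:
        - 10.0.0.0/8             # placeholder for the allowed CIDRs
```

Allowing remote-node here is what keeps etcd peer traffic flowing, which is why a missing remote-node entry in the BPF policy map is so damaging.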
Note that the remote-node entity is critical for cluster operation, as the etcd instances of all admin nodes must communicate with each other and establish quorum for the API server to be healthy.

If the number of identities in the cluster exceeds 16k, the policy map for the host endpoint is full and the endpoint is unhealthy. Our cluster ended up in a situation where etcd connectivity between all the admin nodes was broken (possibly due to the remote-node entry missing from the BPF policy map). API server calls from various components, including the cilium agent and operator, fail due to this issue. The host endpoint regeneration kept failing. At this point, the cluster went into a permanent failure state due to the following issues:
What caused the traffic between the etcd instances to drop when the policy map is full? Is it possible that the remote-node entry is missing from the map? Why is the remote node IP resolved to world and the local node IP resolved to unknown?

Cilium Version
1.12
Kernel Version
Linux 5.4.0-162-generic
Kubernetes Version
1.27
Sysdump
No response
Relevant log output
No response
Anything else?
No response
Code of Conduct