Workaround (fix?) being worked on at https://github.com/cilium/cilium/pull/22600.
Assigning @tommyp1ckles but that's just to reflect who is looking into it at the moment.
I've traced through the logs for some of the cases where this happens. Each one appears to be the result of a Pod being deleted just prior to Cilium restarting, seemingly related to test cleanup.
Example: pod/test-k8s2-7f96d84c65-9s5cb (in the last example: a4563fb1_K8sAgentHubbleTest_Hubble_Observe_Test_L3-L4_Flow.zip)
The only missing piece is where these CEPs are getting deleted: if the EndpointSynchronizer is failing to delete due to an empty UID, and the CNI delete is failing, what's removing the CEP? Probably the operator?
In other cases, the origin of the delete is clear (it is logged as being part of the test). Importantly, they all share the same failure to restore due to a missing Pod.
So, however they're getting into that state, I think this looks OK as far as the CEP cleanup doing its job. It is notable that the CEP is still in the cache during the CEP cleanup, and because restore fails, the cleanup attempts to delete CEPs that no longer exist.
tl;dr: the unusual logs seem to be related to the close proximity of cleaning up old test pods and restarting Cilium for a new test suite. In this case, we should make changes similar to Jarno's, where we avoid logging a warning in that situation (info is fine).
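For reference, the cleanup path I'm describing is roughly this shape. This is a minimal Go sketch of the idea, not Cilium's actual implementation: the `restoredEndpoint` type and `deleteCEPFunc` helper are made up for illustration, and only the NotFound check from apimachinery is a real API.

```go
package sketch

import (
	"context"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// restoredEndpoint is a stand-in for an endpoint the agent attempted to
// restore after a restart.
type restoredEndpoint struct {
	Namespace string
	PodName   string
	Restored  bool // false when restore failed, e.g. because the Pod is gone
}

// deleteCEPFunc abstracts the API-server delete; in real code this would go
// through the CiliumEndpoint client.
type deleteCEPFunc func(ctx context.Context, namespace, name string) error

// cleanStaleCEPs removes CEPs belonging to endpoints that could not be
// restored. A NotFound from the API server means the CEP is already gone,
// which is the desired end state, so it is not treated as a failure.
func cleanStaleCEPs(ctx context.Context, eps []restoredEndpoint, del deleteCEPFunc) {
	for _, ep := range eps {
		if ep.Restored {
			continue // endpoint is managed again; its CEP is not stale
		}
		err := del(ctx, ep.Namespace, ep.PodName)
		switch {
		case err == nil:
			fmt.Printf("deleted stale CEP %s/%s\n", ep.Namespace, ep.PodName)
		case apierrors.IsNotFound(err):
			// Something else already removed it (normal delete path before
			// the restart, operator GC, ...): a benign no-op.
			fmt.Printf("stale CEP %s/%s already gone\n", ep.Namespace, ep.PodName)
		default:
			fmt.Printf("failed to delete stale CEP %s/%s: %v\n", ep.Namespace, ep.PodName, err)
		}
	}
}
```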
> The only missing piece is where these CEPs are getting deleted: if the EndpointSynchronizer is failing to delete due to an empty UID, and the CNI delete is failing, what's removing the CEP? Probably the operator?
Could we add debug logs in the operator to confirm that? I.e., not in a PR but actually merged in master as it's useful longer term as well, no?
> So, however they're getting into that state, I think this looks OK as far as the CEP cleanup doing its job. It is notable that the CEP is still in the cache during the CEP cleanup, and because restore fails, the cleanup attempts to delete CEPs that no longer exist.
I'm a bit confused. Are you saying the CEP doesn't exist because the endpoint couldn't be restored or it doesn't exist because it was deleted by the operator?
> tl;dr: the unusual logs seem to be related to the close proximity of cleaning up old test pods and restarting Cilium for a new test suite. In this case, we should make changes similar to Jarno's, where we avoid logging a warning in that situation (info is fine).
On the contrary, I'd go for a warning here. Warnings don't currently fail our CI and even if they did we could add an exception. The reason I'd go for a warning is because it sounds like you're saying it's an artifact of how our CI runs, but shouldn't typically happen in user environments.
> Could we add debug logs in the operator to confirm that? I.e., not in a PR but actually merged in master as it's useful longer term as well, no?
:+1:
> I'm a bit confused. Are you saying the CEP doesn't exist because the endpoint couldn't be restored or it doesn't exist because it was deleted by the operator?
By the time the restore happens, I know the Pod is missing (and I assume the CEP is too). The endpoint is still in the restore state, so Cilium tries to restore it but fails because the Pod is missing.
At the same time, the CEP is still in the k8s cache but is not actually being managed, so the cleanup is attempted but fails because the CEP is not in the API server.
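To illustrate why the cache and the API server can disagree here: an informer-backed cache only drops an object after the delete watch event is processed, so a lister read can still return something that is already gone upstream. A generic client-go sketch, using Pods as a stand-in for CEPs since the mechanics are the same (the names here are illustrative, not Cilium code):

```go
package sketch

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

// demo shows that a lister (informer cache) can lag behind the API server.
func demo(ctx context.Context, cs kubernetes.Interface) {
	factory := informers.NewSharedInformerFactory(cs, 0)
	podLister := factory.Core().V1().Pods().Lister()
	factory.Start(ctx.Done())
	factory.WaitForCacheSync(ctx.Done())

	// Delete directly against the API server.
	_ = cs.CoreV1().Pods("default").Delete(ctx, "example-pod", metav1.DeleteOptions{})

	// Immediately afterwards the lister may still return the object,
	// because the delete watch event has not been processed yet.
	if p, err := podLister.Pods("default").Get("example-pod"); err == nil {
		fmt.Println("cache still has", p.Name, "even though it was just deleted")
	}

	time.Sleep(time.Second) // give the watch event time to propagate

	// A direct GET is authoritative and reports NotFound right away.
	_, err := cs.CoreV1().Pods("default").Get(ctx, "example-pod", metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		fmt.Println("API server confirms the object is gone")
	}
}
```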
> On the contrary, I'd go for a warning here. Warnings don't currently fail our CI and even if they did we could add an exception. The reason I'd go for a warning is because it sounds like you're saying it's an artifact of how our CI runs, but shouldn't typically happen in user environments.
Makes sense. Maybe info in that case, since we're capturing the NotFound scenario separately?
> On the contrary, I'd go for a warning here. Warnings don't currently fail our CI and even if they did we could add an exception. The reason I'd go for a warning is because it sounds like you're saying it's an artifact of how our CI runs, but shouldn't typically happen in user environments.
> Makes sense. Maybe info in that case, since we're capturing the NotFound scenario separately?
I think you're better able to judge than me in this case.
In general, I'd consider warnings only for cases where we know something is wrong but don't expect it to affect normal operations. If it affected normal operations, it should be an error. If we are unsure whether something is wrong, it should be info, mostly because a warning for normal operations is very annoying to users, as they can't really avoid it (and it thus tends to normalize warnings in logs).
> In general, I'd consider warnings only for cases where we know something is wrong but don't expect it to affect normal operations. If it affected normal operations, it should be an error. If we are unsure whether something is wrong, it should be info, mostly because a warning for normal operations is very annoying to users, as they can't really avoid it (and it thus tends to normalize warnings in logs).
Yeah I tend to prefer info over warning, unless the impact is really unclear.
In this case, I would say that a no-op cleanup is expected behavior (when the CEP is missing already) and an info is sufficient.
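So the change being discussed is roughly of this shape: treat a NotFound during the stale-CEP cleanup as a benign no-op and log it at info, keeping warning/error for genuinely unexpected failures. Hedged sketch only (logrus, which Cilium uses, but the function and field names below are placeholders, not the actual diff in https://github.com/cilium/cilium/pull/22600):

```go
package sketch

import (
	"context"

	"github.com/sirupsen/logrus"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// deleteStaleCEP wraps a delete attempt and picks the log level based on the
// outcome: NotFound is the desired end state, so it only gets an info log.
func deleteStaleCEP(ctx context.Context, log *logrus.Entry, del func(ctx context.Context) error) {
	err := del(ctx)
	switch {
	case err == nil:
		log.Info("Deleted stale CiliumEndpoint")
	case apierrors.IsNotFound(err):
		// Expected when the CEP was already removed before the agent
		// restarted (e.g. test teardown racing the restart): info only.
		log.Info("Stale CiliumEndpoint already deleted, nothing to do")
	default:
		// Unexpected API error: this may actually affect operations.
		log.WithError(err).Error("Failed to delete stale CiliumEndpoint")
	}
}
```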
Updated PR here: https://github.com/cilium/cilium/pull/22600
But we still don't really know who deleted the CEP, right?
If the Pod is missing at startup restore, then it must have been deleted (it's not like the CEP is just missing with the Pod sticking around). I'm just confused since I see PodSandboxErrors for missing CNI events with no successful delete.
Somehow, it seems like a legitimate sandbox delete went through...
The reason the CEP is still in the CES at startup is seemingly that the operator is stuck blocking on acquiring a lease, so it doesn't actually start running until a few minutes after this. So that explains why the cache has a CES with missing CEPs at least.
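That would match the standard client-go leader-election pattern: nothing in the controller runs until the Lease lock is acquired, so a blocked lease delays all of the operator's work, including any CEP/CES garbage collection. Generic sketch of the pattern only (not the operator's actual code; names like `startControllers` and the lease name are placeholders):

```go
package sketch

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runWithLeaderElection blocks until the Lease is acquired; only then is
// startControllers invoked, which is why a contended or stuck lease can
// delay all reconciliation for minutes after the pod starts.
func runWithLeaderElection(ctx context.Context, cs kubernetes.Interface, identity string, startControllers func(context.Context)) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{
			Name:      "example-operator", // placeholder lease name
			Namespace: "kube-system",
		},
		Client:     cs.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: identity},
	}

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			// Only invoked once the lease is held; all reconciliation
			// (including any CEP/CES garbage collection) starts here.
			OnStartedLeading: startControllers,
			OnStoppedLeading: func() {},
		},
	})
}
```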
I'll look at the method of teardown between these tests this afternoon; maybe there's something that's just not being captured in the logs.
> Could we add debug logs in the operator to confirm that? I.e., not in a PR but actually merged in master as it's useful longer term as well, no?
I don't think the operator is actually doing anything here, as it's not actually running at that time.
Looking through a few more instances of the warning:
2022-12-01T20:11:11.009769887Z level=debug msg="Error regenerating endpoint: BPF template compilation failed: failed to compile template program /var/run/cilium/state/templates/03972ae32602b2e12bcd141f504f54c66386382f8301f74d20ea4fedb2ca379d: Failed to compile bpf_lxc.dbg.o: Command execution failed for [llc -march=bpf -mcpu=v1 -filetype=obj -o /var/run/cilium/state/templates/03972ae32602b2e12bcd141f504f54c66386382f8301f74d20ea4fedb2ca379d/bpf_lxc.dbg.o]: context canceled" code=Failure containerID=599f17c102 datapathPolicyRevision=0 desiredPolicyRevision=1 endpointID=1764 endpointState=waiting-to-regenerate identity=54977 ipv4=10.0.0.156 ipv6=10.0.0.156 k8sPodName=default/echo-55fdf5787d-2jr79 policyRevision=0 subsys=endpoint type=200
...
^ Not sure why this happened, but it seems to have resulted in EP restore failing out.
2022-12-03T19:03:21.168642187Z level=debug msg="CEP deleted, calling endpointDeleted" CEPName=kube-system/coredns-69b675786c-ws4nv CESName=ces-bcn6s7dw7-wwby7 subsys=k8s-watcher
...
2022-12-03T19:03:23.138284852Z level=error msg="Failed to get possibly stale ciliumendpoints from apiserver, skipping." ciliumEndpointName=coredns-69b675786c-ws4nv error="ciliumendpoints.cilium.io \"coredns-69b675786c-ws4nv\" not found" k8sNamespace=kube-system subsys=daemon
1 testclient-2-s76t8.172d5d1d962971a2 default 115s Normal Killing pod/testclient-2-s76t8 spec.containers{web} kubelet, k8s1 Stopping container
...
2022-12-03T18:53:04.082902496Z level=error msg="Failed to get possibly stale ciliumendpoints from apiserver, skipping." ciliumEndpointName=testclient-2-s76t8 error="ciliumendpoints.cilium.io \"testclient-2-s76t8\" not found" k8sNamespace=default subsys=daemon
Looking through a few more cases in the test dumps, most instances have some explanation for why the CEP is missing at the time of the cleanup. That previous case was the one where it was a bit unclear where exactly the delete happened.
The cleanup between tests is what's causing the situation to arise more often than you would expect under normal circumstances.
There are some interesting cases in the circumstances preceding the log, but I don't see much of a pattern.
PR was merged, closing for now.
This looks very similar to #21175, with the difference being that #21175's error log is `Cannot create CEP`. It even confused MLH (see https://github.com/cilium/cilium/issues/21175#issuecomment-1341041949).