gandro opened 1 year ago
Unfortunately, the sysdump is missing. Most likely it's a race (the operator didn't finish creating the CRD type), and thus red-herring. cc @cilium/sig-k8s
I have a local copy of the sysdump (it seems it has since been overwritten when I restarted the run, and I also cannot find the link anymore):
I only did a superficial triage, but I also think that this is most likely a race. The error should likely not be logged as an error if we are retrying (which the agent seems to do here, i.e. I later see it handling CNPs correctly).
I think this might be a recent regression because AFAIK, no Cilium code should proceed that requires the CCNP / CNP CRD to exist until Cilium has waited for all CRDs to be created. So I suspect some sort of ordering of operations was subtly changed, cc @cilium/sig-foundations.
I took a look at the issue together with @bimmlerd .
The CRD creation has recently been moved to its own cell in the operator, and it is now the first one invoked after the leader election. Before this, it was executed right before the leader started. This is suspicious, but it shouldn't be a problem, since the agent blocks here:
until all CRDs are registered. CCNP is listed among the others. That algorithm looks correct and it has been unchanged for a long time.
The error in the agent log comes from the `Events()` method called in `cilium_network_policy.go`, which starts the CCNP informer. But the agent k8s watchers are started in `InitK8sSubsystem`, which runs after the blocking wait above.
One thing to note is the timing in the operator log:
2023-06-29T15:17:22.586718280Z level=info msg="Leading the operator HA deployment" subsys=cilium-operator-generic
2023-06-29T15:17:22.598339583Z level=info msg="Creating CRD (CustomResourceDefinition)..." name=ciliuml2announcementpolicies.cilium.io subsys=k8s
2023-06-29T15:17:22.623310757Z level=info msg="Creating CRD (CustomResourceDefinition)..." name=ciliumcidrgroups.cilium.io subsys=k8s
2023-06-29T15:17:22.623333202Z level=info msg="Creating CRD (CustomResourceDefinition)..." name=ciliumidentities.cilium.io subsys=k8s
2023-06-29T15:17:22.623338202Z level=info msg="Creating CRD (CustomResourceDefinition)..." name=ciliumnodes.cilium.io subsys=k8s
2023-06-29T15:17:22.623352609Z level=info msg="Creating CRD (CustomResourceDefinition)..." name=ciliumexternalworkloads.cilium.io subsys=k8s
2023-06-29T15:17:22.623356881Z level=info msg="Creating CRD (CustomResourceDefinition)..." name=ciliumpodippools.cilium.io subsys=k8s
2023-06-29T15:17:22.625429583Z level=info msg="Creating CRD (CustomResourceDefinition)..." name=ciliumloadbalancerippools.cilium.io subsys=k8s
2023-06-29T15:17:22.625447211Z level=info msg="Creating CRD (CustomResourceDefinition)..." name=ciliumnodeconfigs.cilium.io subsys=k8s
2023-06-29T15:17:23.793431828Z level=info msg="Waited for 1.167053896s due to client-side throttling, not priority and fairness, request: POST:https://10.96.0.1:443/apis/apiextensions.k8s.io/v1/customresourcedefinitions" subsys=klog
2023-06-29T15:17:24.007795242Z level=info msg="Creating CRD (CustomResourceDefinition)..." name=ciliumendpoints.cilium.io subsys=k8s
2023-06-29T15:17:24.201024783Z level=info msg="Creating CRD (CustomResourceDefinition)..." name=ciliumclusterwidenetworkpolicies.cilium.io subsys=k8s
2023-06-29T15:17:24.400235263Z level=info msg="Creating CRD (CustomResourceDefinition)..." name=ciliumnetworkpolicies.cilium.io subsys=k8s
The rate limiting algorithm slowed down the creation of the last 3 CRDs, and CCNP is among them. But again, the agent does not rely on timing, so this shouldn't be a problem.
One other thing to note is that the operator reports the CCNP CRD creation time as 2023-06-29T15:17:24.007795242Z, which is before the agent log reports "All Cilium CRDs have been found and are available" at 2023-06-29T15:17:26.748901129Z. But I don't know how reliable it is to compare timestamps between two different pods.
Another thing we checked is whether the CCNP CRD somehow got deleted after the agent waited for the CRD creation, but I cannot find any CRD deletion in the entire codebase (and I'm not aware of anything like that).
I also tried a quick repro with a local kind cluster, adding an artificial `time.Sleep` in the operator to slow down the CCNP CRD creation, but the agent waited for it before starting the watchers.
Update: I was wrongly focusing on CNP while the error is about CCNP. Anyway, it does not change the analysis, nor the conclusions. Fixed.
I grepped for the error message and then discovered this:
So this might actually not be a new issue. It's just that our `check-log-errors` does not have a `badLogMessages` exception like our old CI does: https://github.com/cilium/cilium/issues/16425
Thanks to Sebastian's hint we looked at the CRD sync logic a bit more critically: if the issue has been present for that long, it predates modularization and is probably just some improbable race. And indeed, we believe we have a potential race between CRD creation and the CRD being ready to be watched.
In the daemon code which waits for CRDs, https://github.com/cilium/cilium/blob/1a45357601326e10d86bcf0d1e11a3d982f359fe/pkg/k8s/synced/crd.go#L95, there's an optimization to avoid fetching the full objects. As a result, we don't have access to the `Status` field of the CRD, and hence don't know whether the CRD already has the `Established` condition, which seems to indicate that K8s is ready to be queried about this CRD.
Compare this with other code which waits for CRD registration: https://github.com/cilium/cilium/blob/1a45357601326e10d86bcf0d1e11a3d982f359fe/pkg/k8s/apis/cilium.io/client/register.go#L399 in our code base, or a helm PR fixing a similar race, both of which explicitly wait for the `Established` condition to be true.
We thus believe the above error occurs if the daemon proceeds to start an informer for a CRD which was added, but isn't yet ready to be watched (i.e., doesn't yet have the `Established` status condition).
It's unclear to me whether we can change the custom CRD getter to also include the status field (and whether that doesn't defeat the purpose of the optimization), hence paging @christarazi who committed the optimized version of the CRD getter. The current hypothesis is that waiting for the correct status condition to be established would solve this race, but it's not obvious to me whether that would cause a performance regression.
To close the issue - https://github.com/cilium/cilium-cli/pull/1797.
@bimmlerd Thanks for the investigation along with @pippolo84.
The original motivation behind the optimized fetch of the CRDs is to prevent all nodes from having to download the entire CRD object (see the sizes of the files under examples/crd/... for a rough estimate), or rather, flipped around, to prevent the kube-apiserver from having to send full CRD objects to all nodes.
The way the optimization works is by stripping the data sent from the apiserver down to a minimum, using the `PartialObjectMetadata` structure. With this, we've lost the actual `CustomResourceDefinition` object, so we don't have access to the status field.
Reversing the optimization may introduce a period of heavy stress on the apiserver as Cilium starts up, especially in a large cluster. While we don't have data to compare against, reversing this optimization just to fix this occasional flake does not seem worth doing. What Martynas proposed seems like a reasonable alternative, especially given that the exception existed in the Ginkgo suite.
If we find a way to include `CustomResourceDefinitionStatus` while keeping the `PartialObjectMetadata` structure, that'd be great, but I'm not optimistic that it's possible.
TL;DR, solving the flake as suggested by Martynas is correct, since the agent recovers on its own.
Thanks @christarazi for the additional context, I agree that removing the optimization is not worth it. As Sebastian already mentioned, the agent recovers from initially failing to watch, thanks to the resource's informer eventually calling down to this function. In our case, the call goes to some default error handler (I didn't see a way to override it, unfortunately), which causes the error log but doesn't stop the informer. After some backoff the watch succeeds, and the agent has recovered.
CI failure
Observed twice in [Setup & Test (1, 4.19-20230420.212204, iptables, false, vxlan)](https://github.com/cilium/cilium/actions/runs/5401014338/jobs/9810263072#logs)
Edit: Sysdump https://github.com/cilium/cilium/files/11935233/cilium-sysdump-1-final.zip