cilium / cilium

eBPF-based Networking, Security, and Observability
https://cilium.io
Apache License 2.0

CI: ci-e2e: check-log-errors: `level=error msg="Failed to watch *v2.CiliumClusterwideNetworkPolicy": the server could not find the requested resource` #26591

Open gandro opened 1 year ago

gandro commented 1 year ago

CI failure

Observed twice in [Setup & Test (1, 4.19-20230420.212204, iptables, false, vxlan)](https://github.com/cilium/cilium/actions/runs/5401014338/jobs/9810263072#logs)

Found "2023-06-28T12:27:57.917020273Z level=error msg=k8sError error=\"github.com/cilium/cilium/pkg/k8s/resource/resource.go:304: Failed to watch *v2.CiliumClusterwideNetworkPolicy: failed to list *v2.CiliumClusterwideNetworkPolicy: the server could not find the requested resource (get ciliumclusterwidenetworkpolicies.cilium.io)\" subsys=k8s" in logs 1 times

  ❌ Found 1 logs matching list of errors that must be investigated:
2023-06-28T12:27:57.917020273Z level=error msg=k8sError error="github.com/cilium/cilium/pkg/k8s/resource/resource.go:304: Failed to watch *v2.CiliumClusterwideNetworkPolicy: failed to list *v2.CiliumClusterwideNetworkPolicy: the server could not find the requested resource (get ciliumclusterwidenetworkpolicies.cilium.io)" subsys=k8s

πŸ“‹ Test Report
❌ 1/51 tests failed (0/549 actions), 10 tests skipped, 1 scenarios skipped:
Test [check-log-errors]:
connectivity test failed: 1 tests failed
Error: Process completed with exit code 1.

Edit: Sysdump https://github.com/cilium/cilium/files/11935233/cilium-sysdump-1-final.zip

brb commented 1 year ago

Unfortunately, the sysdump is missing. Most likely it's a race (the operator didn't finish creating the CRD type), and thus a red herring. cc @cilium/sig-k8s

gandro commented 1 year ago

I have a local copy of the sysdump (it seems it has been overwritten since I restarted the run; I also cannot find the link anymore):

cilium-sysdump-1-final.zip

I only did a superficial triage, but I also think that this is most likely a race. The message should probably not be logged at error level if we are retrying (which the agent seems to do here, i.e. I later see it handling CNPs correctly).

christarazi commented 1 year ago

I think this might be a recent regression because, AFAIK, no Cilium code that requires the CCNP / CNP CRD to exist should proceed until Cilium has waited for all CRDs to be created. So I suspect some ordering of operations was subtly changed, cc @cilium/sig-foundations.

pippolo84 commented 1 year ago

I took a look at the issue together with @bimmlerd.

The CRD creation was recently moved to its own cell in the operator, and it is now the first one invoked after the leader election. Before this, it was executed right before the leader election started. This is suspicious, but it shouldn't be a problem, since the agent blocks here:

https://github.com/cilium/cilium/blob/c068a443049b5dedb34608069df4b1721b1e4669/daemon/cmd/daemon.go#L864

until all CRDs are registered, and CCNP is listed among them. That algorithm looks correct and has been the same for a long time.

The error in the agent log comes from the Events() method called in cilium_network_policy.go, which starts the CCNP informer. But the agent's k8s watchers are started in InitK8sSubsystem, which runs after the blocking wait above.
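
To make the ordering argument concrete, here is a minimal, self-contained sketch of the startup sequence described above; `waitForCRDs` and `startK8sWatchers` are illustrative placeholders, not the actual daemon symbols:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"
)

// waitForCRDs stands in for the blocking wait in daemon/cmd/daemon.go: it
// returns only once all required Cilium CRDs (including CCNP) have been
// observed as registered, or the context expires.
func waitForCRDs(ctx context.Context) error {
	select {
	case <-time.After(100 * time.Millisecond): // placeholder for the real sync
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// startK8sWatchers stands in for InitK8sSubsystem, which starts the
// informers (among them the CCNP one that logged the error).
func startK8sWatchers(ctx context.Context) {
	fmt.Println("starting k8s watchers")
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Watchers are only started after the CRD wait returns, so the operator
	// being slow to create CRDs should not matter by itself.
	if err := waitForCRDs(ctx); err != nil {
		log.Fatalf("timed out waiting for Cilium CRDs: %v", err)
	}
	startK8sWatchers(ctx)
}
```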

One thing to note is the timing in the operator log:

2023-06-29T15:17:22.586718280Z level=info msg="Leading the operator HA deployment" subsys=cilium-operator-generic
2023-06-29T15:17:22.598339583Z level=info msg="Creating CRD (CustomResourceDefinition)..." name=ciliuml2announcementpolicies.cilium.io subsys=k8s
2023-06-29T15:17:22.623310757Z level=info msg="Creating CRD (CustomResourceDefinition)..." name=ciliumcidrgroups.cilium.io subsys=k8s
2023-06-29T15:17:22.623333202Z level=info msg="Creating CRD (CustomResourceDefinition)..." name=ciliumidentities.cilium.io subsys=k8s
2023-06-29T15:17:22.623338202Z level=info msg="Creating CRD (CustomResourceDefinition)..." name=ciliumnodes.cilium.io subsys=k8s
2023-06-29T15:17:22.623352609Z level=info msg="Creating CRD (CustomResourceDefinition)..." name=ciliumexternalworkloads.cilium.io subsys=k8s
2023-06-29T15:17:22.623356881Z level=info msg="Creating CRD (CustomResourceDefinition)..." name=ciliumpodippools.cilium.io subsys=k8s
2023-06-29T15:17:22.625429583Z level=info msg="Creating CRD (CustomResourceDefinition)..." name=ciliumloadbalancerippools.cilium.io subsys=k8s
2023-06-29T15:17:22.625447211Z level=info msg="Creating CRD (CustomResourceDefinition)..." name=ciliumnodeconfigs.cilium.io subsys=k8s
2023-06-29T15:17:23.793431828Z level=info msg="Waited for 1.167053896s due to client-side throttling, not priority and fairness, request: POST:https://10.96.0.1:443/apis/apiextensions.k8s.io/v1/customresourcedefinitions" subsys=klog
2023-06-29T15:17:24.007795242Z level=info msg="Creating CRD (CustomResourceDefinition)..." name=ciliumendpoints.cilium.io subsys=k8s
2023-06-29T15:17:24.201024783Z level=info msg="Creating CRD (CustomResourceDefinition)..." name=ciliumclusterwidenetworkpolicies.cilium.io subsys=k8s
2023-06-29T15:17:24.400235263Z level=info msg="Creating CRD (CustomResourceDefinition)..." name=ciliumnetworkpolicies.cilium.io subsys=k8s

The client-side rate limiting slowed down the creation of the last 3 CRDs, and CCNP is among them. But again, the agent does not rely on timing, so this shouldn't be a problem.
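
For context, the "client-side throttling" line comes from client-go's default client-side rate limiter, which is configured through the REST config's QPS and Burst fields. A minimal sketch (the values are made up, not the operator's actual settings):

```go
package main

import (
	"fmt"

	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load whatever kubeconfig is available; the real operator would use
	// in-cluster config instead.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}

	// client-go throttles requests on the client side once the burst is
	// exhausted, which is what produced the "Waited for 1.167053896s due to
	// client-side throttling" log line above. Example values only.
	cfg.QPS = 5
	cfg.Burst = 10

	// Any clientset built from cfg inherits these limits.
	fmt.Printf("client-side rate limit: qps=%v burst=%d\n", cfg.QPS, cfg.Burst)
}
```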

One other thing to note is that the operator reports the CCNP CRD creation at 2023-06-29T15:17:24.201024783Z, which is before the agent log reports "All Cilium CRDs have been found and are available" at 2023-06-29T15:17:26.748901129Z. But I don't know how reliable comparing timestamps between two different pods is.

Another thing we checked is whether the CCNP CRD somehow got deleted after the agent waited for CRD creation, but I cannot find any CRD deletion in the entire codebase (and I'm not aware of anything like that).

I also tried a quick repro with a local kind cluster, adding an artificial time.Sleep in the operator to slow down the CCNP CRD creation, but the agent waited for it before starting the watchers.

Update: I was wrongly focusing on CNP while the error is about CCNP. This changes neither the analysis nor the conclusions. Fixed.

gandro commented 1 year ago

I grepped for the error message and then discovered this:

https://github.com/cilium/cilium/blob/041036be1c95d352f6f3f19323dc5ec17537273a/test/helpers/cons.go#L313-L315

So this might actually not be a new issue; it's just that our check-log-errors test does not have a badLogMessages exception like our old CI does: https://github.com/cilium/cilium/issues/16425
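
For illustration, an exception list in the spirit of the old suite's badLogMessages could look roughly like this; it is a sketch, not the actual contents of test/helpers/cons.go or of the cilium-cli check:

```go
package main

import (
	"fmt"
	"strings"
)

// knownFailureMessages is an illustrative allow-list of log lines that are
// known to be benign and should not fail the check-log-errors test. The real
// lists live in test/helpers/cons.go (Ginkgo) and in cilium-cli.
var knownFailureMessages = []string{
	// Benign: the informer retries and recovers once the CRD is established.
	"Failed to watch *v2.CiliumClusterwideNetworkPolicy",
}

func isKnownFailure(logLine string) bool {
	for _, msg := range knownFailureMessages {
		if strings.Contains(logLine, msg) {
			return true
		}
	}
	return false
}

func main() {
	line := `level=error msg=k8sError error="Failed to watch *v2.CiliumClusterwideNetworkPolicy: ..."`
	fmt.Println(isKnownFailure(line)) // true
}
```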

bimmlerd commented 1 year ago

Thanks to Sebastian's hint, we looked at the CRD sync logic a bit more critically: if the issue has been present for that long, it predates modularization and is probably just some improbable race. And indeed, we believe there is a potential race between CRD creation and the CRD being ready to be watched.

In the daemon code that waits for CRDs, https://github.com/cilium/cilium/blob/1a45357601326e10d86bcf0d1e11a3d982f359fe/pkg/k8s/synced/crd.go#L95, there's an optimization to avoid fetching the full objects. As a result, we don't have access to the CRD's Status field, and hence don't know whether it already has the Established condition, which seems to indicate that K8s is ready to be queried about this CRD.

Compare this with other code that waits for CRD registration in our code base, https://github.com/cilium/cilium/blob/1a45357601326e10d86bcf0d1e11a3d982f359fe/pkg/k8s/apis/cilium.io/client/register.go#L399, or with a Helm PR fixing a similar race, both of which explicitly wait for the Established condition to be true.
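
For reference, a minimal sketch of waiting for the Established condition using the apiextensions clientset; it mirrors the linked register.go logic in spirit only and is not Cilium's actual code:

```go
package main

import (
	"context"
	"fmt"
	"time"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	apiextensionsclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/tools/clientcmd"
)

// waitForCRDEstablished polls the full CRD object until the Established
// condition is True, i.e. until the apiserver is ready to serve the new
// resource. This requires fetching the full CRD, not PartialObjectMetadata.
func waitForCRDEstablished(ctx context.Context, client apiextensionsclient.Interface, name string) error {
	return wait.PollUntilContextTimeout(ctx, 250*time.Millisecond, 30*time.Second, true,
		func(ctx context.Context) (bool, error) {
			crd, err := client.ApiextensionsV1().CustomResourceDefinitions().Get(ctx, name, metav1.GetOptions{})
			if err != nil {
				return false, nil // not created yet; keep polling
			}
			for _, cond := range crd.Status.Conditions {
				if cond.Type == apiextensionsv1.Established && cond.Status == apiextensionsv1.ConditionTrue {
					return true, nil
				}
			}
			return false, nil
		})
}

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := apiextensionsclient.NewForConfigOrDie(cfg)

	if err := waitForCRDEstablished(context.Background(), client,
		"ciliumclusterwidenetworkpolicies.cilium.io"); err != nil {
		panic(err)
	}
	fmt.Println("CRD is established and can be watched")
}
```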

We thus believe the above error occurs when the daemon proceeds to start an informer for a CRD that has been added but isn't yet ready to be watched (i.e., doesn't have the Established condition in its status).

It's unclear to me whether we can change the custom CRD getter to also include the status field (and whether that would defeat the purpose of the optimization), hence paging @christarazi, who committed the optimized version of the CRD getter. The current hypothesis is that waiting for the Established condition would solve this race, but it's not obvious to me whether that would cause a performance regression.

brb commented 1 year ago

To close the issue - https://github.com/cilium/cilium-cli/pull/1797.

christarazi commented 1 year ago

@bimmlerd Thanks for the investigation along with @pippolo84.

The original motivation behind the optimized fetch of the CRDs was to prevent every node from having to download the entire CRD object (see the sizes of the files under examples/crd/... for a rough estimate), or, flipped around, to prevent the kube-apiserver from having to send full CRD objects to all nodes.

The optimization works by stripping the data sent from the apiserver down to a minimum using the PartialObjectMetadata structure. With this, we lose the actual CustomResourceDefinition object, so we don't have access to the status field.
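
To illustrate what the stripped-down fetch looks like, here is a sketch using client-go's metadata client (not the exact Cilium code path): the apiserver returns only PartialObjectMetadata, so there is no Status to inspect.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/metadata"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}

	// The metadata client returns metav1.PartialObjectMetadata objects:
	// the apiserver only sends TypeMeta + ObjectMeta, which is much cheaper
	// than the full (and large) CustomResourceDefinition objects.
	mc := metadata.NewForConfigOrDie(cfg)
	crdGVR := schema.GroupVersionResource{
		Group:    "apiextensions.k8s.io",
		Version:  "v1",
		Resource: "customresourcedefinitions",
	}

	list, err := mc.Resource(crdGVR).List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, item := range list.Items {
		// item is a metav1.PartialObjectMetadata: we can see the CRD's name,
		// but there is no Status field to check for the Established condition.
		fmt.Println(item.GetName())
	}
}
```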

Reversing the optimization may introduce a period of heavy stress on the apiserver as Cilium starts up, especially in a large cluster. While we don't have data to compare against, reversing it just to fix this occasional flake does not seem worth doing. What Martynas proposed seems like a reasonable alternative, especially given that the exception already existed in the Ginkgo suite.

If we find a way to include CustomResourceDefinitionStatus while keeping the PartialObjectMetadata structure, that'd be great, but I'm not optimistic that it's possible.

bimmlerd commented 1 year ago

TL;DR: solving the flake as Martynas suggested is correct, since the agent recovers on its own.

Thanks @christarazi for the additional context; I agree that removing the optimization is not worth it. As Sebastian already mentioned, the agent recovers from initially failing to watch, thanks to the resource's informer eventually calling down to this function. In our case, the call goes to some default error handler (I didn't see a way to override it, unfortunately), which causes the error log but doesn't stop the informer. After some backoff the watch succeeds, and the agent has recovered.
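
For context, plain client-go shared informers do expose a SetWatchErrorHandler hook; whether it is reachable through Cilium's resource abstraction is a separate question. A minimal sketch, using a Pod informer purely for illustration:

```go
package main

import (
	"context"
	"log"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(cfg)

	// A plain Pod informer just to demonstrate the hook; the CCNP informer in
	// Cilium is built through its own resource abstraction.
	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	informer := factory.Core().V1().Pods().Informer()

	// SetWatchErrorHandler replaces client-go's default watch error handling.
	// The reflector still retries with backoff either way; only the logging
	// changes. It must be called before the informer is started.
	if err := informer.SetWatchErrorHandler(func(r *cache.Reflector, err error) {
		log.Printf("watch temporarily failing (will retry): %v", err)
	}); err != nil {
		panic(err)
	}

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	go informer.Run(ctx.Done())
	cache.WaitForCacheSync(ctx.Done(), informer.HasSynced)
	log.Println("informer synced; watch errors now go through the custom handler")
}
```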

thorn3r commented 2 months ago

Hit a similar one: https://github.com/cilium/cilium/actions/runs/9598146987/job/26468777743 PR: https://github.com/cilium/cilium/pull/33240 Sysdump: cilium-sysdump-7-final.zip