Closed sethp-verica closed 2 years ago
@sethp-verica Thanks for the bug report! This is now fixed in https://github.com/cilium/cilium/pull/18894, which I hope merges to Cilium master soon. The fix skips over all the Envoy resources that can't be parsed and logs them at warning level in the Cilium agent logs. The rest of the resources are applied as if the failing resource did not exist.
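For readers following along, the skip-and-warn behaviour described above can be sketched roughly as follows. This is not the code from the linked PR. It is a minimal illustration that assumes resources arrive as JSON with an "@type" field, and the names resource, knownTypes and parseResources are made up for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
)

// resource is a hypothetical, simplified stand-in for one Envoy resource.
// The real resources are protobuf messages, not this struct.
type resource struct {
	Type string `json:"@type"`
	Name string `json:"name"`
}

// knownTypes is an illustrative allow-list of types this sketch can handle.
var knownTypes = map[string]bool{
	"type.googleapis.com/envoy.config.listener.v3.Listener": true,
	"type.googleapis.com/envoy.config.cluster.v3.Cluster":   true,
}

// parseResources decodes each raw resource independently. A resource that
// fails to decode, or whose type is not recognized, is logged as a warning
// and skipped, so the remaining resources are still returned.
func parseResources(raw [][]byte) (parsed []resource, skipped int) {
	for i, r := range raw {
		var res resource
		if err := json.Unmarshal(r, &res); err != nil {
			log.Printf("warning: skipping resource %d: %v", i, err)
			skipped++
			continue
		}
		if !knownTypes[res.Type] {
			log.Printf("warning: skipping resource %d: unknown type %q", i, res.Type)
			skipped++
			continue
		}
		parsed = append(parsed, res)
	}
	return parsed, skipped
}

func main() {
	raw := [][]byte{
		[]byte(`{"@type":"type.googleapis.com/envoy.config.listener.v3.Listener","name":"l1"}`),
		[]byte(`{"@type":"type.googleapis.com/example.NotARealType","name":"bad"}`),
		[]byte(`{"@type":"type.googleapis.com/envoy.config.cluster.v3.Cluster","name":"c1"}`),
	}
	parsed, skipped := parseResources(raw)
	fmt.Printf("applied %d resources, skipped %d\n", len(parsed), skipped)
}
```

The point of the sketch is only that one bad entry increments a counter and produces a log line instead of failing the whole batch.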
Is there an existing issue for this?
What happened?
I created a CRD with a resource type that wasn't recognized, and reconciliation stopped. Specifically: listing/watching for CiliumEnvoyConfig resources started failing at the deserialization step. This had the further consequence of causing a (very slow) cilium-agent failure mode where it'd block on the "waiting for caches" step of startup for 5 minutes before crashing.
This is reproducible on any kubernetes cluster running the service-mesh-beta (WARNING: will break updates for service endpoints &c!):
The second command forces a restart on some cilium agent to put it in the pre-cache-sync state.
Service can be restored with:
Cilium Version
This happened when using the software from the tip of the beta/service-mesh branch.
Kernel Version
Kubernetes Version
Sysdump
Happy to grab this if it's helpful
Relevant log output
Anything else?
Two half-baked ideas I had about how to address this, if they're useful:
json.RawMessage or similar) to separate the kubernetes resource listing/watching from the envoy-specific config details. The main downside here is that the system seems to accept the instance of the CR but then doesn't configure envoy with it, and it wasn't clear to me where I'd go to see if convergence was successful or not (I looked for e.g. status.conditions on the CR and events before getting lucky with finding/identifying agent logs).
In either case, a few more events re: the CiliumEnvoyConfig feel like they would've been useful to help me identify what part of the system was having trouble. I was comparing against a working config, so even something as simple as a "ConfigAccepted" event would have helped me narrow down that it was a config-specific problem.
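To make the json.RawMessage idea above a bit more concrete, here is a rough sketch of what deferring the Envoy-specific decoding might look like. The struct shape (envoyConfig, typed) and the helpers decodeOuter and checkResources are assumptions for illustration and are not taken from the Cilium source. The outer object decodes regardless of what the individual resources contain, and the per-resource problems are collected so they could be surfaced somewhere visible (an event or a status condition, say) rather than only in agent logs.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// envoyConfig is a hypothetical, trimmed-down shape for a
// CiliumEnvoyConfig-like object. Resources stay as raw JSON, so decoding
// the envelope does not depend on any individual resource being valid.
type envoyConfig struct {
	Metadata struct {
		Name string `json:"name"`
	} `json:"metadata"`
	Spec struct {
		Resources []json.RawMessage `json:"resources"`
	} `json:"spec"`
}

// typed is the minimal view of a resource needed to inspect its type.
type typed struct {
	Type string `json:"@type"`
}

// decodeOuter parses only the Kubernetes-level envelope.
func decodeOuter(data []byte) (envoyConfig, error) {
	var cfg envoyConfig
	err := json.Unmarshal(data, &cfg)
	return cfg, err
}

// checkResources performs the deferred, Envoy-specific step and returns one
// human-readable problem per resource that cannot be handled.
func checkResources(cfg envoyConfig, supported map[string]bool) []string {
	var problems []string
	for i, r := range cfg.Spec.Resources {
		var t typed
		if err := json.Unmarshal(r, &t); err != nil {
			problems = append(problems, fmt.Sprintf("resources[%d]: %v", i, err))
			continue
		}
		if !supported[t.Type] {
			problems = append(problems, fmt.Sprintf("resources[%d]: unsupported type %q", i, t.Type))
		}
	}
	return problems
}

func main() {
	data := []byte(`{"metadata":{"name":"demo"},"spec":{"resources":[
		{"@type":"type.googleapis.com/envoy.config.listener.v3.Listener","name":"l1"},
		{"@type":"type.googleapis.com/example.NotARealType"}]}}`)

	cfg, err := decodeOuter(data)
	if err != nil {
		fmt.Println("envelope failed to decode:", err)
		return
	}
	supported := map[string]bool{
		"type.googleapis.com/envoy.config.listener.v3.Listener": true,
	}
	// The envelope decoded fine; only now are the resources examined.
	for _, p := range checkResources(cfg, supported) {
		fmt.Printf("%s: %s\n", cfg.Metadata.Name, p)
	}
}
```

The trade-off is the one already mentioned: the object is accepted at the API level even when a resource is unusable, so the collected problems need to be reported back to the user in some way.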
Code of Conduct