cilium / cilium-service-mesh-beta

Instructions and issue tracking for Service Mesh capabilities of Cilium
Apache License 2.0

Invalid CiliumEnvoyConfig interrupts reconciliation for all types, prevents cilium-agent startup #27

Closed sethp-verica closed 2 years ago

sethp-verica commented 2 years ago

What happened?

I created a CiliumEnvoyConfig custom resource containing a resource type that wasn't recognized, and reconciliation stopped. Specifically: listing/watching for CiliumEnvoyConfig resources started failing at the deserialization step. This had the further consequence of causing a (very slow) cilium-agent failure mode where it would block on the "waiting for caches" step of startup for 5 minutes before crashing.

This is reproducible on any Kubernetes cluster running the service-mesh-beta (WARNING: this will break updates for service endpoints, etc.!):

kubectl apply -f - <<EOF
apiVersion: cilium.io/v2alpha1
kind: CiliumEnvoyConfig
metadata:
  name: bad-config
spec:
  resources:
    - "@type": type.googleapis.com/envoy.config.listener.v3.Listenerz
EOF

kubectl exec -n kube-system ds/cilium -- kill 1

The second command forces one of the cilium-agent pods to restart, which puts it in the pre-cache-sync state.

Service can be restored with:

kubectl delete ciliumenvoyconfig bad-config

Cilium Version

This happened when using the software from the tip of the beta/service-mesh branch.

Kernel Version

$ uname -a
Linux d3abe6123727 5.10.104-linuxkit #1 SMP Wed Mar 9 19:05:23 UTC 2022 x86_64 GNU/Linux

Kubernetes Version

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"092fbfbf53427de67cac1e9fa54aaa09a28371d7", GitTreeState:"clean", BuildDate:"2021-06-16T12:59:11Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-21T23:01:33Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}

Sysdump

Happy to grab this if it's helpful

Relevant log output

level=warning msg="github.com/cilium/cilium/pkg/k8s/watchers/cilium_envoy_config.go:92: failed to list *v2alpha1.CiliumEnvoyConfig: proto: (line 1:10): unable to resolve \"type.googleapis.com/envoy.config.listener.v3.Listenerz\": \"not found\"" subsys=klog
level=error msg=k8sError error="github.com/cilium/cilium/pkg/k8s/watchers/cilium_envoy_config.go:92: Failed to watch *v2alpha1.CiliumEnvoyConfig: failed to list *v2alpha1.CiliumEnvoyConfig: proto: (line 1:10): unable to resolve \"type.googleapis.com/envoy.config.listener.v3.Listenerz\": \"not found\"" subsys=k8s

... (something like 5 minutes later) ...

level=fatal msg="Timed out waiting for pre-existing resources to be received; exiting" subsys=k8s-watcher

Anything else?

Two half-baked ideas I had about how to address this, if they're useful:

  1. Set up a validation webhook that checks whether the config can be recognized by the various & sundry protobuf types at creation/update time. That has the main advantage of giving direct feedback "what you're trying to do won't work", but the main disadvantage that admission control isn't infallible (especially across versions) so it's maybe not sufficient on its own.
  2. Defer deserialization of the envoy-specific resources (via json.RawMessage or similar) to separate the kubernetes resource listing/watching from the envoy-specific config details. The main downside here is that the system seems to accept the instance of the CR but then doesn't configure envoy with it, and it wasn't clear to me where I'd go to see if convergence was successful or not (I looked for e.g. status.conditions on the CR and events before getting lucky with finding/identifying agent logs).

In either case, a few more events re: the CiliumEnvoyConfig feel like they would've been useful to help me identify what part of the system was having trouble. I was comparing against a working config, so even something as simple as a "ConfigAccepted" event would have helped me narrow down that it was a config-specific problem.

jrajahalme commented 2 years ago

@sethp-verica Thanks for the bug report! This is now fixed in https://github.com/cilium/cilium/pull/18894, which I hope merges to Cilium master soon. The fix skips over any Envoy resources that can't be parsed and logs them at warning level in the Cilium agent logs. The rest of the resources are applied as if the failing resource did not exist.