cilium / cilium

eBPF-based Networking, Security, and Observability
https://cilium.io
Apache License 2.0

Cilium stops serving network traffic when Kubernetes control-plane nodes are replaced #31033

Closed cablunar closed 6 months ago

cablunar commented 8 months ago

Is there an existing issue for this?

What happened?

  1. A rolling replacement of the control-plane nodes is started, creating new nodes that join the cluster
  2. When the last control-plane node is replaced, the network partially stops working
  3. Many i/o timeout errors appear in the logs of running pods connecting to external and internal endpoints
  4. Requests to external Ingress objects return 503 or 504 error messages served by nginx

Debugging on Cilium:

  1. CPU usage increases during the incident (see attached image)
  2. The mapSync and policyCalculation times increase (see attached image; a sketch for inspecting these metrics follows below)
  3. Endpoint update activity increases (see attached image)
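
The regeneration metrics called out above can also be read straight from the agent without going through Prometheus; a minimal sketch, assuming Cilium runs in kube-system and the agent container is named cilium-agent:

    # Dump the agent's endpoint regeneration time stats (scopes include mapSync and policyCalculation)
    kubectl -n kube-system exec ds/cilium -c cilium-agent -- \
      cilium metrics list | grep endpoint_regeneration_time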

Cilium Version

1.14.7, installed from the Cilium Helm chart with the following parameters (a full install command sketch follows below):

    --set ipam.operator.clusterPoolIPv4PodCIDRList="10.16.0.0/16" \
    --set hubble.enabled=true \
    --set dnsProxy.endpointMaxIpPerHostname=100 \
    --set hubble.listenAddress=":4244" \
    --set hubble.relay.enabled=true \
    --set hubble.ui.enabled=true \
    --set prometheus.enabled=true \
    --set operator.prometheus.enabled=true \
    --set hubble.metrics.enableOpenMetrics=true \
    --set hubble.metrics.enabled="{dns,drop,tcp,flow,port-distribution,icmp,httpV2:exemplars=true;labelsContext=source_ip\,source_namespace\,source_workload\,destination_ip\,destination_namespace\,destination_workload\,traffic_direction}" \
    --set dnsProxy.proxyResponseMaxDelay="200ms" \
    --set bpf.policyMapMax=32768 \
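
For reference, these flags are passed to a Helm install along these lines; the repo alias, release name, and namespace below are the usual defaults and are assumptions, not copied from the actual deployment:

    # Hypothetical wrapper around the --set flags listed above
    helm repo add cilium https://helm.cilium.io/
    helm upgrade --install cilium cilium/cilium \
      --version 1.14.7 \
      --namespace kube-system \
      --set ipam.operator.clusterPoolIPv4PodCIDRList="10.16.0.0/16"
    # ...append the remaining --set flags exactly as listed above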

Kernel Version

Linux ip-10-0-39-88 5.15.136-flatcar

Kubernetes Version

1.28.3

Regression

No response

Sysdump

🔍 Collecting sysdump with cilium-cli version: v0.12.3, args: [sysdump]
🔍 Collecting Kubernetes nodes
🔍 Collect Kubernetes nodes
🔍 Collecting Kubernetes events
🔍 Collecting Kubernetes pods
🔍 Collecting Kubernetes services
🔍 Collecting Kubernetes namespaces
🔍 Collect Kubernetes version
🔍 Collecting Kubernetes pods summary
🔍 Collecting Kubernetes endpoints
🔍 Collecting Kubernetes network policies
🔍 Collecting Cilium cluster-wide network policies
🔍 Collecting Cilium network policies
🔍 Collecting Cilium local redirect policies
🔍 Collecting Cilium egress NAT policies
🔍 Collecting Cilium identities
🔍 Collecting Cilium endpoints
🔍 Collecting Cilium nodes
🔍 Collecting Ingresses
🔍 Collecting CiliumClusterwideEnvoyConfigs
🔍 Collecting CiliumEnvoyConfigs
🔍 Collecting Cilium etcd secret
🔍 Collecting the Cilium configuration
🔍 Collecting the Cilium daemonset(s)
🔍 Collecting the Hubble daemonset
🔍 Collecting the Hubble Relay configuration
🔍 Collecting the Hubble Relay deployment
🔍 Collecting the Cilium operator deployment
🔍 Collecting the Hubble UI deployment
🔍 Collecting the 'clustermesh-apiserver' deployment
🔍 Collecting the CNI configuration files from Cilium pods
🔍 Collecting the CNI configmap
🔍 Collecting gops stats from Cilium pods
🔍 Collecting gops stats from Hubble pods
🔍 Collecting gops stats from Hubble Relay pods
⚠️ Deployment "clustermesh-apiserver" not found in namespace "kube-system" - this is expected if 'clustermesh-apiserver' isn't enabled
🔍 Collecting bugtool output from Cilium pods
🔍 Collecting logs from Cilium pods
🔍 Collecting logs from 'clustermesh-apiserver' pods
🔍 Collecting logs from Cilium operator pods
🔍 Collecting logs from Hubble pods
🔍 Collecting logs from Hubble Relay pods
🔍 Collecting logs from Hubble UI pods
🔍 Collecting bugtool output from Tetragon pods
🔍 Collecting platform-specific data
🔍 Collecting kvstore data
🔍 Collecting Hubble flows from Cilium pods
⚠️ The following tasks failed, the sysdump may be incomplete:
⚠️ [11] Collecting Cilium egress NAT policies: failed to collect Cilium egress NAT policies: the server could not find the requested resource (get ciliumegressnatpolicies.cilium.io)
⚠️ [12] Collecting Cilium local redirect policies: failed to collect Cilium local redirect policies: the server could not find the requested resource (get ciliumlocalredirectpolicies.cilium.io)
⚠️ [17] Collecting CiliumClusterwideEnvoyConfigs: failed to collect CiliumClusterwideEnvoyConfigs: the server could not find the requested resource (get ciliumclusterwideenvoyconfigs.cilium.io)
⚠️ [18] Collecting CiliumEnvoyConfigs: failed to collect CiliumEnvoyConfigs: the server could not find the requested resource (get ciliumenvoyconfigs.cilium.io)
⚠️ cniconflist-cilium-t6tpd: error dialing backend: dial tcp 10.1.17.193:10250: i/o timeout
⚠️ cilium-bugtool-cilium-t6tpd: failed to collect 'cilium-bugtool' output for "cilium-t6tpd" in namespace "kube-system": error dialing backend: dial tcp 10.1.17.193:10250: i/o timeout
⚠️ hubble-flows-cilium-t6tpd: failed to collect hubble flows for "cilium-t6tpd" in namespace "kube-system": error dialing backend: dial tcp 10.1.17.193:10250: i/o timeout
⚠️ gops-cilium-t6tpd-memstats: failed to list processes "cilium-t6tpd" ("cilium-agent") in namespace "kube-system": error dialing backend: dial tcp 10.1.17.193:10250: i/o timeout
⚠️ gops-cilium-t6tpd-stack: failed to list processes "cilium-t6tpd" ("cilium-agent") in namespace "kube-system": error dialing backend: dial tcp 10.1.17.193:10250: i/o timeout
⚠️ gops-cilium-t6tpd-stats: failed to list processes "cilium-t6tpd" ("cilium-agent") in namespace "kube-system": error dialing backend: dial tcp 10.1.17.193:10250: i/o timeout
⚠️ logs-cilium-t6tpd-cilium-agent: failed to collect logs for "cilium-t6tpd" ("cilium-agent") in namespace "kube-system": pods "ip-10-1-17-193.eu-west-1.compute.internal" not found
⚠️ logs-cilium-t6tpd-config: failed to collect logs for "cilium-t6tpd" ("config") in namespace "kube-system": pods "ip-10-1-17-193.eu-west-1.compute.internal" not found
⚠️ logs-cilium-t6tpd-mount-cgroup: failed to collect logs for "cilium-t6tpd" ("mount-cgroup") in namespace "kube-system": pods "ip-10-1-17-193.eu-west-1.compute.internal" not found
⚠️ logs-cilium-t6tpd-apply-sysctl-overwrites: failed to collect logs for "cilium-t6tpd" ("apply-sysctl-overwrites") in namespace "kube-system": pods "ip-10-1-17-193.eu-west-1.compute.internal" not found
⚠️ logs-cilium-t6tpd-mount-bpf-fs: failed to collect logs for "cilium-t6tpd" ("mount-bpf-fs") in namespace "kube-system": pods "ip-10-1-17-193.eu-west-1.compute.internal" not found
⚠️ logs-cilium-t6tpd-clean-cilium-state: failed to collect logs for "cilium-t6tpd" ("clean-cilium-state") in namespace "kube-system": pods "ip-10-1-17-193.eu-west-1.compute.internal" not found
⚠️ logs-cilium-t6tpd-install-cni-binaries: failed to collect logs for "cilium-t6tpd" ("install-cni-binaries") in namespace "kube-system": pods "ip-10-1-17-193.eu-west-1.compute.internal" not found
⚠️ Please note that depending on your Cilium version and installation options, this may be expected
🗳 Compiling sysdump

Relevant log output

The timeout errors occur repeatedly during the incident.
IP 10.96.0.1 is the ClusterIP of the default kubernetes Service:
kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP

2024-02-28T14:21:36.880778241Z level=error msg="error retrieving resource lock kube-system/cilium-operator-resource-lock: Get \"https://10.96.0.1:443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/cilium-operator-resource-lock?timeout=5s\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)" subsys=klog
2024-02-28T14:21:45.042062825Z level=warning msg="Network status error received, restarting client connections" error="an error on the server (\"[+]ping ok\\n[+]log ok\\n[-]etcd failed: reason withheld\\n[-]kms-provider-0 failed: reason withheld\\n[+]poststarthook/start-kube-apiserver-admission-initializer ok\\n[+]poststarthook/generic-apiserver-start-informers ok\\n[+]poststarthook/priority-and-fairness-config-consumer ok\\n[+]poststarthook/priority-and-fairness-filter ok\\n[+]poststarthook/storage-object-count-tracker-hook ok\\n[+]poststarthook/start-apiextensions-informers ok\\n[+]poststarthook/start-apiextensions-controllers ok\\n[+]poststarthook/crd-informer-synced ok\\n[+]poststarthook/start-service-ip-repair-controllers ok\\n[+]poststarthook/rbac/bootstrap-roles ok\\n[+]poststarthook/scheduling/bootstrap-system-priority-classes ok\\n[+]poststarthook/priority-and-fairness-config-producer ok\\n[+]poststarthook/start-system-namespaces-controller ok\\n[+]poststarthook/bootstrap-controller ok\\n[+]poststarthook/start-cluster-authentication-info-controller ok\\n[+]poststarthook/start-kube-apiserver-identity-lease-controller ok\\n[+]poststarthook/start-deprecated-kube-apiserver-identity-lease-garbage-collector ok\\n[+]poststarthook/start-kube-apiserver-identity-lease-garbage-collector ok\\n[+]poststarthook/start-legacy-token-tracking-controller ok\\n[+]poststarthook/aggregator-reload-proxy-client-cert ok\\n[+]poststarthook/start-kube-aggregator-informers ok\\n[+]poststarthook/apiservice-registration-controller ok\\n[+]poststarthook/apiservice-status-available-controller ok\\n[+]poststarthook/kube-apiserver-autoregistration ok\\n[+]autoregister-completion ok\\n[+]poststarthook/apiservice-openapi-controller ok\\n[+]poststarthook/apiservice-openapiv3-controller ok\\n[+]poststarthook/apiservice-discovery-controller ok\\nhealthz check failed\") has prevented the request from succeeding (get healthz.meta.k8s.io)" subsys=k8s-client
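
The healthz failure in the second log line can be cross-checked against the apiserver directly; a minimal sketch using generic kubectl health checks (not taken from the report):

    # Verbose health report from the apiserver the kubeconfig points at
    kubectl get --raw '/healthz?verbose'
    # The log above shows "[-]etcd failed", so query that subcheck explicitly
    kubectl get --raw '/healthz/etcd'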

---

Cilium logs the following messages repeatedly during the network outage (a sketch for reproducing the counts follows below):

Timeout waiting for response to forwarded proxied DNS lookup, count=247019
Cannot forward proxied DNS lookup, count=313
unable to queue endpoint build, count=92
Unable to fetch kubernetes labels, count=74
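
The counts above can be reproduced with a grep across the Cilium pods, roughly as follows (a sketch; it assumes the standard k8s-app=cilium label and uses the message prefixes listed above):

    # Count DNS proxy timeouts per Cilium pod
    for pod in $(kubectl -n kube-system get pods -l k8s-app=cilium -o name); do
      echo -n "$pod: "
      kubectl -n kube-system logs "$pod" -c cilium-agent \
        | grep -c "Timeout waiting for response to forwarded proxied DNS lookup"
    done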

Anything else?

Workaround: restart the Cilium DaemonSet after the last control-plane node has been replaced and is running (see the sketch below). After the restart, network traffic only partially works for about 10 minutes, until all the Cilium pods have come back up.
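
A minimal sketch of that restart, assuming Cilium runs in kube-system under the default DaemonSet name:

    kubectl -n kube-system rollout restart daemonset/cilium
    kubectl -n kube-system rollout status daemonset/cilium --timeout=10m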

Cilium Users Document

Code of Conduct

joestringer commented 8 months ago

Hi, a few observations:

⚠️ cniconflist-cilium-t6tpd: error dialing backend: dial tcp 10.1.17.193:10250: i/o timeout

It looks like the Cilium CLI is also having trouble connecting to the k8s control plane. How is Cilium CLI getting these IPs to attempt to connect to the control plane? Maybe the CLI is having a similar problem to the Cilium DaemonSet in the cluster?
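
One quick way to check whether that IP still maps to a live node (a generic sketch, not something that was run here):

    # Does any current Node object still advertise 10.1.17.193 as its internal IP?
    kubectl get nodes -o wide | grep 10.1.17.193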

Timeout waiting for response to forwarded proxied DNS lookup, count=247019

It looks like the Pods are making DNS requests to a kube-dns instance that has been taken down. Probably what happens is that the Pod connects to the virtual IP of the kube-dns service, then Cilium is translating those requests to an old kube-dns server because Cilium hasn't successfully migrated to the new control plane yet. I suspect this is a symptom of the core problem.
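
One way to test that would be to compare the backends Cilium has programmed for the kube-dns ClusterIP against what Kubernetes currently advertises; a sketch, assuming the default kube-dns Service name and the standard Cilium DaemonSet:

    # Cilium's datapath view of Services and their backends (run on one agent)
    kubectl -n kube-system exec ds/cilium -c cilium-agent -- cilium service list
    # Backends Kubernetes currently reports for kube-dns
    kubectl -n kube-system get endpoints kube-dns -o wide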

I don't believe that there are tests in the tree for migrating the Kubernetes control plane, but I agree that covering this test case would be useful for the project. I've filed https://github.com/cilium/cilium/issues/31140 to track that, since it would help to narrow down the root cause of the problem here.

Maybe something that could help is to get the full output of kubectl get endpoints -o yaml and kubectl -n kube-system get endpoints -o yaml to confirm whether the k8s services are fully migrated over in k8s objects?
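
For completeness, a sketch of capturing that output to files for attaching here:

    kubectl get endpoints -o yaml > endpoints-default.yaml
    kubectl -n kube-system get endpoints -o yaml > endpoints-kube-system.yaml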

github-actions[bot] commented 6 months ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

github-actions[bot] commented 6 months ago

This issue has not seen any activity since it was marked stale. Closing.