Hi, a few observations:
⚠️ cniconflist-cilium-t6tpd: error dialing backend: dial tcp 10.1.17.193:10250: i/o timeout
It looks like the Cilium CLI is also having trouble connecting to the k8s control plane. How is Cilium CLI getting these IPs to attempt to connect to the control plane? Maybe the CLI is having a similar problem to the Cilium DaemonSet in the cluster?
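For what it's worth, port 10250 is the kubelet port, and "error dialing backend" usually means the API server is proxying an exec/logs request to a node's kubelet that is no longer reachable. A quick way to check whether 10.1.17.193 still belongs to a live node (a hedged sketch, not from the original report):

```shell
# List nodes with their internal IPs; if 10.1.17.193 no longer appears,
# the sysdump is dialing a node that has already been replaced.
kubectl get nodes -o wide

# Cross-check which node the unreachable Cilium pod is scheduled on.
kubectl -n kube-system get pod cilium-t6tpd -o jsonpath='{.spec.nodeName}{"\n"}'
```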
Timeout waiting for response to forwarded proxied DNS lookup, count=247019
It looks like the Pods are making DNS requests to a kube-dns instance that has been taken down. Probably what happens is that the Pod connects to the virtual IP of the kube-dns service, then Cilium is translating those requests to an old kube-dns server because Cilium hasn't successfully migrated to the new control plane yet. I suspect this is a symptom of the core problem.
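If that theory holds, the stale backend should be visible by comparing what Kubernetes and the Cilium agent each believe the kube-dns backends are. A rough sketch, assuming Cilium runs in the kube-system namespace (pod/DaemonSet names may differ in your install):

```shell
# Backends Kubernetes currently advertises for kube-dns.
kubectl -n kube-system get endpoints kube-dns -o wide

# Backends the Cilium datapath is actually translating the service VIP to;
# stale entries here would send DNS requests to a terminated kube-dns pod.
kubectl -n kube-system exec ds/cilium -c cilium-agent -- cilium service list
```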
I don't believe that there are tests in the tree for migrating the Kubernetes control plane, but I agree that covering this test case would be useful for the project. I've filed https://github.com/cilium/cilium/issues/31140 to track that, since it would help to narrow down the root cause of the problem here.
Something that might help is the full output of `kubectl get endpoints -o yaml` and `kubectl -n kube-system get endpoints -o yaml`, to confirm whether the Kubernetes Services have fully migrated over to the new control plane in the Kubernetes objects.
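Concretely, something like the following (the extra per-object checks on the `kubernetes` and `kube-dns` Endpoints are my suggestion, not part of the original request):

```shell
# Dump all Endpoints objects; backend addresses should belong to current nodes.
kubectl get endpoints -o yaml
kubectl -n kube-system get endpoints -o yaml

# The 'kubernetes' and 'kube-dns' Endpoints are the most telling ones: after a
# control plane migration they should list only the replacement nodes' IPs.
kubectl get endpoints kubernetes -o yaml
kubectl -n kube-system get endpoints kube-dns -o yaml
```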
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
This issue has not seen any activity since it was marked stale. Closing.
Is there an existing issue for this?
What happened?
"i/o timeout" errors appear in the logs of running pods connecting to external and internal endpoints.
Debugging on Cilium:
Cilium Version
1.14.7, generated from the Cilium Helm chart with the following parameters:
Kernel Version
Linux ip-10-0-39-88 5.15.136-flatcar
Kubernetes Version
1.28.3
Regression
No response
Sysdump
🔍 Collecting sysdump with cilium-cli version: v0.12.3, args: [sysdump]
🔍 Collecting Kubernetes nodes
🔍 Collect Kubernetes nodes
🔍 Collecting Kubernetes events
🔍 Collecting Kubernetes pods
🔍 Collecting Kubernetes services
🔍 Collecting Kubernetes namespaces
🔍 Collect Kubernetes version
🔍 Collecting Kubernetes pods summary
🔍 Collecting Kubernetes endpoints
🔍 Collecting Kubernetes network policies
🔍 Collecting Cilium cluster-wide network policies
🔍 Collecting Cilium network policies
🔍 Collecting Cilium local redirect policies
🔍 Collecting Cilium egress NAT policies
🔍 Collecting Cilium identities
🔍 Collecting Cilium endpoints
🔍 Collecting Cilium nodes
🔍 Collecting Ingresses
🔍 Collecting CiliumClusterwideEnvoyConfigs
🔍 Collecting CiliumEnvoyConfigs
🔍 Collecting Cilium etcd secret
🔍 Collecting the Cilium configuration
🔍 Collecting the Cilium daemonset(s)
🔍 Collecting the Hubble daemonset
🔍 Collecting the Hubble Relay configuration
🔍 Collecting the Hubble Relay deployment
🔍 Collecting the Cilium operator deployment
🔍 Collecting the Hubble UI deployment
🔍 Collecting the 'clustermesh-apiserver' deployment
🔍 Collecting the CNI configuration files from Cilium pods
🔍 Collecting the CNI configmap
🔍 Collecting gops stats from Cilium pods
🔍 Collecting gops stats from Hubble pods
🔍 Collecting gops stats from Hubble Relay pods
⚠️ Deployment "clustermesh-apiserver" not found in namespace "kube-system" - this is expected if 'clustermesh-apiserver' isn't enabled
🔍 Collecting bugtool output from Cilium pods
🔍 Collecting logs from Cilium pods
🔍 Collecting logs from 'clustermesh-apiserver' pods
🔍 Collecting logs from Cilium operator pods
🔍 Collecting logs from Hubble pods
🔍 Collecting logs from Hubble Relay pods
🔍 Collecting logs from Hubble UI pods
🔍 Collecting bugtool output from Tetragon pods
🔍 Collecting platform-specific data
🔍 Collecting kvstore data
🔍 Collecting Hubble flows from Cilium pods
⚠️ The following tasks failed, the sysdump may be incomplete:
⚠️ [11] Collecting Cilium egress NAT policies: failed to collect Cilium egress NAT policies: the server could not find the requested resource (get ciliumegressnatpolicies.cilium.io)
⚠️ [12] Collecting Cilium local redirect policies: failed to collect Cilium local redirect policies: the server could not find the requested resource (get ciliumlocalredirectpolicies.cilium.io)
⚠️ [17] Collecting CiliumClusterwideEnvoyConfigs: failed to collect CiliumClusterwideEnvoyConfigs: the server could not find the requested resource (get ciliumclusterwideenvoyconfigs.cilium.io)
⚠️ [18] Collecting CiliumEnvoyConfigs: failed to collect CiliumEnvoyConfigs: the server could not find the requested resource (get ciliumenvoyconfigs.cilium.io)
⚠️ cniconflist-cilium-t6tpd: error dialing backend: dial tcp 10.1.17.193:10250: i/o timeout
⚠️ cilium-bugtool-cilium-t6tpd: failed to collect 'cilium-bugtool' output for "cilium-t6tpd" in namespace "kube-system": error dialing backend: dial tcp 10.1.17.193:10250: i/o timeout:
⚠️ hubble-flows-cilium-t6tpd: failed to collect hubble flows for "cilium-t6tpd" in namespace "kube-system": error dialing backend: dial tcp 10.1.17.193:10250: i/o timeout:
⚠️ gops-cilium-t6tpd-memstats: failed to list processes "cilium-t6tpd" ("cilium-agent") in namespace "kube-system": error dialing backend: dial tcp 10.1.17.193:10250: i/o timeout
⚠️ gops-cilium-t6tpd-stack: failed to list processes "cilium-t6tpd" ("cilium-agent") in namespace "kube-system": error dialing backend: dial tcp 10.1.17.193:10250: i/o timeout
⚠️ gops-cilium-t6tpd-stats: failed to list processes "cilium-t6tpd" ("cilium-agent") in namespace "kube-system": error dialing backend: dial tcp 10.1.17.193:10250: i/o timeout
⚠️ logs-cilium-t6tpd-cilium-agent: failed to collect logs for "cilium-t6tpd" ("cilium-agent") in namespace "kube-system": pods "ip-10-1-17-193.eu-west-1.compute.internal" not found
⚠️ logs-cilium-t6tpd-config: failed to collect logs for "cilium-t6tpd" ("config") in namespace "kube-system": pods "ip-10-1-17-193.eu-west-1.compute.internal" not found
⚠️ logs-cilium-t6tpd-mount-cgroup: failed to collect logs for "cilium-t6tpd" ("mount-cgroup") in namespace "kube-system": pods "ip-10-1-17-193.eu-west-1.compute.internal" not found
⚠️ logs-cilium-t6tpd-apply-sysctl-overwrites: failed to collect logs for "cilium-t6tpd" ("apply-sysctl-overwrites") in namespace "kube-system": pods "ip-10-1-17-193.eu-west-1.compute.internal" not found
⚠️ logs-cilium-t6tpd-mount-bpf-fs: failed to collect logs for "cilium-t6tpd" ("mount-bpf-fs") in namespace "kube-system": pods "ip-10-1-17-193.eu-west-1.compute.internal" not found
⚠️ logs-cilium-t6tpd-clean-cilium-state: failed to collect logs for "cilium-t6tpd" ("clean-cilium-state") in namespace "kube-system": pods "ip-10-1-17-193.eu-west-1.compute.internal" not found
⚠️ logs-cilium-t6tpd-install-cni-binaries: failed to collect logs for "cilium-t6tpd" ("install-cni-binaries") in namespace "kube-system": pods "ip-10-1-17-193.eu-west-1.compute.internal" not found
⚠️ Please note that depending on your Cilium version and installation options, this may be expected
🗳 Compiling sysdump
Relevant log output
Anything else?
Workaround: restart the Cilium DaemonSet after the last control plane node is replaced and running. Network traffic then only partially works for about 10 minutes, until all the Cilium pods have restarted.
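A minimal sketch of that workaround, assuming Cilium is installed in the kube-system namespace under the default DaemonSet name:

```shell
# Roll the Cilium agent DaemonSet and wait for all pods to come back.
kubectl -n kube-system rollout restart daemonset/cilium
kubectl -n kube-system rollout status daemonset/cilium --timeout=10m
```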
Cilium Users Document
Code of Conduct