Closed GuillaumeSmaha closed 3 years ago
Hello @GuillaumeSmaha, I am facing a very similar issue. Were you able to find a solution?
Hi @ybialik, not yet. I'm continuing to dig for more information.
@GuillaumeSmaha thanks, can you please update here in case you find one?
We discussed this yesterday in the Cilium community meeting. We assume this issue has the same root cause as #16375; we have not identified it yet.
@gandro awesome, can't wait :)
We're trying to reproduce the issue so that we can debug it further, but so far without success.
In case any of you have a test cluster where the problem is reproducible, it would be incredibly helpful if you could gather more information with flow debug logging enabled, by adding
debug: "true"
debug-verbose: "flow"
to the cilium-config ConfigMap.

I wrote a script to loop over connectivity test installation and removal until it hit the ContainerCreating issue, and it did :)
➜ Documents kb get pods -n cilium-test
NAME READY STATUS RESTARTS AGE
echo-a-57cbbd9b8b-cqdpf 1/1 Running 0 2m18s
echo-b-6db5fc8ff8-5v7fw 0/1 ContainerCreating 0 2m17s
echo-b-host-76d89978c-68tmb 1/1 Running 2 2m17s
host-to-b-multi-node-clusterip-fd6868749-gjcnd 0/1 Running 2 2m9s
host-to-b-multi-node-headless-54fbc4659f-lwvgt 0/1 Running 2 2m8s
pod-to-a-648fd74787-z7tsv 1/1 Running 0 2m16s
pod-to-a-allowed-cnp-7776c879f-ptdn4 1/1 Running 0 2m13s
pod-to-a-denied-cnp-b5ff897c7-wzsg9 1/1 Running 0 2m14s
pod-to-b-intra-node-nodeport-7b6cd69659-vg7hj 0/1 Running 2 2m7s
pod-to-b-multi-node-clusterip-7d54c74c5f-65g7c 0/1 Running 2 2m11s
pod-to-b-multi-node-headless-76db68d547-9v9fl 0/1 Running 2 2m10s
pod-to-b-multi-node-nodeport-66ddcd6ff9-7gm67 0/1 Running 2 2m8s
pod-to-external-1111-6d4f9d9645-t8s8l 1/1 Running 0 2m15s
pod-to-external-fqdn-allow-google-cnp-5bc496897c-728h6 1/1 Running 0 2m12s
Describing the pod:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 2m11s default-scheduler Successfully assigned cilium-test/echo-b-6db5fc8ff8-5v7fw to ip-10-80-166-77.ec2.internal
Warning FailedCreatePodSandBox 40s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "25bbd2dbfcdf1f1acc2f2d06cf5bd74525dcb163adad82828bcb46bd82fa5a98" network for pod "echo-b-6db5fc8ff8-5v7fw": networkPlugin cni failed to set up pod "echo-b-6db5fc8ff8-5v7fw_cilium-test" network: Unable to create endpoint: Cilium API client timeout exceeded
Normal SandboxChanged 40s kubelet Pod sandbox changed, it will be killed and re-created.
I am creating the sysdump now and will upload it after it is done.
Because it is a large file, I am sending it via Google Drive: https://drive.google.com/file/d/1-krHnaRQai0rJgnw5Q7Ke_wjQl7o8UbM/view?usp=sharing
@gandro My current configuration is using:
debug: "true"
debug-verbose: datapath,flow,kvstore,envoy
The cilium sysdump was taken with this configuration too: cilium-sysdump-20210628-120117.tar.xz.zip
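For clarity, that configuration corresponds to a ConfigMap fragment roughly like the following (a sketch; the kube-system namespace, and restarting the cilium DaemonSet so the agents pick up the change, are assumptions on my side):

```yaml
# Fragment of the cilium-config ConfigMap (assumed namespace: kube-system).
# After applying, restart the cilium DaemonSet so the agents reload the config.
data:
  debug: "true"
  debug-verbose: "datapath,flow,kvstore,envoy"
```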
Any update on this?
Cross-posting from https://github.com/cilium/cilium/issues/16375#issuecomment-876968691
This is likely fixed by #16834, would be great if you could validate that.
A workaround would be to not configure
--proxy-prometheus-port
until this fix is in.
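In case it helps others, my understanding of the workaround in ConfigMap terms (a sketch: the key name mirrors the flag, the port value is a placeholder, and kube-system is an assumed namespace):

```yaml
# cilium-config ConfigMap: keep the proxy-prometheus-port key unset until the
# fix is in; if it is currently present, remove it and restart the cilium agents.
data:
  # proxy-prometheus-port: "<port>"   # leave commented out / absent for now
```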
Thanks @gandro, I will see if we can check the workaround.
I tried the workaround (disabling prometheus) and it works :+1:
Bug report

A few hours after installing Cilium, one Cilium pod started to report the error "regeneration recovery failed". On this node, it is not possible to schedule new pods until Cilium is restarted. cilium status is reporting the error:

controller endpoint-3282-regeneration-recovery is failing since 15m48s (788x): regeneration recovery failed

A pod is stuck in the ContainerCreating state (kubectl -n cilium-test describe po client-8655c47fd7-s2x9r):

General Information
cilium version

Kernel version (uname -a):
Linux dc1-k8s-02 5.8.0-55-generic #62~20.04.1-Ubuntu SMP Wed Jun 2 08:55:04 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Linux dc1-k8s-03 5.8.0-44-generic #50~20.04.1-Ubuntu SMP Wed Feb 10 21:07:30 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Linux dc1-k8s-04 5.8.0-44-generic #50~20.04.1-Ubuntu SMP Wed Feb 10 21:07:30 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Linux dc1-inf-15 5.4.0-74-generic #83~18.04.1-Ubuntu SMP Tue May 11 16:01:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Linux dc3-8gpu-32 5.4.0-74-generic #83~18.04.1-Ubuntu SMP Tue May 11 16:01:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Kubernetes version (kubectl version):
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:18:45Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:12:29Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
cilium status:
    Cilium:      1 errors
    Operator:    OK
    Hubble:      OK
    ClusterMesh: disabled
DaemonSet         cilium             Desired: 5, Ready: 5/5, Available: 5/5
Deployment        cilium-operator    Desired: 3, Ready: 3/3, Available: 3/3
Deployment        hubble-relay       Desired: 1, Ready: 1/1, Available: 1/1
Containers:       cilium             Running: 5
                  cilium-operator    Running: 3
                  hubble-relay       Running: 1
Image versions    hubble-relay       quay.io/cilium/hubble-relay:v1.10.0@sha256:e92e6778c71aa9e181618d61e9403761ad061c3960a9203aa2cf8e6cde95c9d7: 1
                  cilium             quay.io/cilium/cilium:v1.10.0@sha256:587627d909ffe0418c0bd907516496844867a21812946af82096d367760e4c1e: 5
                  cilium-operator    quay.io/cilium/operator-generic:v1.10.0@sha256:65143311a62a95dbe23c69ff2f624e0fdf030eb225e6375d889da66a955dd828: 3
Errors:           cilium             cilium-bn9k4    controller endpoint-3282-regeneration-recovery is failing since 15m48s (788x): regeneration recovery failed
2021-06-28T12:01:17.357326306Z level=warning msg="Creation of endpoint failed" containerID= datapathPolicyRevision=0 desiredPolicyRevision=73 endpointID=3757 error="timeout while waiting for initial endpoint generation to complete" identity=27434 ipv4= ipv6= k8sPodName=/ subsys=daemon
2021-06-28T12:01:17.357343126Z level=debug msg="End of create request" containerID= datapathPolicyRevision=0 desiredPolicyRevision=73 endpointID=3757 identity=27434 ipv4= ipv6= k8sPodName=/ subsys=daemon
2021-06-28T12:01:17.357349938Z level=info msg="API call has been processed" error="timeout while waiting for initial endpoint generation to complete" name=endpoint-create processingDuration=5m30.059449528s subsys=rate totalDuration=5m30.059535635s uuid=d83be27b-4db7-4c1f-8ec7-2cc5ce761214 waitDurationTotal="64.286µs"
2021-06-28T12:01:17.300002355Z level=debug msg="Completed endpoint regeneration with no pending regeneration requests" code=OK containerID= datapathPolicyRevision=0 desiredPolicyRevision=73 endpointID=3757 endpointState=ready identity=27434 ipv4= ipv6= k8sPodName=/ policyRevision=0 subsys=endpoint type=0
2021-06-28T12:01:17.300037547Z level=warning msg="Regeneration of endpoint failed" bpfCompilation=0s bpfLoadProg=152.954367ms bpfWaitForELF="4.658µs" bpfWriteELF="899.396µs" containerID= datapathPolicyRevision=0 desiredPolicyRevision=73 endpointID=3757 error="Error while configuring proxy redirects: proxy state changes failed: context deadline exceeded" identity=27434 ipv4= ipv6= k8sPodName=/ mapSync="137.402µs" policyCalculation="80.937µs" prepareBuild="423.455µs" proxyConfiguration="10.854µs" proxyPolicyCalculation="51.7µs" proxyWaitForAck=5m29.792614184s reason="updated security labels" subsys=endpoint total=5m30.000731126s waitingForCTClean=51.258622ms waitingForLock=718ns
2021-06-28T12:01:17.300051594Z level=debug msg="Error regenerating endpoint: Error while configuring proxy redirects: proxy state changes failed: context deadline exceeded" code=Failure containerID= datapathPolicyRevision=0 desiredPolicyRevision=73 endpointID=3757 endpointState=ready identity=27434 ipv4= ipv6= k8sPodName=/ policyRevision=0 subsys=endpoint type=200
2021-06-28T12:00:37.077455595Z level=debug msg="Distributed lock successful, etcd has quorum" config=/var/lib/etcd-config/etcd.config endpoints="[https://192.168.170.210:2279 https://192.168.170.211:2279 https://192.168.170.28:2279]" subsys=kvstore
2021-06-28T12:00:37.077507299Z level=debug msg="Checking status to etcd endpoint https://192.168.170.210:2279" config=/var/lib/etcd-config/etcd.config endpoints="[https://192.168.170.210:2279 https://192.168.170.211:2279 https://192.168.170.28:2279]" subsys=kvstore
2021-06-28T12:00:37.087813398Z level=debug msg="Checking status to etcd endpoint https://192.168.170.211:2279" config=/var/lib/etcd-config/etcd.config endpoints="[https://192.168.170.210:2279 https://192.168.170.211:2279 https://192.168.170.28:2279]" subsys=kvstore
2021-06-28T12:00:37.098259674Z level=debug msg="Checking status to etcd endpoint https://192.168.170.28:2279" config=/var/lib/etcd-config/etcd.config endpoints="[https://192.168.170.210:2279 https://192.168.170.211:2279 https://192.168.170.28:2279]" subsys=kvstore
level=debug msg="Received update notification of heartbeat" subsys=kvstore