cilium / cilium

eBPF-based Networking, Security, and Observability
https://cilium.io
Apache License 2.0

Pod creation blocked with `Error regenerating endpoint: proxy state changes failed` #16667

Closed GuillaumeSmaha closed 3 years ago

GuillaumeSmaha commented 3 years ago

Bug report

A few hours after installing Cilium, one Cilium pod started to report the error `regeneration recovery failed`. On that node, it is not possible to schedule new pods until Cilium is restarted.

`cilium status` reports the error `controller endpoint-3282-regeneration-recovery is failing since 15m48s (788x): regeneration recovery failed`.

The pod is stuck in the `ContainerCreating` state (`kubectl -n cilium-test describe po client-8655c47fd7-s2x9r`):

  Warning  FailedCreatePodSandBox  39m                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "e5e8e6995eb73081cb01b908dca450ddedb749ce3a045ef0a3f87f9b232e9849": Unable to create endpoint: Cilium API client timeout exceeded
  Warning  FailedCreatePodSandBox  76s (x22 over 37m)  kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "90a4f585a4e543edf81e37f0305d9cbcb5b1f75308303c2ecf079800cc5409e2": Unable to create endpoint: Cilium API client timeout exceeded

General Information

- Kernel version (run `uname -a`)

Linux dc1-k8s-02 5.8.0-55-generic #62~20.04.1-Ubuntu SMP Wed Jun 2 08:55:04 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Linux dc1-k8s-03 5.8.0-44-generic #50~20.04.1-Ubuntu SMP Wed Feb 10 21:07:30 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Linux dc1-k8s-04 5.8.0-44-generic #50~20.04.1-Ubuntu SMP Wed Feb 10 21:07:30 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Linux dc1-inf-15 5.4.0-74-generic #83~18.04.1-Ubuntu SMP Tue May 11 16:01:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Linux dc3-8gpu-32 5.4.0-74-generic #83~18.04.1-Ubuntu SMP Tue May 11 16:01:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

- Orchestration system version in use (e.g. `kubectl version`, ...)

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:18:45Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:12:29Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}

- Configuration:

  Nodes:
  - 3 masters
  - 2 nodes

  Cilium:
  - vxlan enabled
  - etcd cluster for cilium (one instance per master)
  - 3 operators (one per master)
- Cilium Status

      Cilium:          1 errors
      Operator:        OK
      Hubble:          OK
      ClusterMesh:     disabled

      DaemonSet         cilium             Desired: 5, Ready: 5/5, Available: 5/5
      Deployment        cilium-operator    Desired: 3, Ready: 3/3, Available: 3/3
      Deployment        hubble-relay       Desired: 1, Ready: 1/1, Available: 1/1
      Containers:       cilium             Running: 5
                        cilium-operator    Running: 3
                        hubble-relay       Running: 1
      Image versions    hubble-relay       quay.io/cilium/hubble-relay:v1.10.0@sha256:e92e6778c71aa9e181618d61e9403761ad061c3960a9203aa2cf8e6cde95c9d7: 1
                        cilium             quay.io/cilium/cilium:v1.10.0@sha256:587627d909ffe0418c0bd907516496844867a21812946af82096d367760e4c1e: 5
                        cilium-operator    quay.io/cilium/operator-generic:v1.10.0@sha256:65143311a62a95dbe23c69ff2f624e0fdf030eb225e6375d889da66a955dd828: 3
      Errors:           cilium             cilium-bn9k4    controller endpoint-3282-regeneration-recovery is failing since 15m48s (788x): regeneration recovery failed

- Link to relevant artifacts (policies, deployment scripts, ...)
No policies.
Deployments of Traefik and MetalLB.
- Generate and upload a system zip:

(I had to use `xz` to reduce the size from 44 MB to 16 MB.)
[cilium-sysdump-20210628-120117.tar.xz.zip](https://github.com/cilium/cilium/files/6726908/cilium-sysdump-20210628-120117.tar.xz.zip)

**How to reproduce the issue**

Run the connectivity test twice:
- `kubectl delete ns cilium-test ; cilium connectivity test`
- `kubectl delete ns cilium-test ; cilium connectivity test`

Restarting the Cilium pod temporarily fixes the issue.

**Details**

The Cilium pod correctly receives the request, but it fails with:
`kubectl -n kube-system logs cilium-bn9k4`

2021-06-28T12:01:17.357326306Z level=warning msg="Creation of endpoint failed" containerID= datapathPolicyRevision=0 desiredPolicyRevision=73 endpointID=3757 error="timeout while waiting for initial endpoint generation to complete" identity=27434 ipv4= ipv6= k8sPodName=/ subsys=daemon
2021-06-28T12:01:17.357343126Z level=debug msg="End of create request" containerID= datapathPolicyRevision=0 desiredPolicyRevision=73 endpointID=3757 identity=27434 ipv4= ipv6= k8sPodName=/ subsys=daemon
2021-06-28T12:01:17.357349938Z level=info msg="API call has been processed" error="timeout while waiting for initial endpoint generation to complete" name=endpoint-create processingDuration=5m30.059449528s subsys=rate totalDuration=5m30.059535635s uuid=d83be27b-4db7-4c1f-8ec7-2cc5ce761214 waitDurationTotal="64.286µs"


Regeneration is failing with:
`kubectl -n kube-system logs cilium-bn9k4`

2021-06-28T12:01:17.300002355Z level=debug msg="Completed endpoint regeneration with no pending regeneration requests" code=OK containerID= datapathPolicyRevision=0 desiredPolicyRevision=73 endpointID=3757 endpointState=ready identity=27434 ipv4= ipv6= k8sPodName=/ policyRevision=0 subsys=endpoint type=0
2021-06-28T12:01:17.300037547Z level=warning msg="Regeneration of endpoint failed" bpfCompilation=0s bpfLoadProg=152.954367ms bpfWaitForELF="4.658µs" bpfWriteELF="899.396µs" containerID= datapathPolicyRevision=0 desiredPolicyRevision=73 endpointID=3757 error="Error while configuring proxy redirects: proxy state changes failed: context deadline exceeded" identity=27434 ipv4= ipv6= k8sPodName=/ mapSync="137.402µs" policyCalculation="80.937µs" prepareBuild="423.455µs" proxyConfiguration="10.854µs" proxyPolicyCalculation="51.7µs" proxyWaitForAck=5m29.792614184s reason="updated security labels" subsys=endpoint total=5m30.000731126s waitingForCTClean=51.258622ms waitingForLock=718ns
2021-06-28T12:01:17.300051594Z level=debug msg="Error regenerating endpoint: Error while configuring proxy redirects: proxy state changes failed: context deadline exceeded" code=Failure containerID= datapathPolicyRevision=0 desiredPolicyRevision=73 endpointID=3757 endpointState=ready identity=27434 ipv4= ipv6= k8sPodName=/ policyRevision=0 subsys=endpoint type=200


I checked the etcd connection:

2021-06-28T12:00:37.077455595Z level=debug msg="Distributed lock successful, etcd has quorum" config=/var/lib/etcd-config/etcd.config endpoints="[https://192.168.170.210:2279 https://192.168.170.211:2279 https://192.168.170.28:2279]" subsys=kvstore
2021-06-28T12:00:37.077507299Z level=debug msg="Checking status to etcd endpoint https://192.168.170.210:2279" config=/var/lib/etcd-config/etcd.config endpoints="[https://192.168.170.210:2279 https://192.168.170.211:2279 https://192.168.170.28:2279]" subsys=kvstore
2021-06-28T12:00:37.087813398Z level=debug msg="Checking status to etcd endpoint https://192.168.170.211:2279" config=/var/lib/etcd-config/etcd.config endpoints="[https://192.168.170.210:2279 https://192.168.170.211:2279 https://192.168.170.28:2279]" subsys=kvstore
2021-06-28T12:00:37.098259674Z level=debug msg="Checking status to etcd endpoint https://192.168.170.28:2279" config=/var/lib/etcd-config/etcd.config endpoints="[https://192.168.170.210:2279 https://192.168.170.211:2279 https://192.168.170.28:2279]" subsys=kvstore


The `cilium` & `cilium-operator` pods receive heartbeat notifications:

level=debug msg="Received update notification of heartbeat" subsys=kvstore

ybialik commented 3 years ago

Hello @GuillaumeSmaha, I am facing a very similar issue. Were you able to find a solution?

GuillaumeSmaha commented 3 years ago

Hi @ybialik, not yet. I am continuing to dig for more information.

ybialik commented 3 years ago

@GuillaumeSmaha thanks, can you please update here in case you find one?

gandro commented 3 years ago

We discussed this yesterday in the Cilium community meeting. We assume this issue has the same root cause as #16375 - we have not identified it yet.

ybialik commented 3 years ago

@gandro awesome, can't wait :)

gandro commented 3 years ago

We're trying to reproduce the issue in order to debug it further, but so far without success.

In case any of you have a test cluster where the problem is reproducible, it would be incredibly helpful if we could gather more information with flow debug logging enabled (a command sketch follows the steps below):

  1. Add `debug: "true"` and `debug-verbose: "flow"` to the `cilium-config` ConfigMap
  2. Restart the cilium pods for the above settings to be applied
  3. Reproduce the issue by re-running the connectivity test until pods are stuck pending
  4. Capture a sysdump
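
A minimal sketch of steps 1 and 2 using `kubectl` (the `kube-system` namespace and DaemonSet name are assumptions based on a default installation; adjust to your environment):

    # Step 1: enable flow debug logging in the cilium-config ConfigMap
    kubectl -n kube-system patch configmap cilium-config --type merge \
      -p '{"data":{"debug":"true","debug-verbose":"flow"}}'
    # Step 2: restart the cilium agents so the new settings take effect
    kubectl -n kube-system rollout restart daemonset/cilium
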
ybialik commented 3 years ago

I wrote a script to loop over connectivity test installation and removal until it hit the ContainerCreating issue, and it did :)
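
A rough sketch of such a loop (the actual script was not shared, so the check below is only illustrative), reusing the reproduction commands from above:

    # Reinstall the connectivity test until a pod gets stuck in ContainerCreating
    while true; do
      kubectl delete ns cilium-test --ignore-not-found
      cilium connectivity test
      if kubectl get pods -n cilium-test | grep -q ContainerCreating; then
        break
      fi
    done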

➜  Documents kb get pods -n cilium-test
NAME                                                     READY   STATUS              RESTARTS   AGE
echo-a-57cbbd9b8b-cqdpf                                  1/1     Running             0          2m18s
echo-b-6db5fc8ff8-5v7fw                                  0/1     ContainerCreating   0          2m17s
echo-b-host-76d89978c-68tmb                              1/1     Running             2          2m17s
host-to-b-multi-node-clusterip-fd6868749-gjcnd           0/1     Running             2          2m9s
host-to-b-multi-node-headless-54fbc4659f-lwvgt           0/1     Running             2          2m8s
pod-to-a-648fd74787-z7tsv                                1/1     Running             0          2m16s
pod-to-a-allowed-cnp-7776c879f-ptdn4                     1/1     Running             0          2m13s
pod-to-a-denied-cnp-b5ff897c7-wzsg9                      1/1     Running             0          2m14s
pod-to-b-intra-node-nodeport-7b6cd69659-vg7hj            0/1     Running             2          2m7s
pod-to-b-multi-node-clusterip-7d54c74c5f-65g7c           0/1     Running             2          2m11s
pod-to-b-multi-node-headless-76db68d547-9v9fl            0/1     Running             2          2m10s
pod-to-b-multi-node-nodeport-66ddcd6ff9-7gm67            0/1     Running             2          2m8s
pod-to-external-1111-6d4f9d9645-t8s8l                    1/1     Running             0          2m15s
pod-to-external-fqdn-allow-google-cnp-5bc496897c-728h6   1/1     Running             0          2m12s

Describing the pod:

Events:
  Type     Reason                  Age    From               Message
  ----     ------                  ----   ----               -------
  Normal   Scheduled               2m11s  default-scheduler  Successfully assigned cilium-test/echo-b-6db5fc8ff8-5v7fw to ip-10-80-166-77.ec2.internal
  Warning  FailedCreatePodSandBox  40s    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "25bbd2dbfcdf1f1acc2f2d06cf5bd74525dcb163adad82828bcb46bd82fa5a98" network for pod "echo-b-6db5fc8ff8-5v7fw": networkPlugin cni failed to set up pod "echo-b-6db5fc8ff8-5v7fw_cilium-test" network: Unable to create endpoint: Cilium API client timeout exceeded
  Normal   SandboxChanged          40s    kubelet            Pod sandbox changed, it will be killed and re-created.

I am creating the sysdump now and will upload it after it is done.

ybialik commented 3 years ago

Because it is a large file, I'm sharing it via Google Drive: https://drive.google.com/file/d/1-krHnaRQai0rJgnw5Q7Ke_wjQl7o8UbM/view?usp=sharing

GuillaumeSmaha commented 3 years ago

@gandro My current configuration is using:

  debug: "true"
  debug-verbose: datapath,flow,kvstore,envoy

The Cilium sysdump was taken with this configuration too: cilium-sysdump-20210628-120117.tar.xz.zip

ybialik commented 3 years ago

Any update on this?

gandro commented 3 years ago

Cross-posting from https://github.com/cilium/cilium/issues/16375#issuecomment-876968691

This is likely fixed by #16834; it would be great if you could validate that.

A workaround would be to not configure `--proxy-prometheus-port` until this fix is in.
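
For example, a sketch assuming the port was enabled via the `proxy-prometheus-port` key in the `cilium-config` ConfigMap (adjust if it was set through Helm values instead):

    # Remove the proxy-prometheus-port setting and restart the agents
    kubectl -n kube-system patch configmap cilium-config --type json \
      -p '[{"op":"remove","path":"/data/proxy-prometheus-port"}]'
    kubectl -n kube-system rollout restart daemonset/cilium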

ybialik commented 3 years ago

Thanks @gandro, I will see if we can check the workaround.

GuillaumeSmaha commented 3 years ago

I tried the workaround (disabling the proxy Prometheus port) and it works :+1: