cilium / cilium

eBPF-based Networking, Security, and Observability
https://cilium.io
Apache License 2.0

Pod creation blocked with `Error regenerating endpoint: proxy state changes failed` #16667

Closed GuillaumeSmaha closed 3 years ago

GuillaumeSmaha commented 3 years ago

Bug report

A few hours after installing Cilium, one Cilium pod started to report the error `regeneration recovery failed`. On that node, it is not possible to schedule new pods until Cilium is restarted.

`cilium status` reports the error `controller endpoint-3282-regeneration-recovery is failing since 15m48s (788x): regeneration recovery failed`.

The pod is stuck in the `ContainerCreating` state (`kubectl -n cilium-test describe po client-8655c47fd7-s2x9r`):

  Warning  FailedCreatePodSandBox  39m                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "e5e8e6995eb73081cb01b908dca450ddedb749ce3a045ef0a3f87f9b232e9849": Unable to create endpoint: Cilium API client timeout exceeded
  Warning  FailedCreatePodSandBox  76s (x22 over 37m)  kubelet            (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "90a4f585a4e543edf81e37f0305d9cbcb5b1f75308303c2ecf079800cc5409e2": Unable to create endpoint: Cilium API client timeout exceeded

General Information

- Kernel version (run `uname -a`)

Linux dc1-k8s-02 5.8.0-55-generic #62~20.04.1-Ubuntu SMP Wed Jun 2 08:55:04 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Linux dc1-k8s-03 5.8.0-44-generic #50~20.04.1-Ubuntu SMP Wed Feb 10 21:07:30 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Linux dc1-k8s-04 5.8.0-44-generic #50~20.04.1-Ubuntu SMP Wed Feb 10 21:07:30 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Linux dc1-inf-15 5.4.0-74-generic #83~18.04.1-Ubuntu SMP Tue May 11 16:01:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
Linux dc3-8gpu-32 5.4.0-74-generic #83~18.04.1-Ubuntu SMP Tue May 11 16:01:00 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

- Orchestration system version in use (e.g. `kubectl version`, ...)

Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:18:45Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-12T14:12:29Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}

- Configuration:

  Nodes:
  - 3 masters
  - 2 nodes

  Cilium:
  - vxlan enabled
  - etcd cluster for cilium (one instance per master)
  - 3 operators (one per master)
- Cilium Status

      Cilium:          1 errors
      Operator:        OK
      Hubble:          OK
      ClusterMesh:     disabled

      DaemonSet         cilium             Desired: 5, Ready: 5/5, Available: 5/5
      Deployment        cilium-operator    Desired: 3, Ready: 3/3, Available: 3/3
      Deployment        hubble-relay       Desired: 1, Ready: 1/1, Available: 1/1
      Containers:       cilium             Running: 5
                        cilium-operator    Running: 3
                        hubble-relay       Running: 1
      Image versions    hubble-relay       quay.io/cilium/hubble-relay:v1.10.0@sha256:e92e6778c71aa9e181618d61e9403761ad061c3960a9203aa2cf8e6cde95c9d7: 1
                        cilium             quay.io/cilium/cilium:v1.10.0@sha256:587627d909ffe0418c0bd907516496844867a21812946af82096d367760e4c1e: 5
                        cilium-operator    quay.io/cilium/operator-generic:v1.10.0@sha256:65143311a62a95dbe23c69ff2f624e0fdf030eb225e6375d889da66a955dd828: 3
      Errors:           cilium             cilium-bn9k4    controller endpoint-3282-regeneration-recovery is failing since 15m48s (788x): regeneration recovery failed

- Link to relevant artifacts (policies, deployment scripts, ...)
No policies.
Deployments of Traefik and MetalLB.
- Generate and upload a system zip:

(I had to use `xz` to reduce the size from 44 MB to 16 MB.)
[cilium-sysdump-20210628-120117.tar.xz.zip](https://github.com/cilium/cilium/files/6726908/cilium-sysdump-20210628-120117.tar.xz.zip)

**How to reproduce the issue**

Run the connectivity test twice:
- `kubectl delete ns cilium-test ; cilium connectivity test`
- `kubectl delete ns cilium-test ; cilium connectivity test`

Restarting the Cilium pod temporarily fixes the issue.

**Details**

The Cilium pod correctly receives the request, but it fails with:
`kubectl -n kube-system logs cilium-bn9k4`

2021-06-28T12:01:17.357326306Z level=warning msg="Creation of endpoint failed" containerID= datapathPolicyRevision=0 desiredPolicyRevision=73 endpointID=3757 error="timeout while waiting for initial endpoint generation to complete" identity=27434 ipv4= ipv6= k8sPodName=/ subsys=daemon
2021-06-28T12:01:17.357343126Z level=debug msg="End of create request" containerID= datapathPolicyRevision=0 desiredPolicyRevision=73 endpointID=3757 identity=27434 ipv4= ipv6= k8sPodName=/ subsys=daemon
2021-06-28T12:01:17.357349938Z level=info msg="API call has been processed" error="timeout while waiting for initial endpoint generation to complete" name=endpoint-create processingDuration=5m30.059449528s subsys=rate totalDuration=5m30.059535635s uuid=d83be27b-4db7-4c1f-8ec7-2cc5ce761214 waitDurationTotal="64.286µs"


Regeneration is failing with:
`kubectl -n kube-system logs cilium-bn9k4`

2021-06-28T12:01:17.300002355Z level=debug msg="Completed endpoint regeneration with no pending regeneration requests" code=OK containerID= datapathPolicyRevision=0 desiredPolicyRevision=73 endpointID=3757 endpointState=ready identity=27434 ipv4= ipv6= k8sPodName=/ policyRevision=0 subsys=endpoint type=0
2021-06-28T12:01:17.300037547Z level=warning msg="Regeneration of endpoint failed" bpfCompilation=0s bpfLoadProg=152.954367ms bpfWaitForELF="4.658µs" bpfWriteELF="899.396µs" containerID= datapathPolicyRevision=0 desiredPolicyRevision=73 endpointID=3757 error="Error while configuring proxy redirects: proxy state changes failed: context deadline exceeded" identity=27434 ipv4= ipv6= k8sPodName=/ mapSync="137.402µs" policyCalculation="80.937µs" prepareBuild="423.455µs" proxyConfiguration="10.854µs" proxyPolicyCalculation="51.7µs" proxyWaitForAck=5m29.792614184s reason="updated security labels" subsys=endpoint total=5m30.000731126s waitingForCTClean=51.258622ms waitingForLock=718ns
2021-06-28T12:01:17.300051594Z level=debug msg="Error regenerating endpoint: Error while configuring proxy redirects: proxy state changes failed: context deadline exceeded" code=Failure containerID= datapathPolicyRevision=0 desiredPolicyRevision=73 endpointID=3757 endpointState=ready identity=27434 ipv4= ipv6= k8sPodName=/ policyRevision=0 subsys=endpoint type=200


I checked the etcd connection:

2021-06-28T12:00:37.077455595Z level=debug msg="Distributed lock successful, etcd has quorum" config=/var/lib/etcd-config/etcd.config endpoints="[https://192.168.170.210:2279 https://192.168.170.211:2279 https://192.168.170.28:2279]" subsys=kvstore
2021-06-28T12:00:37.077507299Z level=debug msg="Checking status to etcd endpoint https://192.168.170.210:2279" config=/var/lib/etcd-config/etcd.config endpoints="[https://192.168.170.210:2279 https://192.168.170.211:2279 https://192.168.170.28:2279]" subsys=kvstore
2021-06-28T12:00:37.087813398Z level=debug msg="Checking status to etcd endpoint https://192.168.170.211:2279" config=/var/lib/etcd-config/etcd.config endpoints="[https://192.168.170.210:2279 https://192.168.170.211:2279 https://192.168.170.28:2279]" subsys=kvstore
2021-06-28T12:00:37.098259674Z level=debug msg="Checking status to etcd endpoint https://192.168.170.28:2279" config=/var/lib/etcd-config/etcd.config endpoints="[https://192.168.170.210:2279 https://192.168.170.211:2279 https://192.168.170.28:2279]" subsys=kvstore


The `cilium` & `cilium-operator` pods receive heartbeat notifications:

level=debug msg="Received update notification of heartbeat" subsys=kvstore

ybialik commented 3 years ago

Hello @GuillaumeSmaha, I am facing a very similar issue. Were you able to find a solution?

GuillaumeSmaha commented 3 years ago

Hi @ybialik, not yet. I am continuing to dig for more information.

ybialik commented 3 years ago

@GuillaumeSmaha thanks, can you please update here in case you find one?

gandro commented 3 years ago

We discussed this yesterday in the Cilium community meeting. We assume this issue has the same root cause as #16375 - we have not identified it yet.

ybialik commented 3 years ago

@gandro awesome, can't wait :)

gandro commented 3 years ago

We're trying to reproduce the issue in order to debug it further, but so far without success.

In case any of you have a test cluster where the problem is reproducible, it would be incredibly helpful if we could gather more information with flow debug logging enabled (a command sketch follows the steps below):

  1. Add `debug: "true"` and `debug-verbose: "flow"` to the `cilium-config` ConfigMap
  2. Restart the cilium pods for the above settings to be applied
  3. Reproduce the issue by re-running the connectivity test until pods are stuck pending
  4. Capture a sysdump
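
A minimal sketch of steps 1 and 2 using `kubectl` (the `kube-system` namespace and DaemonSet name are assumptions based on a default installation; adjust to your environment):

    # Step 1: enable flow debug logging in the cilium-config ConfigMap
    kubectl -n kube-system patch configmap cilium-config --type merge \
      -p '{"data":{"debug":"true","debug-verbose":"flow"}}'
    # Step 2: restart the cilium agents so the new settings take effect
    kubectl -n kube-system rollout restart daemonset/cilium
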
ybialik commented 3 years ago

I wrote a script to loop over connectivity test installation and removal until it hit the ContainerCreating issue, and it did :)
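
A rough sketch of such a loop (the actual script was not shared, so the check below is only illustrative), reusing the reproduction commands from above:

    # Reinstall the connectivity test until a pod gets stuck in ContainerCreating
    while true; do
      kubectl delete ns cilium-test --ignore-not-found
      cilium connectivity test
      if kubectl get pods -n cilium-test | grep -q ContainerCreating; then
        break
      fi
    done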

➜  Documents kb get pods -n cilium-test
NAME                                                     READY   STATUS              RESTARTS   AGE
echo-a-57cbbd9b8b-cqdpf                                  1/1     Running             0          2m18s
echo-b-6db5fc8ff8-5v7fw                                  0/1     ContainerCreating   0          2m17s
echo-b-host-76d89978c-68tmb                              1/1     Running             2          2m17s
host-to-b-multi-node-clusterip-fd6868749-gjcnd           0/1     Running             2          2m9s
host-to-b-multi-node-headless-54fbc4659f-lwvgt           0/1     Running             2          2m8s
pod-to-a-648fd74787-z7tsv                                1/1     Running             0          2m16s
pod-to-a-allowed-cnp-7776c879f-ptdn4                     1/1     Running             0          2m13s
pod-to-a-denied-cnp-b5ff897c7-wzsg9                      1/1     Running             0          2m14s
pod-to-b-intra-node-nodeport-7b6cd69659-vg7hj            0/1     Running             2          2m7s
pod-to-b-multi-node-clusterip-7d54c74c5f-65g7c           0/1     Running             2          2m11s
pod-to-b-multi-node-headless-76db68d547-9v9fl            0/1     Running             2          2m10s
pod-to-b-multi-node-nodeport-66ddcd6ff9-7gm67            0/1     Running             2          2m8s
pod-to-external-1111-6d4f9d9645-t8s8l                    1/1     Running             0          2m15s
pod-to-external-fqdn-allow-google-cnp-5bc496897c-728h6   1/1     Running             0          2m12s

Describing the pod:

Events:
  Type     Reason                  Age    From               Message
  ----     ------                  ----   ----               -------
  Normal   Scheduled               2m11s  default-scheduler  Successfully assigned cilium-test/echo-b-6db5fc8ff8-5v7fw to ip-10-80-166-77.ec2.internal
  Warning  FailedCreatePodSandBox  40s    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "25bbd2dbfcdf1f1acc2f2d06cf5bd74525dcb163adad82828bcb46bd82fa5a98" network for pod "echo-b-6db5fc8ff8-5v7fw": networkPlugin cni failed to set up pod "echo-b-6db5fc8ff8-5v7fw_cilium-test" network: Unable to create endpoint: Cilium API client timeout exceeded
  Normal   SandboxChanged          40s    kubelet            Pod sandbox changed, it will be killed and re-created.

I am creating the sysdump now and will upload it after it is done.

ybialik commented 3 years ago

Because it is a large file, I'm sharing it via Google Drive: https://drive.google.com/file/d/1-krHnaRQai0rJgnw5Q7Ke_wjQl7o8UbM/view?usp=sharing

GuillaumeSmaha commented 3 years ago

@gandro My current configuration is using:

  debug: "true"
  debug-verbose: datapath,flow,kvstore,envoy

The Cilium sysdump was taken with this configuration too: cilium-sysdump-20210628-120117.tar.xz.zip

ybialik commented 3 years ago

Any update on this?

gandro commented 3 years ago

Cross-posting from https://github.com/cilium/cilium/issues/16375#issuecomment-876968691

This is likely fixed by #16834; it would be great if you could validate that.

A workaround would be to not configure `--proxy-prometheus-port` until this fix is in.
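
For example, a sketch assuming the port was enabled via the `proxy-prometheus-port` key in the `cilium-config` ConfigMap (adjust if it was set through Helm values instead):

    # Remove the proxy-prometheus-port setting and restart the agents
    kubectl -n kube-system patch configmap cilium-config --type json \
      -p '[{"op":"remove","path":"/data/proxy-prometheus-port"}]'
    kubectl -n kube-system rollout restart daemonset/cilium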

ybialik commented 3 years ago

Thanks @gandro, I will see if we can check the workaround.

GuillaumeSmaha commented 3 years ago

I tried the workaround (disabling the proxy Prometheus port) and it works :+1: