Azure / AKS

Azure Kubernetes Service

[BUG] calico-kube-controllers Liveness and Readiness probe failed on Kubernetes version 1.28.5 #4212

Open win5923 opened 1 month ago

win5923 commented 1 month ago

Describe the bug

Event messages:
Liveness probe failed: command "/usr/bin/check-status -l" timed out
Readiness probe failed: command "/usr/bin/check-status -r" timed out

The pod logs contain no error messages.
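
To confirm the probe configuration (including timeoutSeconds) and list the failure events, something like the following should work (a sketch; the deployment name and namespace are taken from the failing pod):

kubectl -n calico-system get deploy calico-kube-controllers -o yaml | grep -B2 -A8 check-status
kubectl -n calico-system get events --field-selector reason=Unhealthy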

Screenshots: (screenshot attached)

Environment (please complete the following information):

JoeyC-Dev commented 1 month ago

TLDR: This issue looks reproducible on both 1.28.5 and 1.27.9.

I created a new AKS cluster on 1.28.5 and checked that the mcr.microsoft.com/oss/calico/node and mcr.microsoft.com/oss/calico/kube-controllers images are exactly 3.24.6 (screenshot attached).

The first error popped up 12 minutes after AKS creation:

KubeEvents
| where Namespace == "calico-system"
| where Message contains "check-status"

The container log won't give you anything. Checking the syslog, it is basically a timeout:

cat syslog | grep check-status
Apr 15 04:28:44 aks-agentpool-34739312-vmss000001 kubelet[2784]: E0415 04:28:44.075497    2784 remote_runtime.go:496] "ExecSync cmd from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" containerID="02cb51467976c311b30c8f62b8d35f5e90f5d4857805acd9c5d8a451297947f7" cmd=["/usr/bin/check-status","-l"]
Apr 15 04:28:44 aks-agentpool-34739312-vmss000001 kubelet[2784]: I0415 04:28:44.075547    2784 prober.go:107] "Probe failed" probeType="Liveness" pod="calico-system/calico-kube-controllers-b889487db-ptwld" podUID="c20ed2ff-72a9-4685-8eea-528c7d47ce8b" containerName="calico-kube-controllers" probeResult="failure" output="command \"/usr/bin/check-status -l\" timed out"
Apr 15 04:43:54 aks-agentpool-34739312-vmss000001 kubelet[2784]: E0415 04:43:54.076051    2784 remote_runtime.go:496] "ExecSync cmd from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" containerID="02cb51467976c311b30c8f62b8d35f5e90f5d4857805acd9c5d8a451297947f7" cmd=["/usr/bin/check-status","-l"]
Apr 15 04:43:54 aks-agentpool-34739312-vmss000001 kubelet[2784]: I0415 04:43:54.076099    2784 prober.go:107] "Probe failed" probeType="Liveness" pod="calico-system/calico-kube-controllers-b889487db-ptwld" podUID="c20ed2ff-72a9-4685-8eea-528c7d47ce8b" containerName="calico-kube-controllers" probeResult="failure" output="command \"/usr/bin/check-status -l\" timed out"

I don't think there is anything wrong with calico-node (screenshot attached).

If you execute the command below, you only get "Ready":

for i in {1..100} ; do kubectl exec calico-kube-controllers-b889487db-ptwld -n calico-system -- /usr/bin/check-status -l ; done
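
Timing each exec can also show whether the command ever gets close to the 10-second probe timeout, though kubectl exec goes through the API server rather than the kubelet's ExecSync path, so it is only a rough signal (a sketch, reusing the same pod):

for i in {1..20} ; do time kubectl exec calico-kube-controllers-b889487db-ptwld -n calico-system -- /usr/bin/check-status -l ; done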

Checked on 1.27.9: the first error popped up 20 minutes after AKS creation (screenshot attached).

There is a possibility that timeoutSeconds: 10 is simply too small a value, which is why these events pop up. But it does not look normal for the target to take more than 10s to respond, so maybe this is a bug. I am not from the product group, so I don't know.
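
If the command turns out to be slow rather than hung, raising the probe timeout would be the obvious workaround. A sketch of such a patch is below; note that these components are managed on AKS, so a manual change to the deployment may be reverted by reconciliation:

kubectl -n calico-system patch deploy calico-kube-controllers -p '{"spec":{"template":{"spec":{"containers":[{"name":"calico-kube-controllers","livenessProbe":{"timeoutSeconds":30},"readinessProbe":{"timeoutSeconds":30}}]}}}}'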

asakth22 commented 4 weeks ago

We're experiencing a similar problem on all of our AKS clusters. The Kubernetes and Calico versions are 1.29.0 and v3.26.3 respectively.

DevonK3 commented 2 weeks ago

We are experiencing the same issue as well. AKS: v1.28.5, Calico image: mcr.microsoft.com/oss/calico/kube-controllers:v3.24.6

aydosman commented 1 week ago

We are affected by this issue too, observed on multiple clusters.

Example:
Kubernetes version: 1.27.9
Architecture: amd64
Operating System: Linux
Ubuntu Version: 2204gen2containerd-202403.25.0
Image: mcr.microsoft.com/oss/calico/kube-controllers:v3.24.6

Tadcas commented 3 days ago

The same issue with AKS versions 1.29.2 and 1.29.4. Both use the mcr.microsoft.com/oss/calico/kube-controllers:v3.26.3 image. Node image version: AKSUbuntu-2204gen2containerd-202403.25.0