Open win5923 opened 1 month ago
TLDR: This issue looks reproducible on both 1.28.5 and 1.27.9.
I created a new AKS cluster with 1.28.5 and checked that the mcr.microsoft.com/oss/calico/node and mcr.microsoft.com/oss/calico/kube-controllers images are both exactly 3.24.6.
The first error popped up 12 minutes after AKS creation:
KubeEvents
| where Namespace == "calico-system"
| where Message contains "check-status"
The container log won't give you anything. I checked the syslog, and it basically shows a timeout:
cat syslog | grep check-status
Apr 15 04:28:44 aks-agentpool-34739312-vmss000001 kubelet[2784]: E0415 04:28:44.075497 2784 remote_runtime.go:496] "ExecSync cmd from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" containerID="02cb51467976c311b30c8f62b8d35f5e90f5d4857805acd9c5d8a451297947f7" cmd=["/usr/bin/check-status","-l"]
Apr 15 04:28:44 aks-agentpool-34739312-vmss000001 kubelet[2784]: I0415 04:28:44.075547 2784 prober.go:107] "Probe failed" probeType="Liveness" pod="calico-system/calico-kube-controllers-b889487db-ptwld" podUID="c20ed2ff-72a9-4685-8eea-528c7d47ce8b" containerName="calico-kube-controllers" probeResult="failure" output="command \"/usr/bin/check-status -l\" timed out"
Apr 15 04:43:54 aks-agentpool-34739312-vmss000001 kubelet[2784]: E0415 04:43:54.076051 2784 remote_runtime.go:496] "ExecSync cmd from runtime service failed" err="rpc error: code = DeadlineExceeded desc = context deadline exceeded" containerID="02cb51467976c311b30c8f62b8d35f5e90f5d4857805acd9c5d8a451297947f7" cmd=["/usr/bin/check-status","-l"]
Apr 15 04:43:54 aks-agentpool-34739312-vmss000001 kubelet[2784]: I0415 04:43:54.076099 2784 prober.go:107] "Probe failed" probeType="Liveness" pod="calico-system/calico-kube-controllers-b889487db-ptwld" podUID="c20ed2ff-72a9-4685-8eea-528c7d47ce8b" containerName="calico-kube-controllers" probeResult="failure" output="command \"/usr/bin/check-status -l\" timed out"
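To pull only the failed probes out of the syslog, a sketch like the one below may help. It assumes the kubelet line format shown above (a syslog timestamp at the start of the line and a pod="..." field); probe_failures is a hypothetical helper name, not something shipped with kubelet or Calico.

```shell
# Read kubelet syslog lines on stdin and print "<timestamp> <namespace/pod>"
# for every line that reports a failed probe.
# Assumption: lines look like the kubelet entries quoted in this issue.
probe_failures() {
  grep '"Probe failed"' \
    | sed -E 's/^([A-Z][a-z]{2} +[0-9]+ [0-9:]{8}) .* pod="([^"]+)".*$/\1 \2/'
}
```

Usage would be `probe_failures < syslog`. For the four lines above it should leave one line per failed probe, which makes it easier to see how far apart the failures are.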
I don't think anything is wrong on calico-node.
If you run the command below, you only get "Ready":
for i in {1..100} ; do kubectl exec calico-kube-controllers-b889487db-ptwld -n calico-system -- /usr/bin/check-status -l ; done
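The loop above only shows the output, not how long each call takes. A rough way to see whether any single run gets close to the 10s probe timeout is sketched below. time_runs is a hypothetical helper, and the resolution is whole seconds because it relies on `date +%s`.

```shell
# Run a command N times and print the wall-clock duration (whole seconds)
# and exit code of each run. Runs in a subshell so no variables leak out.
# Usage: time_runs <count> <command> [args...]
time_runs() (
  runs=$1
  shift
  i=1
  while [ "$i" -le "$runs" ]; do
    start=$(date +%s)
    # Discard the command's own output; only timing and exit code matter here.
    "$@" > /dev/null 2>&1 && rc=0 || rc=$?
    end=$(date +%s)
    echo "run $i: $((end - start))s exit=$rc"
    i=$((i + 1))
  done
)
```

For example: `time_runs 100 kubectl exec calico-kube-controllers-b889487db-ptwld -n calico-system -- /usr/bin/check-status -l`. Any run that reports a duration near 10s would be consistent with the timeouts kubelet logs.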
Checked on 1.27.9: the first error popped up 20 minutes after AKS creation.
It is possible that timeoutSeconds: 10 is simply too small a value, which is why these events appear. But it does not look normal for the target to take more than 10s to respond.
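To confirm which timeout is actually configured on a cluster, the liveness probe setting can be read from the deployment. This is a sketch with assumptions: the deployment name calico-kube-controllers is inferred from the pod name above, the container is assumed to be the first one in the pod spec, and the KUBECTL override is a hypothetical convenience for pointing at a different binary.

```shell
# Print the liveness probe timeoutSeconds of the first container
# in a deployment.
# Usage: probe_timeout <namespace> <deployment>
# Assumption: the probe is defined on containers[0].
probe_timeout() {
  ${KUBECTL:-kubectl} get deployment "$2" -n "$1" \
    -o jsonpath='{.spec.template.spec.containers[0].livenessProbe.timeoutSeconds}'
}
```

For example: `probe_timeout calico-system calico-kube-controllers`.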
Maybe this is a bug. But I am not PG so I don't know.
We're experiencing a similar problem on all of our AKS clusters. The Kubernetes and Calico versions are 1.29.0 and v3.26.3, respectively.
We are experiencing the same issue as well. AKS: v1.28.5, Calico: image: mcr.microsoft.com/oss/calico/kube-controllers:v3.24.6
We are affected by this issue too, observed on multiple clusters.
Example: Kubernetes version: 1.27.9 Architecture: amd64 Operating System: Linux Ubuntu Version: 2204gen2containerd-202403.25.0 Image: mcr.microsoft.com/oss/calico/kube-controllers:v3.24.6
The same issue occurs with AKS versions 1.29.2 and 1.29.4. Both use the mcr.microsoft.com/oss/calico/kube-controllers:v3.26.3 image. Node image version: AKSUbuntu-2204gen2containerd-202403.25.0
Describe the bug
pod logs have no error messages.
Environment (please complete the following information):