noxwuan opened 1 month ago
We are also seeing similar behavior using Strimzi Kafka and the Redis Helm chart.
It appears that the pods are ready to go, yet they aren't marked ready in Kubernetes until 5-10 minutes later. The Kubernetes event logs show no failing probes or anything else concerning. These pods have been live in multiple clusters for years and hadn't had any issues until this week.
We have seen this on AKS Kubernetes 1.28.5 and 1.29.2 using Ubuntu 22.04.4 on kernel 5.15.0-1059-azure. I've noticed this correlates with a node upgrade, but I can't pinpoint the exact cause.
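To put a number on that delay, one sketch is to diff the container's `startedAt` against the time the `Ready` condition flipped. This assumes `kubectl` access to the affected cluster; the snippet below runs the jq extraction against a saved pod manifest with fabricated timestamps so the arithmetic is visible:

```shell
# Fabricated sample of what `kubectl get pod <redis-pod> -o json` returns,
# illustrating an 8-minute gap between container start and Ready.
cat > /tmp/pod.json <<'EOF'
{"status":{"conditions":[{"type":"Ready","lastTransitionTime":"2024-04-10T10:08:00Z"}],
 "containerStatuses":[{"state":{"running":{"startedAt":"2024-04-10T10:00:00Z"}}}]}}
EOF

started=$(jq -r '.status.containerStatuses[0].state.running.startedAt' /tmp/pod.json)
ready=$(jq -r '.status.conditions[] | select(.type=="Ready").lastTransitionTime' /tmp/pod.json)
# GNU date: convert both RFC 3339 timestamps to epoch seconds and diff them.
echo "ready delay: $(( $(date -ud "$ready" +%s) - $(date -ud "$started" +%s) ))s"
# prints: ready delay: 480s
```

Running the same two jq paths against a live `kubectl get pod ... -o json` would show whether the 5-10 minutes sits entirely between container start and the Ready transition.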
There is a containerd regression that seems to be related. I wonder, is your node on containerd 1.7.15?
It appears both of the clusters I've seen the issue on are on containerd 1.7.14-1.
If you're referring to https://github.com/containerd/containerd/issues/10036, it may be related to that version. Interesting.
Looks like they are pushing out the new containerd version soon. Hopefully sooner rather than later.
Yes, it looks like our nodes are also on containerd://1.7.14-1, while the "good" cluster is on the outdated containerd://1.7.1+azure-1.
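For anyone else checking their fleet: each node reports its runtime in `.status.nodeInfo.containerRuntimeVersion`. A minimal sketch, run here against fabricated sample output so the filter is testable; against a live cluster you'd feed it the real jsonpath output shown in the comment:

```shell
# Fabricated sample of the two columns you'd get from:
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'
cat > nodes.txt <<'EOF'
aks-npa-1 containerd://1.7.14-1
aks-npb-1 containerd://1.7.1+azure-1
EOF

# Flag anything on the 1.7.14 build discussed above.
awk '$2 ~ /1\.7\.14/ {print $1, "is on affected", $2}' nodes.txt
# prints: aks-npa-1 is on affected containerd://1.7.14-1
```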
...and finally, it's working. After updating AKS (image version AKSUbuntu-2204gen2containerd-202404.09.0) my probes are good. These warnings drove me nuts :D
Thanks to @Aaron-ML + @ryanzhang-oss for your help!
Action required from @Azure/aks-pm
We've recently upgraded our test environment on Azure Kubernetes Service (AKS) to version 1.29.2 and have started encountering an intermittent issue with liveness probes in our Redis setup (Bitnami Helm chart).
Troubleshooting Done:
We're still trying to track down the root cause, as there's no apparent system load or network latency that would be disrupting the liveness checks (or maybe there is a brief drop for a single call, but we were not able to reproduce it by running the same checks in parallel in a shell).
So ~99% of all liveness probe runs are good and ~1% fail, without any indication of network load. A parallel setup with the exact same config on 1.26.3 (EOL) runs without any problems. Has anyone else encountered this post-upgrade?
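One way to quantify that ~1% rate from inside the pod is a loop around the probe command. A minimal sketch: `probe_rate` is a hypothetical helper, and the `redis-cli ping` line is an assumption about what the Bitnami chart's liveness check roughly does (check the chart's actual health script):

```shell
# probe_rate: run a probe command N times and report how many runs failed.
probe_rate() {
  local cmd=$1 total=$2 fails=0
  for _ in $(seq "$total"); do
    $cmd >/dev/null 2>&1 || fails=$((fails + 1))
  done
  echo "$fails/$total checks failed"
}

# Against the live pod (run inside the Redis container via kubectl exec),
# approximating the chart's liveness check:
#   probe_rate "timeout 1 redis-cli -h 127.0.0.1 ping" 1000
```

At ~1% you'd expect around 10 failures in 1000 runs; if this in-container loop comes back clean while kubelet-driven probes still fail, that would point at the runtime/exec path rather than Redis itself.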