
[Question] Liveness Probes Failing on AKS 1.29.2 with Bitnami Redis Helm Chart #4219

Open noxwuan opened 1 month ago

noxwuan commented 1 month ago

We've recently upgraded our test environment on Azure Kubernetes Service (AKS) to version 1.29.2 and have started encountering an intermittent issue with liveness probes on our Redis setup (Bitnami Helm chart).

Troubleshooting Done:

  1. We attempted to replicate the problem by running redis-cli ping directly in a loop on all three pods/containers (from a shell in the same pod and environment as the probe; see the sketch after this list). There were no signs of latency or spikes in response times. We also ran Redis's intrinsic latency tests while generating heavy load on all nodes, but those produced consistently stable values while the same random liveness probe failures continued.
  2. I have rebooted all nodes in the pool and even updated them to a new node image (OS release moved from March to April), but the behavior is unchanged.
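
For reference, the replication loop looked roughly like this (run from a shell inside the Redis container; the 1000ms threshold is an arbitrary placeholder):

```bash
# Run PING once per second in the same environment as the probe and log anything slow.
while true; do
  start=$(date +%s%N)
  redis-cli ping >/dev/null
  elapsed_ms=$(( ($(date +%s%N) - start) / 1000000 ))
  [ "$elapsed_ms" -gt 1000 ] && echo "$(date -Is) slow ping: ${elapsed_ms}ms"
  sleep 1
done
```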

We're still trying to track down the root cause, as there's no apparent system load or network latency that would disrupt the liveness checks (or perhaps there is a brief drop for a single call, but we were not able to reproduce it by running the same checks in parallel from a shell).

So roughly 99% of all liveness probe runs succeed and about 1% fail, without any indication of network load. A parallel setup with the exact same configuration on 1.26.3 (EOL) runs without any problems. Has anyone else encountered this after upgrading?
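
For anyone trying to quantify this on their own cluster: the failed probes do surface as Unhealthy events on the pods, e.g. (namespace is a placeholder):

```bash
# Liveness/readiness probe failures are recorded as "Unhealthy" pod events.
kubectl get events -n redis \
  --field-selector reason=Unhealthy \
  --sort-by=.lastTimestamp
```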

Aaron-ML commented 1 month ago

We are also seeing similar behavior with Strimzi Kafka and the Redis Helm chart.

The pods appear to be ready to go, yet they aren't marked Ready in Kubernetes until 5 to 10 minutes later. The Kubernetes event logs show no failing probes or anything else concerning. These pods have been live in multiple clusters for years and haven't had any issues until this week.
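
One way to see the gap is to compare the container start time with the Ready condition's transition time (pod name is a placeholder):

```bash
# When the container actually started...
kubectl get pod my-kafka-0 -o jsonpath='{.status.containerStatuses[0].state.running.startedAt}'; echo
# ...vs. when the pod was marked Ready.
kubectl get pod my-kafka-0 -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastTransitionTime}'; echo
```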

We have seen this on AKS Kubernetes 1.28.5 and 1.29.2 using Ubuntu 22.04.4 on kernel 5.15.0-1059-azure. I've noticed this correlates with a node upgrade, but I can't pinpoint the issue.

ryanzhang-oss commented 1 month ago

There is a containerd regression that seems to be related; I wonder if your node is on containerd 1.17.5?
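
(The runtime version each node is on can be read off the node status:)

```bash
# List the container runtime version reported by every node.
kubectl get nodes -o custom-columns=NODE:.metadata.name,RUNTIME:.status.nodeInfo.containerRuntimeVersion
```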

Aaron-ML commented 1 month ago

> There is a containerd regression that seems to be related; I wonder if your node is on containerd 1.17.5?

Both of the clusters I've seen the issue on appear to be on 1.7.14-1.

If you're referring to https://github.com/containerd/containerd/issues/10036, it may be related to that version. Interesting.

Aaron-ML commented 1 month ago

It looks like they are pushing out the new containerd version soon. Hopefully sooner rather than later.

https://github.com/Azure/AKS/issues/4210

noxwuan commented 1 month ago

Yes, it looks like our nodes are also on containerd://1.7.14-1, and the "good" cluster is on the outdated containerd://1.7.1+azure-1.

noxwuan commented 1 month ago

...and finally, it's working. After updating AKS (node image version AKSUbuntu-2204gen2containerd-202404.09.0), my probes are healthy. These warnings drove me nuts :D
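
For anyone else hitting this: the update amounted to rolling the node pool to the latest node image, roughly like this (resource group, cluster, and pool names are placeholders):

```bash
# Check which node image the pool is currently on...
az aks nodepool show -g my-rg --cluster-name my-aks -n nodepool1 --query nodeImageVersion
# ...then upgrade only the node image, leaving the Kubernetes version unchanged.
az aks nodepool upgrade -g my-rg --cluster-name my-aks -n nodepool1 --node-image-only
```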

Thanks to @Aaron-ML + @ryanzhang-oss for your help!

microsoft-github-policy-service[bot] commented 1 day ago

Action required from @Azure/aks-pm