Closed ben-childs-docusign closed 1 month ago
We are working on bumping the containerd version to the .20 patch version. It will be available with one of the upcoming node image versions. I will share an update in this thread once the rollout starts. Thank you for bringing this up.
A new node image version with containerd 1.7.20 is now being released: 202407.29.0. You can track the progress of the release here (AKS Node Images tab on the left side). It will take a couple of weeks before this version reaches all regions. Closing the issue for now; feel free to re-open as needed.
Thank you, we are testing the fixes now. FYI, we also tested the Azure Linux image, which has containerd 1.6.20; that version also has a deadlock bug, which was fixed in 1.6.25: https://github.com/containerd/containerd/pull/9210
Edit: Actually, the latest Azure Linux images have containerd 1.6.26, so we are continuing to test with Azure Linux.
@UtheMan
Unfortunately, it looks like the deadlock issue is still happening for us even with the new version of containerd. We will continue investigating.
Describe the bug
We are seeing our AKS nodes running 1.29.5 go into a NotReady state. Looking at the logs, it appears that containerd is hanging and becoming unresponsive.
Two deadlock bugs were fixed in containerd 1.7.16 and 1.7.17: https://github.com/containerd/ttrpc/pull/168 and https://github.com/containerd/nri/pull/79
When can we expect containerd to be upgraded to 1.7.17 or newer to address these deadlock issues?
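For anyone following along, here is a minimal sketch of one way to tell whether a node already runs a containerd 1.7.x build at or above 1.7.17. It assumes the runtime string comes from the node's `status.nodeInfo.containerRuntimeVersion` field (for example `containerd://1.7.20-1`). The function names and the `MIN_FIXED` constant are hypothetical, chosen only for illustration, and the check says nothing about the 1.6.x line, which has its own fix versions.

```python
import re

# Assumption: the lowest 1.7.x release that includes both fixes linked above.
MIN_FIXED = (1, 7, 17)


def parse_containerd_version(runtime_version):
    """Turn 'containerd://1.7.20-1' into (1, 7, 20).

    Returns None when the string does not describe a containerd runtime.
    """
    match = re.match(r"^containerd://(\d+)\.(\d+)\.(\d+)", runtime_version)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def has_1_7_deadlock_fixes(runtime_version):
    """Return True/False for containerd 1.7.x, or None for anything else."""
    version = parse_containerd_version(runtime_version)
    if version is None or version[:2] != (1, 7):
        # Not containerd, or a different release line: this check does not apply.
        return None
    return version >= MIN_FIXED


if __name__ == "__main__":
    for sample in ("containerd://1.7.15-1", "containerd://1.7.20-1"):
        print(sample, has_1_7_deadlock_fixes(sample))
```

The runtime strings can be collected with `kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.containerRuntimeVersion}'` and passed to the function one at a time.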
To Reproduce
We see this issue most reliably when we enable Istio native sidecars (https://learn.microsoft.com/en-us/azure/aks/istio-native-sidecar) on our test cluster, where a large number of cron jobs run to execute various tests. This is blocking us from adopting Istio native sidecars in any production environment.
Expected behavior
Our cluster nodes remain in a Ready state.