Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] ContainerD deadlock present in 1.7.15, fixed in 1.7.17 or newer #4426

Closed ben-childs-docusign closed 1 month ago

ben-childs-docusign commented 2 months ago

Describe the bug

We are seeing our AKS nodes running 1.29.5 go into a NotReady state. Looking at the logs, it appears that containerd is hanging and becoming unresponsive.

There are 2 deadlock bugs fixed in containerd 1.7.16 and 1.7.17:
https://github.com/containerd/ttrpc/pull/168
https://github.com/containerd/nri/pull/79

When can we expect containerd to be upgraded to 1.7.17 or newer to address these deadlock issues?
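For anyone wanting to check which nodes are still on an affected build, here is a minimal sketch. It assumes the runtime string reported in each node's `status.nodeInfo.containerRuntimeVersion` looks like `containerd://1.7.15-1`, and it takes 1.7.17 as the first 1.7.x release carrying both fixes, as stated above. The function names are made up for illustration, and other release lines (for example 1.6.x) are deliberately ignored.

```python
import json
import re
import subprocess

# First 1.7.x release with both deadlock fixes, per the report above.
FIXED_IN = (1, 7, 17)


def parse_runtime_version(runtime):
    """Turn 'containerd://1.7.15-1' into (1, 7, 15); None if not containerd."""
    match = re.match(r"containerd://(\d+)\.(\d+)\.(\d+)", runtime)
    return tuple(int(part) for part in match.groups()) if match else None


def nodes_missing_fixes(node_list):
    """Return (node name, runtime string) for 1.7.x nodes older than FIXED_IN.

    `node_list` is the parsed JSON from `kubectl get nodes -o json`.
    """
    affected = []
    for node in node_list.get("items", []):
        runtime = node["status"]["nodeInfo"]["containerRuntimeVersion"]
        version = parse_runtime_version(runtime)
        # Only the 1.7 line is checked; other lines have different fix versions.
        if version is not None and version[:2] == (1, 7) and version < FIXED_IN:
            affected.append((node["metadata"]["name"], runtime))
    return affected


def load_nodes_from_kubectl():
    """Fetch the node list with kubectl (needs kubectl and cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

Calling `nodes_missing_fixes(load_nodes_from_kubectl())` against a cluster would list the nodes still running a 1.7.x build older than 1.7.17.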

To Reproduce

We see this issue most reliably when we enable Istio native sidecars (https://learn.microsoft.com/en-us/azure/aks/istio-native-sidecar) on our test cluster, where a large number of cron jobs run various tests. This is blocking us from adopting Istio native sidecars in any production environment.

Expected behavior

Our cluster nodes remain in a ready state

UtheMan commented 1 month ago

We are working on bumping containerd to the 1.7.20 patch version. It will be available with one of the upcoming node image versions. I will share an update in this thread once the rollout starts. Thank you for bringing this up.

UtheMan commented 1 month ago

A new node image version with containerd 1.7.20 is now being released: 202407.29.0. You can track the progress of the release here (AKS Node Images tab on the left side). It will take a couple of weeks before this version reaches all regions. Closing the issue for now - feel free to reopen as needed.
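To tell whether a given node has picked up that image, one option is to compare its node image version against 202407.29.0. The sketch below is only an illustration: it assumes the version string ends in a `YYYYMM.DD.N` release (for example `AKSUbuntu-2204gen2containerd-202407.29.0`, the kind of value AKS exposes in the `kubernetes.azure.com/node-image-version` node label), and the function names are hypothetical. It applies to the Ubuntu image line discussed here, not to other image families.

```python
import re

# First node image release reported above to ship containerd 1.7.20.
FIRST_UPDATED_IMAGE = (202407, 29, 0)


def image_release(node_image_version):
    """Extract the trailing YYYYMM.DD.N release as a tuple of ints.

    Returns None when the string does not end in that pattern (assumed format).
    """
    match = re.search(r"(\d{6})\.(\d{1,2})\.(\d+)$", node_image_version)
    return tuple(int(part) for part in match.groups()) if match else None


def has_updated_containerd(node_image_version):
    """True if the image release is 202407.29.0 or later."""
    release = image_release(node_image_version)
    if release is None:
        raise ValueError(f"unrecognized node image version: {node_image_version!r}")
    return release >= FIRST_UPDATED_IMAGE
```

The tuple comparison orders releases by year-month first, then day, then the final counter, so `202407.03.0` sorts before `202407.29.0`.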

ben-childs-docusign commented 1 month ago

Thank you, we are testing the fixes now. FYI, we also tested the Azure Linux image, which has containerd 1.6.20. That version also has a deadlock bug, which was fixed in 1.6.25: https://github.com/containerd/containerd/pull/9210

Edit: Actually, the latest Azure Linux images have containerd 1.6.26, so we are continuing to test with Azure Linux.

ben-childs-docusign commented 1 month ago

@UtheMan

Unfortunately, it looks like the deadlock issue is still happening for us even with the new version of containerd. We will continue investigating.