awslabs / amazon-eks-ami

Packer configuration for building a custom EKS AMI
https://awslabs.github.io/amazon-eks-ami/
MIT No Attribution

Containerd sudden restart stops pods from initializing #1716

Open Nightcro opened 4 months ago

Nightcro commented 4 months ago

What happened: Containerd sometimes stops responding and systemd restarts the containerd service. When this happens, containers that should be starting are sometimes stuck, and kubelet receives the following error: Mar 08 09:26:47 Error: error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer

What you expected to happen: The containers start properly.

How to reproduce it (as minimally and precisely as possible): It happens intermittently; containerd stops working and systemd restarts the service.

Mar 08 09:26:39 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:39.009195045Z" level=info msg="StartContainer for \"d4e5a065aa2b82e9e55d99a1e18ebc21612b6a47b9d88a2b78c254dcb88e305f\" returns successfully"
Mar 08 09:26:40 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:40.558871785Z" level=info msg="TaskExit event container_id:\"42559dbacef4a6284a559cd375e07034d5acced691d0c5571a24a5be16613d4f\" id:\"42559dbacef4a6284a559cd375e07034d5acced691d0c5571a24a5be16613d4f\" pid:7044 exited_at:{seconds:1709889983 nanos:110845885}"
Mar 08 09:26:40 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:40.559033340Z" level=info msg="Ensure that container 42559dbacef4a6284a559cd375e07034d5acced691d0c5571a24a5be16613d4f in task-service has been cleanup successfully"
Mar 08 09:26:40 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:40.585191830Z" level=info msg="ImageCreate event name:\"nvcr.io/nvidia/k8s-device-plugin:v0.14.4\" labels:{key:\"io.cri-containerd.image\" value:\"managed\"}"
Mar 08 09:26:40 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:40.587087543Z" level=info msg="stop pulling image nvcr.io/nvidia/k8s-device-plugin:v0.14.4: active requests=0, bytes read=122703857"
Mar 08 09:26:40 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:40.589296373Z" level=info msg="ImageCreate event name:\"sha256:0745b508898e2aa68f29a3c7f21023d03feace165b2430bc2297d250e65009e0\" labels:{key:\"io.cri-containerd.image\" value:\"managed\"}"
Mar 08 09:26:40 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:40.593900328Z" level=info msg="ImageUpdate event name:\"nvcr.io/nvidia/k8s-device-plugin:v0.14.4\" labels:{key:\"io.cri-containerd.image\" value:\"managed\"}"
Mar 08 09:26:40 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:40.597983400Z" level=info msg="ImageCreate event name:\"nvcr.io/nvidia/k8s-device-plugin@sha256:2388c1f792daf3e810a6b43cdf709047183b50f5ec3ed476fae6aa0a07e68acc\" labels:{key:\"io.cri-containerd.image\" value:\"managed\"}"
Mar 08 09:26:40 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:40.600021734Z" level=info msg="Pulled image \"nvcr.io/nvidia/k8s-device-plugin:v0.14.4\" with image id \"sha256:0745b508898e2aa68f29a3c7f21023d03feace165b2430bc2297d250e65009e0\", repo tag \"nvcr.io/nvidia/k8s-device-plugin:v0.14.4\", repo digest \"nvcr.io/nv
Mar 08 09:26:40 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:40.600066334Z" level=info msg="PullImage \"nvcr.io/nvidia/k8s-device-plugin:v0.14.4\" returns image reference \"sha256:0745b508898e2aa68f29a3c7f21023d03feace165b2430bc2297d250e65009e0\""
Mar 08 09:26:40 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:40.602778083Z" level=info msg="CreateContainer within sandbox \"987aa7badab0e155ce1eedb86b585c612c500069fd3bf035b3ce56733381abf4\" for container &ContainerMetadata{Name:config-manager-init,Attempt:0,}"
Mar 08 09:26:40 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:40.607551842Z" level=info msg="shim reaped" error="<nil>" id=c40672482a90ec7bb1a4565f38e67d688f31c9b112a74666ca2bdf9d99b7b0fd namespace=k8s.io
Mar 08 09:26:40 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:40.630196168Z" level=info msg="CreateContainer within sandbox \"987aa7badab0e155ce1eedb86b585c612c500069fd3bf035b3ce56733381abf4\" for &ContainerMetadata{Name:config-manager-init,Attempt:0,} returns container id \"f671e12d5dfd26abca6b4d9c4e8a20edb9146aa6f203d
Mar 08 09:26:40 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:40.630610838Z" level=info msg="StartContainer for \"f671e12d5dfd26abca6b4d9c4e8a20edb9146aa6f203ddb12206c699ad0079a9\""
Mar 08 09:26:40 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:40.631072744Z" level=warning msg="\"io.containerd.runtime.v1.linux\" is deprecated since containerd v1.4 and will be removed in containerd v2.0, use \"io.containerd.runc.v2\" instead"
Mar 08 09:26:40 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:40.631776013Z" level=info msg="shim containerd-shim started" address="unix:///run/containerd/s/d7f0c28ba727aff56afd7a9a678052a171b6115629aaca72791e5ddb575b984b" debug=false error="<nil>" id=f671e12d5dfd26abca6b4d9c4e8a20edb9146aa6f203ddb12206c699ad0079a9 name
Mar 08 09:26:40 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:40.922054030Z" level=info msg="PullImage \"nvcr.io/nvidia/k8s/dcgm-exporter:3.3.0-3.2.0-ubuntu22.04\""
Mar 08 09:26:41 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:41.559075071Z" level=info msg="TaskExit event container_id:\"6498264b55ff945fd6798c81e9eeb1f5b6f61532d54a2fbf53f872262ed00311\" id:\"6498264b55ff945fd6798c81e9eeb1f5b6f61532d54a2fbf53f872262ed00311\" pid:7032 exited_at:{seconds:1709889983 nanos:132245534}"
Mar 08 09:26:41 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:41.559250077Z" level=info msg="Ensure that container 6498264b55ff945fd6798c81e9eeb1f5b6f61532d54a2fbf53f872262ed00311 in task-service has been cleanup successfully"
Mar 08 09:26:41 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:41.935339729Z" level=info msg="CreateContainer within sandbox \"df0a703ed60bddda5dbd2294fc4a0fbc085ee8fe6545936e103baa31e2617743\" for container &ContainerMetadata{Name:config-manager-init,Attempt:0,}"
Mar 08 09:26:42 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:42.668431383Z" level=info msg="CreateContainer within sandbox \"df0a703ed60bddda5dbd2294fc4a0fbc085ee8fe6545936e103baa31e2617743\" for &ContainerMetadata{Name:config-manager-init,Attempt:0,} returns container id \"0013adbc3ff0a8571a7e4e66d06b141d7e014b8a096b8
Mar 08 09:26:42 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:42.668960325Z" level=info msg="StartContainer for \"0013adbc3ff0a8571a7e4e66d06b141d7e014b8a096b82ba5cfcb7a0a32a0ea7\""
Mar 08 09:26:42 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:42.669480408Z" level=warning msg="\"io.containerd.runtime.v1.linux\" is deprecated since containerd v1.4 and will be removed in containerd v2.0, use \"io.containerd.runc.v2\" instead"
Mar 08 09:26:42 i-node.eu-west-1.compute.internal containerd[3730]: time="2024-03-08T09:26:42.670233245Z" level=info msg="shim containerd-shim started" address="unix:///run/containerd/s/1d217093a15d1d5d3f492f42d83dd9a75cb834ebb94fdcf69e47d62ecf3e3e40" debug=false error="<nil>" id=0013adbc3ff0a8571a7e4e66d06b141d7e014b8a096b82ba5cfcb7a0a32a0ea7 name
Mar 08 09:26:52 i-node.eu-west-1.compute.internal systemd[1]: containerd.service holdoff time over, scheduling restart.
Mar 08 09:26:52 i-node.eu-west-1.compute.internal systemd[1]: Stopped containerd container runtime.
Mar 08 09:26:52 i-node.eu-west-1.compute.internal systemd[1]: Starting containerd container runtime...
Mar 08 09:26:52 i-node.eu-west-1.compute.internal containerd[8234]: time="2024-03-08T09:26:52Z" level=warning msg="containerd config version `1` has been deprecated and will be removed in containerd v2.0, please switch to version `2`, see https://github.com/containerd/containerd/blob/main/docs/PLUGINS.md#version-header"
Mar 08 09:26:52 i-node.eu-west-1.compute.internal containerd[8234]: time="2024-03-08T09:26:52.429617534Z" level=info msg="starting containerd" revision=64b8a811b07ba6288238eefc14d898ee0b5b99ba version=1.7.11
Mar 08 09:26:52 i-node.eu-west-1.compute.internal containerd[8234]: time="2024-03-08T09:26:52.449569749Z" level=info msg="loading plugin \"io.containerd.warning.v1.deprecations\"..." type=io.containerd.warning.v1
Mar 08 09:26:52 i-node.eu-west-1.compute.internal containerd[8234]: time="2024-03-08T09:26:52.449600938Z" level=info msg="loading plugin \"io.containerd.snapshotter.v1.aufs\"..." type=io.containerd.snapshotter.v1

Anything else we need to know?: I have gpu-operator-v23.9.1 set up. I have looked through the available journalctl and pod logs, but found nothing relevant to why containerd stops working and needs a restart.
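Since the failure is intermittent and the existing logs show nothing at the moment of the crash, one way to capture more evidence before the next occurrence is to raise containerd's log verbosity. A sketch, assuming the default config path used by the EKS AMI (`/etc/containerd/config.toml`) and containerd's standard `[debug]` section:

```shell
#!/bin/sh
# Sketch: enable debug-level logging in containerd's config.
# CONFIG is an assumption based on the default EKS AMI layout; adjust if yours differs.
CONFIG="${CONFIG:-/etc/containerd/config.toml}"

# Append a [debug] section if one is not already present.
if [ -f "$CONFIG" ] && ! grep -q '^\[debug\]' "$CONFIG"; then
  printf '\n[debug]\n  level = "debug"\n' >> "$CONFIG"
  echo "debug logging enabled in $CONFIG; restart containerd to apply"
else
  echo "no change made (missing file or [debug] already present)"
fi
```

After editing, `systemctl restart containerd` is needed for the new level to take effect; debug logging is noisy, so it is best reverted once the crash has been captured.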

Environment:

cartermckinnon commented 4 months ago

It's odd that we're seeing holdoff time over, scheduling restart from systemd but no logs from the actual containerd crash. How often does this happen?
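To help answer this, the following commands can establish how often systemd is restarting containerd and whether the daemon is leaving any crash evidence behind. A sketch that degrades gracefully on hosts where systemd tooling is unavailable:

```shell
#!/bin/sh
# Gather evidence around unexpected containerd restarts on a node.
diagnose_containerd() {
  # Bail out politely if systemd is not running here (e.g. inside a container).
  if ! command -v systemctl >/dev/null 2>&1 || [ ! -d /run/systemd/system ]; then
    echo "systemd not available; run this directly on the affected node"
    return 0
  fi

  # How many times systemd has restarted containerd since boot.
  systemctl show containerd --property=NRestarts || true

  # Messages around the restarts, including any panic/fatal line the
  # daemon may have printed just before dying.
  journalctl -u containerd --no-pager 2>/dev/null \
    | grep -iE 'panic|fatal|holdoff|scheduling restart' | tail -n 20

  # Core dumps left behind by a crashing containerd, if coredumpctl exists.
  if command -v coredumpctl >/dev/null 2>&1; then
    coredumpctl list containerd --no-pager 2>/dev/null || true
  fi
  return 0
}

diagnose_containerd
```

A steadily climbing `NRestarts` with no panic in the journal would suggest the daemon is being killed externally (e.g. by the OOM killer, visible via `journalctl -k`) rather than crashing on its own.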

Nightcro commented 4 months ago

It seems to be quite random: at times we might have 2 or 3 nodes hitting this issue within the span of 10 minutes, then maybe 1 node after 2 hours, and another node 5 hours later. I have not been able to pinpoint what is wrong, and the fact that I cannot find anything in the logs seems really odd. At the moment we don't have any other workload to test against; these GPU nodes are the only ones that scale up and down for us. I am not sure if it is a combination of gpu-operator and the nodes, or if it could be happening on other nodes as well.

Nightcro commented 4 months ago

I'll check other healthy nodes and see if I can find the same behaviour in containerd's journalctl output.

tl-alex-nicot commented 2 months ago

Did you find anything out?

Nightcro commented 2 months ago

In the end, I disabled the toolkit operator and it solved my issue.
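For anyone hitting the same symptom: the container toolkit component can be turned off through the gpu-operator Helm chart. A hedged sketch of the values change, assuming the chart's standard `toolkit.enabled` flag and that the driver and toolkit are already baked into the node image (as on the EKS-optimized GPU AMI):

```yaml
# values.yaml fragment for the NVIDIA gpu-operator Helm chart.
# toolkit.enabled controls the container-toolkit DaemonSet; setting it to
# false stops the operator from managing the toolkit on nodes where the
# runtime stack is already provided by the AMI.
toolkit:
  enabled: false
```

Applied with something like `helm upgrade gpu-operator nvidia/gpu-operator --reuse-values -f values.yaml` (release and repo names here are assumptions; adjust to your install).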