Nightcro opened this issue 4 months ago
It's odd that we're seeing `holdoff time over, scheduling restart` from systemd but no logs from the actual containerd crash. How often does this happen?
It seems to be quite random. At times we might have 2-3 nodes hit this issue within the span of 10 minutes, then one node after 2 hours and another after 5 hours; I have not been able to pinpoint what is wrong. The fact that I cannot find anything in the logs seems really odd. At the moment we don't have any other workload to test against: these GPU nodes are the only ones that scale up and down for us. I am not sure if it is a combination of gpu-operator and the nodes, or if it could be happening on other nodes as well.
I'll check other, healthy nodes and see whether the same behaviour shows up in containerd's journalctl output.
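For anyone wanting to run the same comparison, a minimal sketch of that kind of check (assuming containerd runs as the standard `containerd` systemd unit):

```sh
# Dump recent containerd journal entries on a node for comparison.
journalctl -u containerd --since "2 hours ago" --no-pager

# Look specifically for restarts scheduled by systemd.
journalctl -u containerd --no-pager | grep -i "holdoff time over"
```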
Did you find anything out?
In the end, I disabled the toolkit operator and it solved my issue.
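For reference, assuming gpu-operator was installed via its Helm chart, the container toolkit component is typically turned off through the `toolkit.enabled` chart value. A sketch, where the release name and namespace are placeholders for this install:

```sh
# Disable the container-toolkit component of gpu-operator.
# "gpu-operator" release name and namespace are assumptions.
helm upgrade gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --set toolkit.enabled=false
```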
What happened: Containerd sometimes stops responding and systemd initiates a restart of the containerd service. When this happens, containers that should start running are sometimes stuck, and kubelet receives the following error:
```
Mar 08 09:26:47 Error: error reading from server: read unix @->/run/containerd/containerd.sock: read: connection reset by peer
```
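When the socket resets like this, one quick way to check whether containerd is answering again is to hit the same socket directly. A sketch using standard containerd tooling (`ctr` ships with containerd; `crictl` has to be installed separately):

```sh
# Ask containerd for its version over the socket from the error above.
ctr --address /run/containerd/containerd.sock version

# Query runtime status through the CRI endpoint.
crictl --runtime-endpoint unix:///run/containerd/containerd.sock info
```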
What you expected to happen: The containers start properly.
How to reproduce it (as minimally and precisely as possible): It just happens intermittently; containerd stops working and systemd restarts the service.
Anything else we need to know?: I have set up gpu-operator-v23.9.1. I have looked through the available logs in journalctl and the pod logs, but found nothing relevant to why containerd stops working and needs a restart.
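A sketch of the pod-log side of that sweep; the `gpu-operator` namespace and the daemonset label selector are assumptions about this particular install:

```sh
# List gpu-operator pods, then pull recent logs from the toolkit daemonset.
kubectl -n gpu-operator get pods
kubectl -n gpu-operator logs -l app=nvidia-container-toolkit-daemonset --tail=200
```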
Environment:
Kernel (`uname -a`): `5.10.192-183.736.amzn2.x86_64 #1 SMP Wed Sep 6 21:15:41 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux`