chiragjn opened this issue 2 months ago
Hi,
we are seeing the same issue with the `gpu-operator-validator` daemonset. We found in the log of the `nvidia-container-toolkit-daemonset` that it modified `/etc/containerd/config.toml` and then sent a `SIGHUP` to `containerd`:
nvidia-container-toolkit-daemonset-d9dr9 nvidia-container-toolkit-ctr time="2024-10-09T16:33:13Z" level=info msg="Sending SIGHUP signal to containerd"
nvidia-container-toolkit-daemonset-d9dr9 nvidia-container-toolkit-ctr time="2024-10-09T16:33:13Z" level=info msg="Successfully signaled containerd"
nvidia-container-toolkit-daemonset-d9dr9 nvidia-container-toolkit-ctr time="2024-10-09T16:33:13Z" level=info msg="Completed 'setup' for containerd"
nvidia-container-toolkit-daemonset-d9dr9 nvidia-container-toolkit-ctr time="2024-10-09T16:33:13Z" level=info msg="Waiting for signal"
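A rough sketch of how the patch and the subsequent restarts could be checked directly on an affected node (assuming the default config path and that journald keeps the containerd unit logs):

```sh
# Show the runtime entries the toolkit patched into the containerd config
grep -n -A 5 'nvidia' /etc/containerd/config.toml

# Correlate the toolkit's SIGHUP with containerd starts/restarts recorded for the unit
journalctl -u containerd --since '-2h' | grep -Ei 'sighup|starting containerd|scheduling restart'
```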
Then, in the middle of the creation of one of the init containers of the `gpu-operator-validator` daemonset, the kubelet fails to communicate with the `containerd` socket because `containerd` restarts.
After a bunch of `transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused` errors from the kubelet, we see the following in our journald log:
Oct 10 10:46:01 ip-10-3-101-224.ec2.internal systemd[1]: containerd.service holdoff time over, scheduling restart.
Oct 10 10:46:01 ip-10-3-101-224.ec2.internal systemd[1]: Stopping Kubernetes Kubelet...
Oct 10 10:46:01 ip-10-3-101-224.ec2.internal systemd[1]: Stopped Kubernetes Kubelet.
Oct 10 10:46:01 ip-10-3-101-224.ec2.internal systemd[1]: Stopped containerd container runtime.
Oct 10 10:46:01 ip-10-3-101-224.ec2.internal systemd[1]: Starting Load NVIDIA kernel modules...
This looks like systemd also decides to restart `containerd` after it should already have been restarted by the `SIGHUP`. We are unsure why this happens.
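For what it's worth, `holdoff time over, scheduling restart` is the message systemd prints when a unit has exited and its `Restart=` policy fires after the restart delay, so containerd appears to have actually exited rather than just reloading, and kubelet is presumably stopped because its unit depends on `containerd.service`. A sketch of commands that could confirm this on a node (assuming standard unit names):

```sh
# How containerd is configured to restart, and what its reload/start commands are
systemctl show containerd.service -p Restart -p RestartUSec -p ExecReload -p ExecStart

# Which units are torn down with containerd (kubelet shows up here if it is bound to it)
systemctl list-dependencies --reverse containerd.service

# Why containerd exited around the time the toolkit sent the SIGHUP
journalctl -u containerd.service -u kubelet.service --since '-2h' --no-pager
```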
The stuck pod shows `Warning Failed 24m kubelet Error: error reading from server: EOF` in its events, and the pod status shows the following for the `plugin-validation` init container:
State: Waiting
Ready: False
Restart Count: 0
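In case it helps, the stuck init container should also be visible at the runtime level on the node; a minimal sketch, assuming `crictl` is available there and the container name contains plugin-validation:

```sh
# List the validator's containers in every state; the stuck one sits in Created
crictl ps -a --name plugin-validation

# Inspect the stuck container for timestamps and the last runtime error
crictl inspect <container-id>
```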
We are seeing this issue several times per day in our infrastructure, so if you have any ideas on how to debug this further, we should be able to reproduce it and provide more information.
Thanks in advance for any help :)
I am also experiencing something similar when attempting a test/dev deployment on K3d (which uses a K3s-cuda base image).
As part of the nvidia-container-toolkit container's installation of the toolkit onto the host, it sends a signal to restart containerd, which then cycles the entire cluster, since `containerd.service` is restarted at the node's system level.
If we disable the toolkit (`toolkit.enabled: false`) in the deployment and instead install the toolkit directly on the node, it no longer cycles the entire cluster and everything works fine.
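For reference, disabling the toolkit is just the documented chart value; roughly (the `nvidia` Helm repo alias and the namespace are assumptions, adjust to your install):

```sh
# Deploy the operator without the toolkit container, so nothing on the node
# rewrites /etc/containerd/config.toml or signals containerd; the NVIDIA
# Container Toolkit is instead installed on the node image itself.
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --set toolkit.enabled=false
```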
Original context and journalctl logs here: https://github.com/containerd/containerd/issues/10437
As we know, by default nvidia-container-toolkit sends a SIGHUP to containerd so that the patched containerd config takes effect. Unfortunately, because the gpu-operator schedules its daemonsets all at once, we have noticed our GPU discovery and nvidia device plugin pods get stuck in Pending forever. This is primarily due to the `config-manager-init` container getting stuck in `Created` and never transitioning to the `Running` state because of the containerd restart.
Timeline of race condition:
Today the only way for us to recover is to manually delete the stuck daemonset pods.
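Concretely, the manual recovery is just deleting the Pending operand pods so their daemonsets recreate them against the already-restarted containerd; a sketch (the namespace is an assumption, adjust to your install):

```sh
# Find the operand pods that never left Pending after the containerd restart
kubectl -n gpu-operator get pods --field-selector=status.phase=Pending

# Delete them; the owning daemonsets recreate them and they start normally
kubectl -n gpu-operator delete pods --field-selector=status.phase=Pending
```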
While I understand that at its core this is a containerd issue, it has become so troublesome that we are looking into entrypoint and node-label hacks. We are willing to take a solution that allows us to modify the entrypoint configmaps of the daemonsets managed by ClusterPolicy.
I think something similar (though with a different effect) was discovered here: https://github.com/NVIDIA/gpu-operator/commit/963b8dc87ed54632a7345c1fcfe842f4b7449565 and was fixed with a sleep.
P.S. I am aware the container-toolkit has an option to not restart containerd, but we need the restart for correct toolkit injection behavior.
cc: @klueska