NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

containerd restart from nvidia-container-toolkit causes other daemonsets to get stuck #991

Open chiragjn opened 2 months ago

chiragjn commented 2 months ago

Original context and journalctl logs here: https://github.com/containerd/containerd/issues/10437

As we know, by default nvidia-container-toolkit sends a SIGHUP to containerd so that the patched containerd config takes effect. Unfortunately, because gpu-operator schedules all of its DaemonSets at once, we have noticed that our GPU discovery and nvidia device plugin pods get stuck in Pending forever. This is primarily because the config-manager-init container gets stuck in Created and never transitions to the Running state due to the containerd restart.
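
For reference, here is a rough sketch of what can be checked on an affected node to confirm this sequence. The paths and unit names below are assumptions based on a stock containerd setup and the default toolkit install location; adjust them for k3s/rke2 or custom layouts:

    # Confirm the nvidia runtime stanza the toolkit patched into the containerd config
    grep -A3 'runtimes.nvidia' /etc/containerd/config.toml
    # Look for the SIGHUP / restart around the time the DaemonSets were scheduled
    journalctl -u containerd --since "1 hour ago" --no-pager | grep -Ei 'SIGHUP|restart'
    # How many times systemd has restarted containerd on this node, and when it last came up
    systemctl show containerd -p NRestarts,ActiveEnterTimestamp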

Timeline of race condition:

Today the only way for us to recover is to manually delete the stuck daemonset pods.
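
Concretely, the manual recovery is roughly the following (a sketch; it assumes the operator's DaemonSets run in the gpu-operator namespace, adjust as needed):

    # Find DaemonSet pods stuck in Pending and force-delete them so the
    # DaemonSet controller recreates them against the restarted containerd.
    kubectl get pods -n gpu-operator --field-selector=status.phase=Pending
    kubectl delete pod -n gpu-operator <stuck-pod-name> --grace-period=0 --force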

While I understand that at its core this is a containerd issue, it has become so troublesome that we are looking into entrypoint and node label hacks. We are willing to accept a solution that allows us to modify the entrypoint ConfigMaps of the DaemonSets managed by ClusterPolicy.

I think something similar, though with a different effect, was discovered here: https://github.com/NVIDIA/gpu-operator/commit/963b8dc87ed54632a7345c1fcfe842f4b7449565 and was fixed with a sleep.

P.S. I am aware that container-toolkit has an option to not restart containerd, but we need the restart for correct toolkit injection behavior.

cc: @klueska

ekeih commented 1 month ago

Hi,

we are seeing the same issue with the gpu-operator-validator daemonset.

We found in the log of nvidia-container-toolkit-daemonset that it modified /etc/containerd/config.toml and then sent a SIGHUP to containerd:

nvidia-container-toolkit-daemonset-d9dr9 nvidia-container-toolkit-ctr time="2024-10-09T16:33:13Z" level=info msg="Sending SIGHUP signal to containerd"
nvidia-container-toolkit-daemonset-d9dr9 nvidia-container-toolkit-ctr time="2024-10-09T16:33:13Z" level=info msg="Successfully signaled containerd"
nvidia-container-toolkit-daemonset-d9dr9 nvidia-container-toolkit-ctr time="2024-10-09T16:33:13Z" level=info msg="Completed 'setup' for containerd"
nvidia-container-toolkit-daemonset-d9dr9 nvidia-container-toolkit-ctr time="2024-10-09T16:33:13Z" level=info msg="Waiting for signal"

Then, in the middle of creating one of the init containers of the gpu-operator-validator daemonset, the kubelet fails to communicate with the containerd socket because containerd restarts. After a bunch of "transport: Error while dialing: dial unix /run/containerd/containerd.sock: connect: connection refused" errors from the kubelet we see the following in our journald log:

Oct 10 10:46:01 ip-10-3-101-224.ec2.internal systemd[1]: containerd.service holdoff time over, scheduling restart.
Oct 10 10:46:01 ip-10-3-101-224.ec2.internal systemd[1]: Stopping Kubernetes Kubelet...
Oct 10 10:46:01 ip-10-3-101-224.ec2.internal systemd[1]: Stopped Kubernetes Kubelet.
Oct 10 10:46:01 ip-10-3-101-224.ec2.internal systemd[1]: Stopped containerd container runtime.
Oct 10 10:46:01 ip-10-3-101-224.ec2.internal systemd[1]: Starting Load NVIDIA kernel modules...

It looks like systemd also decides to restart containerd after it should already have been restarted by the SIGHUP. We are unsure why this happens.
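
If it helps, this is roughly how we have been trying to narrow down whether the restart comes from containerd's own signal handling or from systemd's restart logic (a sketch; unit names are taken from the journal above and the commands are run directly on the node):

    # What restart policy systemd has configured for containerd
    systemctl cat containerd | grep -i restart
    # How often systemd restarted it and how the last run ended
    systemctl show containerd -p NRestarts,Result
    # Which units get pulled down with containerd (the kubelet stop above suggests a dependency)
    systemctl list-dependencies --reverse containerd
    # Correlate the SIGHUP, the holdoff restart and the kubelet stop in one view
    journalctl -u containerd -u kubelet --since "1 hour ago" --no-pager | grep -Ei 'SIGHUP|holdoff|Stopp|Starting'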

The stuck pod shows "Warning Failed 24m kubelet Error: error reading from server: EOF" in its events, and the pod status shows the following for the plugin-validation init container:

    State:          Waiting
    Ready:          False
    Restart Count:  0

We are seeing this issue several times per day in our infrastructure, so if you have any ideas on how to debug this further, we should be able to reproduce it and provide more information.
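
For the next occurrence, this is the kind of information we plan to capture (a sketch; pod names, container IDs and the namespace are placeholders):

    # Events and status of the stuck pod
    kubectl describe pod -n gpu-operator <stuck-pod-name>
    kubectl get events -n gpu-operator --sort-by=.lastTimestamp
    # On the node: what containerd thinks the state of the init container is
    crictl ps -a | grep -i validator
    crictl inspect <container-id>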

Thanks in advance for any help :)

justinthelaw commented 1 week ago

I am also experiencing a similar thing when attempting a test/dev deployment on K3d (which uses a K3s-cuda base image).

As part of the nvidia-container-toolkit container installing the toolkit onto the host, it sends a signal to restart containerd, which then cycles the entire cluster, since containerd.service is restarted at the node's system level.

If we disable the toolkit (toolkit.enabled: false) in the deployment and instead install the toolkit directly on the node, then it no longer cycles the entire cluster and everything works fine.
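
For completeness, disabling the operator-managed toolkit looks roughly like this (a sketch assuming the chart was installed from the NVIDIA Helm repository; toolkit.enabled is the values path mentioned above):

    # Skip the toolkit DaemonSet entirely so the operator never signals containerd;
    # the toolkit must then already be present on the node (e.g. baked into the image
    # or installed on the host beforehand).
    helm upgrade --install gpu-operator nvidia/gpu-operator \
      -n gpu-operator --create-namespace \
      --set toolkit.enabled=false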