NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

Upgrading gpu-operator on Rancher RKE2 results in nvidia-container-toolkit-daemonset failing to initialize #1099

Closed: nikito closed this issue 3 weeks ago

nikito commented 3 weeks ago

When upgrading to the latest gpu-operator, v24.9.0, the nvidia-container-toolkit-daemonset fails to initialize with the following error:

level=error msg="error running nvidia-toolkit: unable to determine runtime options: unable to load containerd config: failed to load config: failed to run command chroot [/host containerd config dump]: exit status 127"
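For context, exit status 127 from a chroot invocation generally means the target command was not found. The failing check is roughly equivalent to the following (a hedged sketch run from inside the toolkit container, where the host root filesystem is mounted at /host; the toolkit's exact invocation may differ):

```sh
# Approximate the failing check: chroot into the host filesystem (mounted at
# /host in the toolkit container) and dump the containerd configuration.
chroot /host containerd config dump

# Exit status 127 means "containerd" was not found on PATH inside the chroot.
# On RKE2 the containerd binary ships under a Rancher-specific location
# (e.g. /var/lib/rancher/rke2/bin), so a plain PATH lookup against the host
# root can fail even though containerd is running.
echo $?
```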

If I roll back to v24.6.2, everything initializes correctly.

nikito commented 3 weeks ago

Closing the issue. I'm not sure what changed, but I uninstalled the operator, then reinstalled v24.9.0 from scratch, and everything appears to be working now.
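For anyone in the same state, the clean reinstall described above looks roughly like this with Helm (a sketch only; the release name, namespace, and repo alias are assumptions and should be adjusted to your deployment):

```sh
# Remove the existing release (release name and namespace "gpu-operator" assumed).
helm uninstall gpu-operator -n gpu-operator

# Reinstall v24.9.0 from scratch. "nvidia" is assumed to be a Helm repo alias
# for the NVIDIA chart repository already configured on your client.
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --version v24.9.0
```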

jwindhager commented 2 weeks ago

I can confirm the issue. Running v24.9.0 on K3s with Flux on Ubuntu 24.04 LTS with driver 535. Rolling back to v24.6.2 fixes the issue. Unlike @nikito, I did not manage to get v24.9.0 working again after the rollback (I tried uninstalling and reinstalling from scratch). Staying on v24.6.2 for now.
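Staying pinned to the known-good chart amounts to something like the following with plain Helm (a sketch; release name and namespace are assumptions, and with Flux the equivalent is pinning the chart version in the HelmRelease spec):

```sh
# Keep the release on the last working chart version.
helm upgrade gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --version v24.6.2
```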