NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

Cannot install gpu-operator due to kubelet error: Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured #589

Open yuzs2 opened 1 year ago

yuzs2 commented 1 year ago


1. Quick Debug Information

2. Issue or feature description

I'm trying to install gpu-operator on NVIDIA vGPU following the docs: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/install-gpu-operator-vgpu.html#nvidia-vgpu and https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/install-gpu-operator.html#nvidia-vgpu

The driver container image was built successfully.

$sudo docker build --build-arg DRIVER_TYPE=vgpu --build-arg DRIVER_VERSION=510.85.02-grid --build-arg CUDA_VERSION=11.6.2 --build-arg TARGETARCH=amd64 -t ${my_repo}/gpuaas/nvidia-driver:510.85.02-ubuntu20.04 .
.......
Successfully built 027c1a95eab7
Successfully tagged ${my_repo}/gpuaas/nvidia-driver:510.85.02-ubuntu20.04
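
The image also has to be pushed to ${my_repo} so the cluster can pull it via driver.repository (unless it is preloaded on the nodes):

$ sudo docker push ${my_repo}/gpuaas/nvidia-driver:510.85.02-ubuntu20.04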

However, the installation was not successful:

$ helm install  gpu-operator  nvidia/gpu-operator --version v22.9.1 -n gpu-operator --set driver.repository=${my_repo}/gpuaas/nvidia-driver --set driver.version=510.85.02 --set driver.licensingConfig.configMapName=licensing-config
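
For reference, the licensing-config ConfigMap passed via driver.licensingConfig.configMapName is something the vGPU guide has you create beforehand from gridd.conf and the client configuration token; roughly like this (the file names are the ones used in the guide and may differ in your setup):

$ kubectl create configmap licensing-config -n gpu-operator --from-file=gridd.conf --from-file=client_configuration_token.tok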

$ k get pods -n gpu-operator                                
NAME                                                          READY   STATUS     RESTARTS   AGE
gpu-feature-discovery-7t6nz                                   0/1     Init:0/1   0          40m
gpu-operator-5fbd5bf5cf-d6fdw                                 1/1     Running    0          40m
gpu-operator-node-feature-discovery-master-59b4b67f4f-zxswq   1/1     Running    0          40m
gpu-operator-node-feature-discovery-worker-92kgr              1/1     Running    0          40m
nvidia-container-toolkit-daemonset-rjnlr                      0/1     Init:0/1   0          40m
nvidia-dcgm-exporter-bdwsq                                    0/1     Init:0/1   0          40m
nvidia-device-plugin-daemonset-czvvr                          0/1     Init:0/1   0          40m
nvidia-operator-validator-x2fpc                               0/1     Init:0/4   0          40m
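
One thing worth noting in the list above (assuming driver.enabled was left at its default): no driver pod from the operator's driver DaemonSet (typically named nvidia-driver-daemonset) shows up at all, so it may be worth checking whether that DaemonSet was created and scheduled on the node:

$ k get ds -n gpu-operator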

$ k -n gpu-operator describe pod gpu-feature-discovery-7t6nz
Events:
  Type     Reason                  Age                From               Message
  ----     ------                  ----               ----               -------
  Normal   Scheduled               85s                default-scheduler  Successfully assigned gpu-operator/gpu-feature-discovery-7t6nz to norma-md-ng-2-p9gcx-697c74f87bxz5lws-rxspg
  Warning  FailedCreatePodSandBox  12s (x7 over 84s)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

Btw, if I ssh into the GPU node and manually install the driver (NVIDIA-Linux-x86_64-510.85.02-grid.run), then I can successfully install the gpu-operator with the same command above.
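
For reference, the FailedCreatePodSandBox message above comes from containerd: kubelet is asking it for a runtime handler named "nvidia", but no such runtime is registered in containerd's config yet. Registering it is normally the job of the nvidia-container-toolkit daemonset, which in the listing above is itself still stuck in Init. A quick way to confirm on the node (paths assume a stock containerd setup):

$ kubectl get runtimeclass nvidia
$ sudo grep -A5 'runtimes.nvidia' /etc/containerd/config.toml
# no output from the grep means containerd has no "nvidia" runtime entry,
# which is exactly what the sandbox error is complaining about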

waterfeeds commented 10 months ago

Hi bro, I once encountered the same error, so I'll give you my example for reference. A week ago I installed the NVIDIA driver, toolkit and device plugin manually to test GPU workloads; I run containerd as the runtime for kubelet on Ubuntu 22.04, and CUDA tests worked. A few days ago I tried installing gpu-operator. Before that I uninstalled the NVIDIA driver, toolkit and device plugin and reverted the /etc/containerd/config.toml config, and then I got the same error as you. I had read many old issues about this error, and eventually found a gpu-operator committer recommending the lsmod | grep nvidia command. It showed some NVIDIA driver modules still in use by the Ubuntu kernel, meaning the uninstall was incomplete, so I rebooted my host, after which lsmod | grep nvidia returned nothing. Glad to say, everything is OK now and all the NVIDIA pods are Running. Hope this is useful to you!
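
In shell form, the check described above is just (reboot only if the first command still lists modules after uninstalling):

# check whether NVIDIA kernel modules from the manual install are still loaded
$ lsmod | grep nvidia
# if anything is listed, reboot the node and run the check again
$ sudo reboot
$ lsmod | grep nvidia    # should print nothing before installing gpu-operator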