Open yuzs2 opened 1 year ago
Hi, I ran into the same error once. Here is what happened in my case, for your reference.
A week ago, I manually installed the NVIDIA driver, toolkit, and device plugin to test running GPU workloads. I use containerd as the kubelet runtime on Ubuntu 22.04, and the CUDA test worked.
A few days ago I tried installing gpu-operator. Before that, I uninstalled the NVIDIA driver, toolkit, and device plugin, and reverted the /etc/containerd/config.toml config. I got the same error as you. After reading many old issues about this error, I found that a gpu-operator committer had recommended running the lsmod | grep nvidia command. It showed that some NVIDIA driver modules were still in use by the Ubuntu kernel, which meant the uninstall was incomplete. So I rebooted the host, and after that lsmod | grep nvidia returned nothing. Glad to say, everything is OK now and all the NVIDIA pods are Running.
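The check described above can be sketched as a small shell helper. This is only an illustration, not part of gpu-operator: the function name list_nvidia_modules is hypothetical, and it assumes the usual lsmod output layout where the first column is the module name.

```shell
#!/bin/sh
# Hypothetical helper (not from gpu-operator): reads lsmod-style output on
# stdin and prints the name of every module whose name starts with "nvidia".
list_nvidia_modules() {
    awk '$1 ~ /^nvidia/ { print $1 }'
}

# Run the check only where lsmod is available (i.e. on the GPU node itself).
if command -v lsmod >/dev/null 2>&1; then
    leftover=$(lsmod | list_nvidia_modules)
    if [ -n "$leftover" ]; then
        # Modules left over from a manual install are still loaded.
        echo "NVIDIA kernel modules still loaded:"
        echo "$leftover"
    else
        echo "No NVIDIA kernel modules loaded."
    fi
fi
```

If modules are still listed after uninstalling the manually installed driver, a reboot (as in my case) clears them before the operator is installed.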
Hope this helps!
1. Quick Debug Information
2. Issue or feature description
I'm trying to install gpu-operator on NVIDIA vGPU following the docs: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/install-gpu-operator-vgpu.html#nvidia-vgpu and https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/install-gpu-operator.html#nvidia-vgpu
The driver container image was built successfully.
However, the installation was not successful:
Btw, if I SSH into the GPU node and manually install the driver (NVIDIA-Linux-x86_64-510.85.02-grid.run), then I can successfully install gpu-operator with the same command as above.