Closed santurini closed 1 week ago
@elezar @klueska I retried the experiment on an H100 and got the same error when deploying the nvidia-device-plugin: the same pod terminated with exactly the same logs. However, if I deploy only the gpu-operator, it deploys successfully and is able to find the NVML library. Could you please help me?
I am following @elezar's guide on how to enable MPS in a Kubernetes cluster (I am using k3s). After deploying the gpu-operator, the nvidia-device-plugin-ctr fails to start.
Similar Issues
This is similar to #478, so I would also ask @klueska to take a look and help me. I also checked that libnvidia-ml.so.1 is present on the machine, and it is, located here:
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
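For reference, a minimal Python sketch of that kind of check. The helper name `check_nvml` is hypothetical and not part of any NVIDIA tooling, and the path is simply the one reported above. It only inspects the filesystem of the machine where it runs, so a passing result on the host does not guarantee the library is visible inside the device plugin container.

```python
import ctypes
import os

# Path reported in this issue; other distributions may place the library elsewhere.
NVML_PATH = "/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1"


def check_nvml(path: str = NVML_PATH) -> str:
    """Return 'missing', 'unloadable', or 'ok' for the given library path."""
    if not os.path.exists(path):
        return "missing"
    try:
        # Ask the dynamic loader to open the file; this fails for corrupt
        # files or for libraries with unresolved dependencies.
        ctypes.CDLL(path)
    except OSError:
        return "unloadable"
    return "ok"


if __name__ == "__main__":
    print(f"{NVML_PATH}: {check_nvml()}")
```

Running the same script inside the failing container (if a Python interpreter is available there) would show whether the library is actually mounted where the plugin expects it.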
Executed commands
Failing pod logs
NVIDIA libraries
NVIDIA-SMI output
Docker config
Containerd config