Open hansesm opened 8 months ago
I had the same issue recently. Thanks for the workaround, @hansesm !
Hi @hansesm @pappacena, thanks for the extended bug report and the documented steps. How are the GPU drivers installed/built on the systems in question?
The gpu-operator will attempt to install the driver at /run/nvidia/driver if no driver is already loaded. The steps above look like an installation where the gpu-operator installed the driver, but you then switched to using the drivers from the host instead. The linked issue seems to describe the same problem.
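To see which of those two situations applies on a given node, a quick check on the host can help (a diagnostic sketch only, not specific to this bug):
# is a kernel driver already loaded on the host?
lsmod | grep -i nvidia
# did the gpu-operator install its own driver tree?
ls -la /run/nvidia/driver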
An easier approach, ensuring that the host driver is used (if available), would be to enable the addon like this, depending on your scenario:
# make sure that host drivers are used
microk8s enable nvidia --gpu-operator-driver=host
# make sure that the operator builds and installs the nvidia drivers
microk8s enable nvidia --gpu-operator-driver=operator
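To confirm which mode the operator actually ended up in, the driver setting can be read from the ClusterPolicy (a sketch; the jsonpath assumes the standard gpu-operator ClusterPolicy schema):
# "true" = the operator builds/installs the driver, "false" = host driver is used
microk8s kubectl get clusterpolicies -o jsonpath='{.items[0].spec.driver.enabled}'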
Hope this helps! Can you try this on a clean system and report back? Thanks!
Summary
The default GPU (NVIDIA) addon does not find the correct drivers, so containers crash with the following error:
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
What Should Happen Instead?
Everything should work after enabling the GPU addon with microk8s enable nvidia.
Reproduction Steps
microk8s enable nvidia
microk8s kubectl get pods --namespace gpu-operator-resources
microk8s kubectl describe pod nvidia-operator-validator-hxfbf -n gpu-operator-resources
nvidia-smi
ls -la /run/nvidia/driver
cat /etc/docker/daemon.json
cat /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
microk8s inspect
microk8s kubectl describe clusterpolicies --all-namespaces
Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown
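The missing library can also be checked for directly on the host (a quick sanity check; libnvidia-ml.so.1 is normally shipped by the driver package):
# the NVML library should be registered with the dynamic linker
ldconfig -p | grep libnvidia-ml
# nvidia-smi links against it, so this should resolve as well
ldd "$(which nvidia-smi)" | grep libnvidia-ml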
Can you suggest a fix?
Changed values in /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml:
root = "/run/nvidia/driver" to root = "/"
Added the nvidia runtime entry pointing to /usr/local/nvidia/toolkit/nvidia-container-runtime:
"runtimes": {
  "nvidia": {
    "path": "/usr/local/nvidia/toolkit/nvidia-container-runtime",
    "runtimeArgs": []
  }
}
Added symlink:
ln -s /sbin /run/nvidia/driver/sbin
Restart MicroK8s:
microk8s stop
microk8s start
Then all containers start up correctly!
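For reference, the manual steps above can be scripted roughly like this (a sketch only; the sed pattern assumes the default config.toml layout, so double-check before running):
# point the container CLI at the host driver root instead of /run/nvidia/driver
sudo sed -i 's|root = "/run/nvidia/driver"|root = "/"|' \
  /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
# make host binaries reachable under the driver root
sudo ln -s /sbin /run/nvidia/driver/sbin
# restart MicroK8s to pick up the changes
microk8s stop
microk8s start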
Best regards!
EDIT: Found the following upstream issue describing the same problem: https://github.com/NVIDIA/gpu-operator/issues/511