Ru13en closed this issue 1 year ago.
Thanks for creating the new issue @Ru13en
Here I would assume that the kernel modules cannot be loaded by the NVIDIA container runtime hook, which also prevents the device nodes from being created. nvidia-smi ends up loading the kernel modules and creating the device nodes, but seems to skip the creation of nvidia-uvm and nvidia-uvm-tools -- which is handled by the "Device Node verification" script that you mentioned.
Is it possible to run the script on startup of the system?
@elezar Yes, I fixed it by creating a script that runs both commands at startup. However, that is not a user-friendly approach...
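For reference, one way to make that workaround less manual is a oneshot systemd unit that runs both commands at boot. This is only a sketch: the unit name and the script path (/usr/local/bin/nvidia-device-nodes.sh) are placeholders for whatever your startup script is called.

```ini
# /etc/systemd/system/nvidia-device-nodes.service (hypothetical)
[Unit]
Description=Load NVIDIA kernel modules and create device nodes for rootless podman
After=systemd-modules-load.service

[Service]
Type=oneshot
# nvidia-smi loads the kernel modules and creates most device nodes
ExecStart=/usr/bin/nvidia-smi
# Follow-up script creates /dev/nvidia-uvm and /dev/nvidia-uvm-tools
ExecStart=/usr/local/bin/nvidia-device-nodes.sh
RemainAfterExit=true

[Install]
WantedBy=multi-user.target
```

Enable it with `systemctl enable nvidia-device-nodes.service` so it runs on every boot.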
I don't know whether there is a way around this for rootless podman (I would have to check), but I would expect this to work in the rootful case, since the NVIDIA Container Toolkit DOES load the kernel modules and create the device nodes on the host as part of creating the container. Could you uncomment the debug option in the toolkit config (#debug = "/var/log/nvidia-container-toolkit.log") and attach the contents of that file when launching a rootful container that fails?
Testing with:
podman run --privileged -it --rm tensorflow/tensorflow:latest-gpu python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
@elezar For some reason I can no longer replicate the issue for rootful runs, but the behavior persists for rootless (maybe it was fixed by an update, since I made the previous post in May). For rootless, unless the root user starts a container first, it will trigger:
Error: error executing hook `/usr/bin/nvidia-container-toolkit` (exit code: 1): OCI runtime error
If I run the command with sudo first and then without it, it runs normally (the NVIDIA Container Toolkit is loading the kernel modules and creating the device nodes).
Please see the updated instructions for running the NVIDIA Container Runtime with Podman.
If you're still having problems, please open a new issue against https://github.com/NVIDIA/nvidia-container-toolkit.
1. Issue or feature description
On each system boot/reboot, rootless podman does not work with the NVIDIA plugin. I must run nvidia-smi first, otherwise I get the error:
Error: error executing hook `/usr/bin/nvidia-container-toolkit` (exit code: 1): OCI runtime error
After that I also need to run the NVIDIA Device Node Verification script to properly create /dev/nvidia-uvm for CUDA applications, as described in this post: https://github.com/tensorflow/tensorflow/issues/32623#issuecomment-533936509
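A minimal sketch of such a script, modeled on the device-node verification script from the linked TensorFlow issue. The path in the comment is a placeholder, and the minor numbers (0 for nvidia-uvm, 1 for nvidia-uvm-tools) follow that script; adjust for your system.

```shell
#!/bin/bash
# Hypothetical location: /usr/local/bin/nvidia-device-nodes.sh

# Return the dynamically assigned major number for the nvidia-uvm device,
# read from /proc/devices (or from a file passed as $1, for testing).
uvm_major() {
    grep nvidia-uvm "${1:-/proc/devices}" | awk '{print $1; exit}'
}

# Only attempt this as root, and only if the nvidia-uvm module loads.
if [ "$(id -u)" -eq 0 ] && /sbin/modprobe nvidia-uvm 2>/dev/null; then
    major=$(uvm_major)
    # Create the device nodes that nvidia-smi skips.
    [ -e /dev/nvidia-uvm ] || mknod -m 666 /dev/nvidia-uvm c "$major" 0
    [ -e /dev/nvidia-uvm-tools ] || mknod -m 666 /dev/nvidia-uvm-tools c "$major" 1
fi
```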
2. Steps to reproduce the issue
Install CentOS 8 with SELinux enabled and the NVIDIA Linux drivers. Install podman and nvidia-container-runtime. Configure /etc/nvidia-container-runtime/config.toml (see attachment). Reboot the machine.
Run the command (it will fail unless you run nvidia-smi and the NVIDIA device node verification script after each reboot):
Run the commands (it will work):
3. Information to attach (optional if deemed irrelevant)
getenforce:
Enforcing
podman info:
cat /etc/nvidia-container-runtime/config.toml
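For comparison, a typical /etc/nvidia-container-runtime/config.toml looks roughly like the following. This is an assumption based on the package's default config, not the poster's actual attachment; the setting most relevant to rootless podman is no-cgroups = true.

```toml
disable-require = false

[nvidia-container-cli]
environment = []
load-kmods = true
# Required for rootless podman, which cannot modify root-owned cgroups:
no-cgroups = true
ldconfig = "@/sbin/ldconfig"
# Uncomment to capture a log when a container fails to start:
#debug = "/var/log/nvidia-container-toolkit.log"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
```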