Yeah this can be confusing.
Basically, on most installs UVM is not loaded by default. CUDA will implicitly load it through nvidia-modprobe if it's not present, and nvidia-docker will do it through nvidia-container-cli --load-kmods.
Now, since enroot is unprivileged it can't do any of that; it will issue a warning, but chances are you won't see it with pyxis.
One way to fix it is to load it on boot with a service, see https://github.com/NVIDIA/nephele/blob/bd404a9f7351f141dff362ccbc4d263bb3c42109/ansible/playbooks/nvidia.yml#L47
Another way is to use udev; IIRC some of the Ubuntu packages might do it.
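For anyone landing here, a minimal sketch of the boot-service approach, assuming nvidia-modprobe is installed at /usr/bin/nvidia-modprobe (the unit name and paths are my own, not necessarily what the linked playbook installs):

# /etc/systemd/system/nvidia-uvm.service (hypothetical unit name)
[Unit]
Description=Load the nvidia-uvm kernel module and create its device nodes
After=systemd-modules-load.service

[Service]
Type=oneshot
RemainAfterExit=yes
# -u loads nvidia-uvm and creates /dev/nvidia-uvm*, -c=0 creates /dev/nvidia0
ExecStart=/usr/bin/nvidia-modprobe -u -c=0

[Install]
WantedBy=multi-user.target

Enable it with systemctl enable --now nvidia-uvm.service and check systemctl status nvidia-uvm.service after the next reboot.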
I tried the nvidia-uvm boot service from the link, but that didn't seem to address the reboot issue. Any suggestions on how to do it with udev or the Ubuntu packages?
This should fix it; check the logs to see what happened and why the module didn't get loaded by the service. For udev, you can extract the official packages and see how they do it.
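To illustrate the udev route, a rough sketch of the kind of rule to look for (hypothetical file name and rule, not the one the Ubuntu packages actually ship; extract their rules file for the real thing):

# /etc/udev/rules.d/70-nvidia-uvm.rules (hypothetical)
# When the nvidia driver binds on the PCI bus, load nvidia-uvm and create its device nodes.
ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/usr/bin/nvidia-modprobe -u -c=0"

After adding a rule like that, reload udev with udevadm control --reload-rules.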
Closing since this isn't an issue with enroot per se.
The other day, my system was running just fine before a reboot. After the reboot, I started seeing the following errors when I tried to run a Slurm (with Pyxis and enroot) job. It complained about the nvidia-uvm module not being loaded and not being able to find /dev/nvidia-modeset. After some digging I was able to figure out why. As part of my machine setup I had run the command
sudo docker run --runtime=nvidia --rm nvidia/cuda:11.0-base nvidia-smi
This seemed to create /dev/nvidia-modeset and load the nvidia-uvm module. I wanted to document this in case others hit the issue. The issue that I am now trying to resolve is how to get whatever modules and mount points the docker command is loading to be set up at system boot. Is there something easy that I am missing?
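For reference, the pieces that the docker run loads as a side effect can be checked and loaded by hand after a reboot; the flags below are standard nvidia-modprobe options, and the device paths are the usual defaults:

# Check whether the module and device nodes survived the reboot
lsmod | grep nvidia_uvm
ls -l /dev/nvidia-uvm /dev/nvidia-modeset

# Load them manually, roughly what the docker run did implicitly
sudo nvidia-modprobe -u -c=0   # load nvidia-uvm and create /dev/nvidia-uvm*
sudo nvidia-modprobe -m        # load nvidia-modeset and create /dev/nvidia-modeset

To make this stick across reboots, see the service or udev approaches above.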