Yeah this can be confusing.
Basically, on most installs UVM is not loaded by default. CUDA will implicitly load it through nvidia-modprobe if it's not present, and nvidia-docker will do it through nvidia-container-cli --load-kmods.
Now, since enroot is unprivileged it can't do any of that; it will issue a warning, but chances are you won't see it with pyxis.
One way to fix it is to load it on boot with a service, see https://github.com/NVIDIA/nephele/blob/bd404a9f7351f141dff362ccbc4d263bb3c42109/ansible/playbooks/nvidia.yml#L47
Another way is to use udev; IIRC some of the Ubuntu packages might do it.
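For anyone landing here, a minimal sketch of the boot-service approach, assuming nvidia-modprobe is installed at /usr/bin/nvidia-modprobe (the unit name and paths are my own, not necessarily what the linked playbook installs):

# /etc/systemd/system/nvidia-uvm.service (hypothetical unit name)
[Unit]
Description=Load the nvidia-uvm kernel module and create its device nodes
After=systemd-modules-load.service

[Service]
Type=oneshot
RemainAfterExit=yes
# -u loads nvidia-uvm and creates /dev/nvidia-uvm*, -c=0 creates /dev/nvidia0
ExecStart=/usr/bin/nvidia-modprobe -u -c=0

[Install]
WantedBy=multi-user.target

Enable it with systemctl enable --now nvidia-uvm.service and check systemctl status nvidia-uvm.service after the next reboot.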
I tried the nvidia-uvm boot service from the link, but that didn't seem to address the reboot issue. Any suggestions on how to do it with udev or the Ubuntu packages?
This should fix it; check the logs to see what happened and why the module didn't get loaded by the service. For udev, you can extract the official packages and see how they do it.
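To illustrate the udev route, a rough sketch of the kind of rule to look for (hypothetical file name and rule, not the one the Ubuntu packages actually ship; extract their rules file for the real thing):

# /etc/udev/rules.d/70-nvidia-uvm.rules (hypothetical)
# When the nvidia driver binds on the PCI bus, load nvidia-uvm and create its device nodes.
ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/usr/bin/nvidia-modprobe -u -c=0"

After adding a rule like that, reload udev with udevadm control --reload-rules.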
Closing since this isn't an issue with enroot per se.
The other day, my system was running just fine before a reboot. After the reboot, I started seeing the following errors when I tried to run a Slurm (with Pyxis and enroot) job. It complained about the nvidia-uvm module not being loaded and not being able to find /dev/nvidia-modeset. After some digging I was able to figure out why. As part of my machine setup I had run the command
sudo docker run --runtime=nvidia --rm nvidia/cuda:11.0-base nvidia-smi
This seemed to create /dev/nvidia-modeset and load the nvidia-uvm module. I wanted to document this in case others hit the issue. The issue that I am now trying to resolve is how to get whatever modules and mount points the docker command is loading to be set up at system boot. Is there something easy that I am missing?
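For reference, the pieces that the docker run loads as a side effect can be checked and loaded by hand after a reboot; the flags below are standard nvidia-modprobe options, and the device paths are the usual defaults:

# Check whether the module and device nodes survived the reboot
lsmod | grep nvidia_uvm
ls -l /dev/nvidia-uvm /dev/nvidia-modeset

# Load them manually, roughly what the docker run did implicitly
sudo nvidia-modprobe -u -c=0   # load nvidia-uvm and create /dev/nvidia-uvm*
sudo nvidia-modprobe -m        # load nvidia-modeset and create /dev/nvidia-modeset

To make this stick across reboots, see the service or udev approaches above.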