Hello,
What error message do you see when you attempt to start the container after the driver upgrade?
hi,
RuntimeError: cuda runtime error (803) : system has unsupported display driver / cuda driver combination
This happened with an Enroot image (i.e., a .sqsh file) created on a node with driver 470.57.02 (or possibly an earlier version; I do not recall exactly). PyTorch was installed in the Enroot container before exporting it.
That image no longer runs on a node where the driver has been updated to 470.63.01 (it keeps running on a non-updated node).
Make sure that you didn't inadvertently install driver dependencies in the container:
dpkg -l | grep cuda
If not, then check which library libcuda.so resolves to:
ldconfig -p | awk '/libcuda.so/{print $4}' | xargs realpath
and compare it against:
nvidia-smi -q | grep 'Driver Version'
Also check that NVIDIA_REQUIRE_CUDA in the container has been set correctly.
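Put together, the checks might look like the sketch below, run inside the started container; exact paths, package names, and versions are illustrative and will differ per system:

```bash
#!/bin/sh
# Minimal sketch of the checks suggested above; output will differ per system.

echo "== CUDA-related packages baked into the image =="
dpkg -l | grep -i cuda || echo "(none)"

echo "== libcuda.so resolved by the dynamic linker =="
ldconfig -p | awk '/libcuda.so/{print $4}' | xargs realpath | sort -u

echo "== Driver version reported by nvidia-smi =="
nvidia-smi -q | grep 'Driver Version'

echo "== Driver constraint declared by the CUDA base image =="
echo "${NVIDIA_REQUIRE_CUDA:-<not set>}"
```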
Hi,
indeed the container had brought along some CUDA libraries (visible with dpkg -l). I guess that is the source of the errors. Thanks, I will mark this as solved.
Often the available Docker images (whether from NGC or Docker Hub) do not provide all the functionality needed for a given use case, and we extend them with a workflow along the following lines.
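A rough sketch of such an extend-and-export sequence with Enroot; the image name/tag and the packages installed are placeholders, not the exact commands from this setup:

```bash
# 1. Import the base image from a registry into a squashfs file
enroot import docker://nvcr.io#nvidia/pytorch:21.07-py3

# 2. Unpack it into a named container
enroot create --name pytorch-ext nvidia+pytorch+21.07-py3.sqsh

# 3. Start it writable and install the extra functionality
enroot start --root --rw pytorch-ext
#   (inside the container) pip install <extra packages>; exit

# 4. Re-export the extended container as a new .sqsh image
enroot export --output pytorch-ext.sqsh pytorch-ext
```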
This workflow works fine most of the time, but when the NVIDIA driver is updated, the resulting container can no longer start unless the above procedure is repeated, despite the driver being backward compatible with previous CUDA toolkits.
Is there any way of avoiding this? https://github.com/NVIDIA/nvidia-container-runtime#nvidia_driver_capabilities shows some environment variables that can be passed to nvidia-container-runtime. Assuming that utility and compute are passed to the container and the driver is compatible with the previous CUDA toolkit, why would these Enroot containers not start after a driver upgrade? Is this the intended behavior?
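Given the diagnosis earlier in this thread (driver libraries inadvertently baked into the image), one way to keep the exported image independent of the host driver is to make sure no driver packages end up inside it before running enroot export. A sketch, assuming an Ubuntu-based image; the package name shown is hypothetical, so check the dpkg listing first:

```bash
# Run inside the writable container before exporting it.
# List driver-related packages that may have been pulled in as dependencies;
# the user-space driver (libcuda.so) is injected from the host at start time,
# so a copy baked into the image can shadow it after a host driver upgrade.
dpkg -l | grep -Ei 'nvidia-driver|libnvidia|cuda-drivers'

# Remove only what the listing actually shows (hypothetical package name below);
# keep the CUDA toolkit packages (cuda-toolkit-*, libcudart, ...) themselves.
apt-get purge -y cuda-drivers-470
```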