Hello,
What error message do you see when you attempt to start the container after the driver upgrade?
hi,
RuntimeError: cuda runtime error (803) : system has unsupported display driver / cuda driver combination
This happened with an Enroot image (i.e., a .sqsh file) created on a node with driver 470.57.02 (or possibly an earlier version; I do not recall exactly). PyTorch was installed in the Enroot container before exporting it.
That image no longer runs on a node where the driver has been updated to 470.63.01 (it keeps running on a non-updated node).
Make sure that you didn't inadvertently install driver dependencies in the container:
dpkg -l | grep cuda
If not, then check which library libcuda.so resolves to:
ldconfig -p | awk '/libcuda.so/{print $4}' | xargs realpath
and compare it against:
nvidia-smi -q | grep 'Driver Version'
Also check that NVIDIA_REQUIRE_CUDA in the container has been set correctly.
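Put together, the checks might look like the sketch below, run inside the started container; exact paths, package names, and versions are illustrative and will differ per system:

```bash
#!/bin/sh
# Minimal sketch of the checks suggested above; output will differ per system.

echo "== CUDA-related packages baked into the image =="
dpkg -l | grep -i cuda || echo "(none)"

echo "== libcuda.so resolved by the dynamic linker =="
ldconfig -p | awk '/libcuda.so/{print $4}' | xargs realpath | sort -u

echo "== Driver version reported by nvidia-smi =="
nvidia-smi -q | grep 'Driver Version'

echo "== Driver constraint declared by the CUDA base image =="
echo "${NVIDIA_REQUIRE_CUDA:-<not set>}"
```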
Hi,
indeed the container had brought along some CUDA libraries (visible with dpkg -l). I guess that is the source of the errors. Thanks, I will mark this as solved.
Often the available Docker images (whether from NGC or Docker Hub) do not provide all the functionality needed for a given use case, and we extend them with a workflow along the following lines.
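A rough sketch of such an extend-and-export sequence with Enroot; the image name/tag and the packages installed are placeholders, not the exact commands from this setup:

```bash
# 1. Import the base image from a registry into a squashfs file
enroot import docker://nvcr.io#nvidia/pytorch:21.07-py3

# 2. Unpack it into a named container
enroot create --name pytorch-ext nvidia+pytorch+21.07-py3.sqsh

# 3. Start it writable and install the extra functionality
enroot start --root --rw pytorch-ext
#   (inside the container) pip install <extra packages>; exit

# 4. Re-export the extended container as a new .sqsh image
enroot export --output pytorch-ext.sqsh pytorch-ext
```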
This workflow works fine most of the time, but when the NVIDIA driver is updated, the resulting container can no longer start unless the above procedure is repeated, despite the driver being backward compatible with previous CUDA toolkits.
Is there any way of avoiding this? https://github.com/NVIDIA/nvidia-container-runtime#nvidia_driver_capabilities shows some environment variables that can be passed to nvidia-container-runtime. Assuming that utility and compute are passed to the container and the driver is compatible with the previous CUDA toolkit, why would these Enroot containers not start after a driver upgrade? Is this the intended behavior?
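Given the diagnosis earlier in this thread (driver libraries inadvertently baked into the image), one way to keep the exported image independent of the host driver is to make sure no driver packages end up inside it before running enroot export. A sketch, assuming an Ubuntu-based image; the package name shown is hypothetical, so check the dpkg listing first:

```bash
# Run inside the writable container before exporting it.
# List driver-related packages that may have been pulled in as dependencies;
# the user-space driver (libcuda.so) is injected from the host at start time,
# so a copy baked into the image can shadow it after a host driver upgrade.
dpkg -l | grep -Ei 'nvidia-driver|libnvidia|cuda-drivers'

# Remove only what the listing actually shows (hypothetical package name below);
# keep the CUDA toolkit packages (cuda-toolkit-*, libcudart, ...) themselves.
apt-get purge -y cuda-drivers-470
```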