Open epaaso opened 1 year ago
Maybe it happens when disconnecting from proxy while using the GPU
It does not; I ran some tests to confirm.
It has something to do with the container being updated. See this issue: https://github.com/NVIDIA/nvidia-docker/issues/1469
Running without --runtime=nvidia was also suggested, but this did not help.
This is an official guide to work around the problem while a fix is merged. This is the exact solution:
For Docker environments
Using the nvidia-ctk utility:
The NVIDIA Container Toolkit v1.12.0 includes a utility for creating symlinks in /dev/char for all possible NVIDIA device nodes required for using GPUs in containers. This can be run as follows:
sudo nvidia-ctk system create-dev-char-symlinks \
--create-all
This command should be configured to run at boot on each node where GPUs will be used in containers. It requires that the NVIDIA driver kernel modules have been loaded at the point where it is run.
A simple udev rule to enforce this can be seen below:
# This will create /dev/char symlinks to all device nodes
ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/usr/bin/nvidia-ctk system create-dev-char-symlinks --create-all"
A good place to install this rule would be:
/lib/udev/rules.d/71-nvidia-dev-char.rules
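If a udev rule is inconvenient, the same one-shot command could in principle be run as a boot service instead. A rough sketch of such a systemd unit follows; the unit name and path are my own assumptions, not part of the official guide:

```
# /etc/systemd/system/nvidia-dev-char.service (hypothetical unit name)
[Unit]
Description=Create /dev/char symlinks for NVIDIA device nodes
# The driver kernel modules must already be loaded when this runs
After=systemd-modules-load.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-ctk system create-dev-char-symlinks --create-all

[Install]
WantedBy=multi-user.target
```

It would then be enabled with systemctl enable nvidia-dev-char.service so the symlinks are recreated on every boot.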
We already ran this on the server and it did not work for a container that was already stopped, but it should work for one started fresh from an image. What it did work for is this container: so if it does not work for one started from the image, maybe we should pass the other devices explicitly:
docker run -d --rm --runtime=nvidia --gpus all \
--device=/dev/nvidia-uvm \
--device=/dev/nvidia-uvm-tools \
--device=/dev/nvidia-modeset \
--device=/dev/nvidiactl \
--device=/dev/nvidia0 \
nvcr.io/nvidia/cuda:12.0.0-base-ubuntu20.04 bash -c "while true; do nvidia-smi -L; sleep 5; done"
After being idle for a while, the GPU in the container (netopaas/sc_arches) no longer works:
Failed to initialize NVML: Unknown Error
For now the only fix is stopping the container and starting it again.
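Until a proper fix is merged, the stop/start workaround could be automated. Below is a rough watchdog sketch; the container name, the polling interval, and the choice of docker restart are all my own assumptions, not something prescribed in this thread:

```shell
#!/usr/bin/env bash
# Watchdog sketch: restart a container whose GPU access has gone stale.
# CONTAINER and INTERVAL are placeholders; override them via the environment.
CONTAINER="${CONTAINER:-my_gpu_container}"
INTERVAL="${INTERVAL:-60}"

gpu_ok() {
  # nvidia-smi exits non-zero (printing "Failed to initialize NVML:
  # Unknown Error") once the device nodes inside the container are broken.
  docker exec "$CONTAINER" nvidia-smi -L > /dev/null 2>&1
}

watch_loop() {
  while true; do
    gpu_ok || docker restart "$CONTAINER"
    sleep "$INTERVAL"
  done
}

# Uncomment to run continuously:
# watch_loop
```

Restarting discards any in-container state, so this only makes sense for workloads that can tolerate being bounced.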