NVM craches suddenly - Githubissues

epaaso commented 1 year ago

After a while idle the gpu in the container (netopaas/sc_arches) no longer works: Failed to initialize NVML: Unknown Error

For now only fix is stopping container and starting again

epaaso commented 1 year ago

Maybe it happens when disconnecting from proxy while using the GPU

epaaso commented 1 year ago

Maybe it happens when disconnecting from proxy while using the GPU

It does not, did some tests.

It has something to do with updating of the container. This issue: https://github.com/NVIDIA/nvidia-docker/issues/1469

Also running without --runtime=nvidia may help. (This did not help)

epaaso commented 1 month ago

This is an official guide to workaround the problem while a fix is merged.

This the exact solution: For Docker environments Using the nvidia-ctk utility:

The NVIDIA Container Toolkit v1.12.0 includes a utility for creating symlinks in /dev/char for all possible NVIDIA device nodes required for using GPUs in containers. This can be run as follows:

sudo nvidia-ctk system create-dev-char-symlinks \
    --create-all

This command should be configured to run at boot on each node where GPUs will be used in containers. It requires that the NVIDIA driver kernel modules have been loaded at the point where it is run.

A simple udev rule to enforce this can be seen below:

# This will create /dev/char symlinks to all device nodes
ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/usr/bin/nvidia-ctk system     create-dev-char-symlinks --create-all"

A good place to install this rule would be: /lib/udev/rules.d/71-nvidia-dev-char.rules

epaaso commented 1 month ago

Ya corrimos esto en el server y no funcionó para un contenedor parado, pero si deberia para uno corrido desde una imagen. Para lo que si funcionó es para este contenedor: Asi que si no sirve para uno corrido desde la imagen quizá debamos poner los otros devices:

docker run -d --rm --runtime=nvidia --gpus all \
    --device=/dev/nvidia-uvm \
    --device=/dev/nvidia-uvm-tools \
    --device=/dev/nvidia-modeset \
    --device=/dev/nvidiactl \
    --device=/dev/nvidia0 \
    nvcr.io/nvidia/cuda:12.0.0-base-ubuntu20.04 bash -c "while [ true ]; do nvidia-smi -L; sleep 5; done"

epaaso / sc-luca-explore

NVM craches suddenly #10