NVIDIA / nvidia-docker

Build and run Docker containers leveraging NVIDIA GPUs
Apache License 2.0

NOTICE: Containers losing access to GPUs with error: "Failed to initialize NVML: Unknown Error" #1730

Closed cdesiniotis closed 5 months ago

cdesiniotis commented 1 year ago

1. Executive summary

Under specific conditions, containers may abruptly lose access to the GPUs they were initially connected to. We have determined the root cause of this issue and identified the environments in which it can occur. Workarounds for the affected environments are provided at the end of this document until a proper fix is released.

2. Summary of the issue

Containerized GPU workloads may suddenly lose access to their GPUs. This happens when systemd is used to manage the cgroups of the container and it is triggered to reload any unit files that reference NVIDIA GPUs (e.g. by something as simple as a systemctl daemon-reload).

When the container loses access to the GPU, the following error message appears in the console output:

Failed to initialize NVML: Unknown Error

Once the issue occurs, the container needs to be deleted. When it is recreated (manually, or automatically by a container orchestration platform), it will regain access to the GPU.
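
For example, in a standalone Docker environment recovery amounts to removing the affected container and starting it again; in K8s, deleting the pod so it is recreated has the same effect (the names below are placeholders):

# Standalone Docker: remove the affected container, then re-run it
$ docker rm -f <container-id>

# K8s: delete the affected pod so its controller recreates it
$ kubectl delete pod <pod-name>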

The issue originates from the fact that recent versions of runc require that symlinks be present under /dev/char to any device nodes being injected into a container. Unfortunately, these symlinks are not present for NVIDIA devices, and the NVIDIA GPU driver does not (currently) provide a means for them to be created automatically.
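
This can be observed directly on an affected host: the NVIDIA device nodes exist, but /dev/char contains no symlinks pointing back at them. The following commands are a minimal sketch of that check:

# List the NVIDIA device nodes and their major:minor numbers
$ ls -l /dev/nvidia*

# On an affected host this returns nothing, i.e. no /dev/char symlinks
# reference the NVIDIA device nodes
$ ls -l /dev/char | grep -i nvidia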

A fix will be included in the next patch release of all supported NVIDIA GPU drivers.

3. Affected environments

Affected environments are those using runc and enabling systemd cgroup management at the high-level container runtime.

If the system is NOT using systemd to manage cgroups, then it is NOT subject to this issue.

An exhaustive list of the affected environments is provided below:

Note: Podman environments use crun by default and are not subject to this issue unless runc is configured as the low-level container runtime to be used.
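
A quick way to confirm which cgroup driver your high-level runtime is using (shown as a sketch; adjust paths to your installation):

# Docker: prints "systemd" or "cgroupfs"
$ docker info --format '{{.CgroupDriver}}'

# containerd (e.g. under K8s): SystemdCgroup = true in the runc options
# means the systemd cgroup driver is enabled
$ grep -i SystemdCgroup /etc/containerd/config.toml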

4. How to check if you are affected

You can use the following steps to confirm that your system is affected. After you implement one of the workarounds (mentioned in the next section), you can repeat the steps to confirm that the error is no longer reproducible.

For Docker environments

Run a test container:

$ docker run -d --rm --runtime=nvidia --gpus all \
    --device=/dev/nvidia-uvm \
    --device=/dev/nvidia-uvm-tools \
    --device=/dev/nvidia-modeset \
    --device=/dev/nvidiactl \
    --device=/dev/nvidia0 \
    nvcr.io/nvidia/cuda:12.0.0-base-ubuntu20.04 bash -c "while [ true ]; do nvidia-smi -L; sleep 5; done"  

bc045274b44bdf6ec2e4cc10d2968d1d2a046c47cad0a1d2088dc0a430add24b

Note: Make sure to pass the device nodes as shown above; they are needed to narrow the problem down to this specific issue.

If your system has more than one GPU, extend the above command with an additional --device flag for each GPU. Example with a system that has 2 GPUs:

$ docker run -d --rm --runtime=nvidia --gpus all \
    ...
    --device=/dev/nvidia0 \
    --device=/dev/nvidia1 \
    ...

Check the logs from the container:

$ docker logs bc045274b44bdf6ec2e4cc10d2968d1d2a046c47cad0a1d2088dc0a430add24b

GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)

Then initiate a daemon-reload:

$ sudo systemctl daemon-reload

Check the logs from the container:

$ docker logs bc045274b44bdf6ec2e4cc10d2968d1d2a046c47cad0a1d2088dc0a430add24b

GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
GPU 0: Tesla K80 (UUID: GPU-05ea3312-64dd-a4e7-bc72-46d2f6050147)
Failed to initialize NVML: Unknown Error
Failed to initialize NVML: Unknown Error

For K8s environments

Run a test pod:

$ cat nvidia-smi-loop.yaml

apiVersion: v1
kind: Pod
metadata:
  name: cuda-nvidia-smi-loop
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda
    image: "nvcr.io/nvidia/cuda:12.0.0-base-ubuntu20.04"
    command: ["/bin/sh", "-c"]
    args: ["while true; do nvidia-smi -L; sleep 5; done"]
    resources:
      limits:
        nvidia.com/gpu: 1

$ kubectl apply -f nvidia-smi-loop.yaml  

pod/cuda-nvidia-smi-loop created

Check the logs from the pod:

$ kubectl logs cuda-nvidia-smi-loop

GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-551720f0-caf0-22b7-f525-2a51a6ab478d)
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-551720f0-caf0-22b7-f525-2a51a6ab478d)

Then initiate a daemon-reload:

$ sudo systemctl daemon-reload

Check the logs from the pod:

$ kubectl logs cuda-nvidia-smi-loop

GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-551720f0-caf0-22b7-f525-2a51a6ab478d)
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-551720f0-caf0-22b7-f525-2a51a6ab478d)
Failed to initialize NVML: Unknown Error
Failed to initialize NVML: Unknown Error

5. Workarounds

The following workarounds are available for both standalone Docker environments and K8s environments (multiple options are presented in order of preference; the first is the most recommended):

For Docker environments
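
As an illustration of the direction these workarounds take, the sketch below creates the missing /dev/char symlinks on the host using the nvidia-ctk CLI from the NVIDIA Container Toolkit; treat the exact subcommand, and the idea of wiring it into a udev rule, as assumptions and follow the officially documented steps for your driver and toolkit versions:

# Sketch only: create /dev/char symlinks for the NVIDIA device nodes so
# that runc can resolve them (assumes nvidia-ctk provides the
# create-dev-char-symlinks subcommand)
$ sudo nvidia-ctk system create-dev-char-symlinks --create-all

# Verify that the symlinks now exist
$ ls -l /dev/char | grep -i nvidia

Since symlinks under /dev do not persist across reboots or driver reloads, the full workaround typically pairs this command with a udev rule or a boot-time service.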

For K8s environments
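
In a K8s cluster the same idea applies on each GPU node; the sketch below is illustrative only, and whether moving containerd away from the systemd cgroup driver is acceptable depends on your cluster configuration (the kubelet's cgroup driver must match):

# On each GPU node (sketch): create the /dev/char symlinks so runc can
# resolve the injected device nodes
$ sudo nvidia-ctk system create-dev-char-symlinks --create-all

# Alternatively, check whether containerd is configured with the systemd
# cgroup driver; per section 3, a runtime that does not use systemd
# cgroup management is not subject to this issue
$ grep -i SystemdCgroup /etc/containerd/config.toml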

elezar commented 5 months ago

See https://github.com/NVIDIA/nvidia-container-toolkit/issues/48