NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

Nvidia-container-toolkit #378

Open MorphSeur opened 9 months ago

MorphSeur commented 9 months ago

Hello!

After carefully following the NVIDIA Container Toolkit installation guide, Docker containers are still unable to use the nvidia runtime.

/etc/apt/sources.list.d/cuda-ubuntu2204-x86_64.list:deb [signed-by=/usr/share/keyrings/cuda-archive-keyring.gpg] https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /
/etc/apt/sources.list.d/nvidia-container-toolkit.list:deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/stable/deb/$(ARCH) /
/etc/apt/sources.list.d/nvidia-container-toolkit.list:deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://nvidia.github.io/libnvidia-container/experimental/deb/$(ARCH) /
ubuntu@ubuntu:~$ sudo nvidia-ctk runtime configure --runtime=docker
INFO[0000] Loading config from /etc/docker/daemon.json  
INFO[0000] Wrote updated config to /etc/docker/daemon.json 
INFO[0000] It is recommended that docker daemon be restarted. 
ubuntu@ubuntu:~$ cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
ubuntu@ubuntu:~$ sudo systemctl restart docker
ubuntu@ubuntu:~$ docker run --runtime=nvidia --gpus 1 --rm -v ./models/:/models -v ./audios:/audios -v ./outputs:/outputs cuda-app:latest
docker: Error response from daemon: unknown or invalid runtime name: nvidia.
See 'docker run --help'.
ubuntu@ubuntu:~$ docker images
REPOSITORY    TAG                       IMAGE ID       CREATED         SIZE
ubuntu        latest                    3db8720ecbf5   2 weeks ago     77.9MB
cuda-app      latest                    0978724b7806   3 weeks ago     2.75GB
nvidia/cuda   12.3.1-base-ubuntu20.04   d13839a3c4fb   2 months ago    246MB
hello-world   latest                    d2c94e258dcb   10 months ago   13.3kB
ubuntu@ubuntu:~$ docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.3.1-base-ubuntu20.04 nvidia-smi
docker: Error response from daemon: unknown or invalid runtime name: nvidia.
See 'docker run --help'.
ubuntu@ubuntu:~$ docker --version
Docker version 24.0.7, build afdd53b
ubuntu@ubuntu:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Sep__8_19:17:24_PDT_2023
Cuda compilation tools, release 12.3, V12.3.52
Build cuda_12.3.r12.3/compiler.33281558_0
ubuntu@ubuntu:~$ nvidia-smi
Tue Feb 27 15:23:00 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        On  | 00000000:04:00.0 Off |                  N/A |
| 34%   25C    P8              12W / 350W |      1MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090        On  | 00000000:07:00.0 Off |                  N/A |
|  0%   45C    P8              26W / 350W |      1MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Thanks a lot for your help with this issue!

elezar commented 9 months ago

Hi. How was Docker installed? Is this Docker Desktop or Docker Engine?

MorphSeur commented 9 months ago

Thanks for your reply!

It is a Docker Engine installation.

elezar commented 8 months ago

I would say that the primary issue is that you're not able to configure the nvidia runtime for your docker installation. It could be that the config file is not being used and that arguments to the docker daemon are being used instead. Could you confirm whether this is the case?

MorphSeur commented 8 months ago

Thanks a lot for your reply.

The nvidia runtime was set up by following the documentation. May I know which config file you mean? Is it /etc/nvidia-container-runtime/config.toml?

elezar commented 8 months ago

The documentation is valid if the Docker daemon is using the /etc/docker/daemon.json config file. If the daemon is configured through another mechanism or uses a different config file, the instructions need to be adapted. How is your docker daemon configured?
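
One way to find out which config file the daemon actually reads is to inspect the dockerd command line. The following is a minimal shell sketch, not part of the toolkit: the helper name dockerd_config_file is hypothetical, and it assumes the daemon either receives dockerd's --config-file flag or falls back to the default /etc/docker/daemon.json.

```shell
# Print the config file that a dockerd command line points at.
# $1: the full dockerd command line as a single string.
# Falls back to the default path when --config-file is absent.
dockerd_config_file() {
    cfg=/etc/docker/daemon.json
    # Split the command line on whitespace (paths containing spaces are not handled).
    set -- $1
    while [ $# -gt 0 ]; do
        case "$1" in
            --config-file=*) cfg=${1#--config-file=} ;;
            --config-file)   [ $# -ge 2 ] && cfg=$2 ;;
        esac
        shift
    done
    printf '%s\n' "$cfg"
}
```

On a systemd host the function could be fed the live command line, for example dockerd_config_file "$(ps -o args= -C dockerd)". Running systemctl cat docker.service additionally shows the unit file and any drop-ins that add such flags.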

MorphSeur commented 8 months ago

Thanks for your reply.

The daemon was configured using sudo nvidia-ctk runtime configure --runtime=docker. Here is the resulting /etc/docker/daemon.json:

{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}

elezar commented 8 months ago

@MorphSeur if this config was being applied, then the following error would not be triggered:

docker: Error response from daemon: unknown or invalid runtime name: nvidia.

This is triggered for all of your examples, indicating that the runtime is not being configured correctly. To address this we would have to understand what is non-standard about your docker installation. Do the docker daemon logs (journalctl -xu docker.service) show any messages related to the config or the runtimes when the daemon is (re)started?

Are you perhaps running a rootless docker so that the instructions from https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#rootless-mode may need to be followed?
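
Both questions can also be checked from docker info. This is a minimal sketch, assuming docker info --format '{{json .Runtimes}}' prints the registered runtimes as a JSON object keyed by runtime name, and that a rootless daemon lists rootless among its security options. The helper names has_runtime and is_rootless are hypothetical, and the matching is a plain substring test rather than a JSON parser.

```shell
# Return 0 if the runtimes JSON in $1 has a key named $2.
# Crude substring match on "name": and not a real JSON parse.
has_runtime() {
    case "$1" in
        *"\"$2\":"*) return 0 ;;
        *)           return 1 ;;
    esac
}

# Return 0 if the security options text in $1 mentions rootless mode.
is_rootless() {
    case "$1" in
        *rootless*) return 0 ;;
        *)          return 1 ;;
    esac
}
```

A possible use is has_runtime "$(docker info --format '{{json .Runtimes}}')" nvidia || echo "nvidia runtime not registered", and likewise is_rootless "$(docker info --format '{{.SecurityOptions}}')" && echo "rootless daemon". If the runtime is missing from this list even though /etc/docker/daemon.json declares it, the running daemon is most likely not reading that file.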

MorphSeur commented 8 months ago

Yes, the Docker daemon logs contain errors related to the runtime; they are similar to the error above.

Feb 27 11:11:45 ubuntu dockerd[949569]: time="2024-02-27T11:11:45.380223199+01:00" level=error msg="stream copy error: reading from a closed fifo"
Feb 27 11:11:45 ubuntu dockerd[949569]: time="2024-02-27T11:11:45.380234811+01:00" level=error msg="stream copy error: reading from a closed fifo"
Feb 27 11:11:45 ubuntu dockerd[949569]: time="2024-02-27T11:11:45.487476417+01:00" level=error msg="Handler for POST /v1.43/containers/87f622a9e91d4ff977b7684e279badd8345e63ed91ffa10923fa34080754d593/start returned error: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'\nnvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown"
Feb 27 11:11:50 ubuntu dockerd[949569]: time="2024-02-27T11:11:50.315445730+01:00" level=error msg="stream copy error: reading from a closed fifo"
Feb 27 11:11:50 ubuntu dockerd[949569]: time="2024-02-27T11:11:50.315497738+01:00" level=error msg="stream copy error: reading from a closed fifo"
Feb 27 11:11:50 ubuntu dockerd[949569]: time="2024-02-27T11:11:50.419085301+01:00" level=error msg="Handler for POST /v1.43/containers/0f60f75fd2c4b678fa25aa43b0323a4822d295acade31e2c29ee4670e66d58cb/start returned error: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'\nnvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown"
Feb 27 11:21:51 ubuntu dockerd[949569]: time="2024-02-27T11:21:51.244479007+01:00" level=error msg="stream copy error: reading from a closed fifo"
Feb 27 11:21:51 ubuntu dockerd[949569]: time="2024-02-27T11:21:51.244519553+01:00" level=error msg="stream copy error: reading from a closed fifo"
Feb 27 11:21:51 ubuntu dockerd[949569]: time="2024-02-27T11:21:51.434034408+01:00" level=error msg="Handler for POST /v1.43/containers/0b3069f08e0eb41d9c0f5967fa113c07fa5d27d72792f3b4a4d05e57a8225851/start returned error: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'\nnvidia-container-cli: initialization error: nvml error: driver/library version mismatch: unknown"
Feb 27 12:01:05 ubuntu systemd[1]: Stopping Docker Application Container Engine...

As for Docker, it is running as root.
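
The hook failures in this log report nvml error: driver/library version mismatch, which generally means that the loaded NVIDIA kernel module and the user-space NVML library come from different driver versions, for example after a driver upgrade without a reboot. Below is a minimal sketch for reading the kernel module version, assuming /proc/driver/nvidia/version contains an NVRM version line with a dotted version number. The helper name nvrm_version is hypothetical.

```shell
# Read the text of /proc/driver/nvidia/version on stdin and print the
# first dotted version number found on the "NVRM version" line.
nvrm_version() {
    grep 'NVRM version' | grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}
```

Running nvrm_version < /proc/driver/nvidia/version gives a value that can be compared with the driver version reported by nvidia-smi (535.161.07 in the output earlier in this thread). If the two differ, rebooting or reloading the NVIDIA kernel modules is the usual way to bring them back in line.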