NVIDIA / nvidia-docker

Build and run Docker containers leveraging NVIDIA GPUs
Apache License 2.0

suse tumbleweed & nvidia-container-toolkit & could not select device driver "" #1377

Closed: s4s0l closed this issue 4 years ago

s4s0l commented 4 years ago

1. Issue or feature description

On Tumbleweed (I know it's not supported) I'm unable to run:

→ docker run --rm --gpus all nvidia/cuda:latest nvidia-smi
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

Generally I'm not expecting a solution, but I would rather like to understand how all of this is supposed to work together. My current finding is that my Docker is not using /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json at all: I can put any nonsense in there and nothing complains about it. I would like to understand why. Since there is very little documentation on the Docker side about how it uses OCI hooks, I do not know where to look for an explanation. What should pick the hook up, and under what circumstances? Is it Docker itself, or runc, or something else? I see that in some packages on other distros the hooks are placed in different paths, like `/usr/share/containers/docker/...` or `/etc/containers/...`. I tried different versions of runc and some other random things, but after reading the documentation for Docker, the NVIDIA repos, and the OCI specs, I still cannot figure out how this is supposed to work. I would appreciate it if someone could find a moment to write down how the NVIDIA tools are integrated with Docker. How does Docker pick a GPU "driver"? What makes the NVIDIA hook trigger only for containers started with `--gpus`? And so on.
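
(For reference, the hook file in question is a small JSON definition of roughly this shape; illustrative only, not copied verbatim from my system:)

  {
    "version": "1.0.0",
    "hook": {
      "path": "/usr/bin/nvidia-container-toolkit",
      "args": ["nvidia-container-toolkit", "prestart"]
    },
    "when": { "always": true },
    "stages": ["prestart"]
  }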

The driver seems to be running fine as far as I can tell (games, CUDA-based ML, Blender all work). The issues I could find relate to Docker not being restarted after installation of the toolkit, or Docker being installed via snap; neither applies in my case.

2. Steps to reproduce the issue

Install Docker, the NVIDIA drivers, and nvidia-container-toolkit, then run a container with `--gpus`.
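
A minimal reproduction sketch (assuming the Tumbleweed package names from this report and that the NVIDIA driver is already installed):

  sudo zypper in docker nvidia-container-toolkit
  sudo systemctl restart docker
  docker run --rm --gpus all nvidia/cuda:latest nvidia-smi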

3. Information


 - [x] Driver information from `nvidia-smi -a`
[nvidia-smi.txt](https://github.com/NVIDIA/nvidia-docker/files/5149229/nvidia-smi.txt)
 - [x] Docker version from `docker version`
[docker-version.txt](https://github.com/NVIDIA/nvidia-docker/files/5149232/docker-version.txt)
 - [x] NVIDIA packages version from `dpkg -l '*nvidia*'` _or_ `rpm -qa '*nvidia*'`
[zypper-packages.txt](https://github.com/NVIDIA/nvidia-docker/files/5149243/zypper-packages.txt)

→ rpm -qa '*nvidia*'
kernel-firmware-nvidia-20200807-1.2.noarch
libnvidia-container-static-1.1.1-1.3.x86_64
nvidia-container-toolkit-0.0+git.1580519869.60f165a-1.4.x86_64
libnvidia-container-devel-1.1.1-1.3.x86_64
libnvidia-container1-1.1.1-1.3.x86_64
nvidia-gfxG05-kmp-default-450.57_k5.7.9_1-38.2.x86_64
nvidia-glG05-450.57-38.1.x86_64
nvidia-computeG05-450.57-38.1.x86_64
x11-video-nvidiaG05-450.57-38.1.x86_64
libnvidia-container-tools-1.1.1-1.3.x86_64

 - [x] NVIDIA container library version from `nvidia-container-cli -V`

→ nvidia-container-cli -V
version: 1.1.1
build date: 2020-08-25T14:52+00:00
build revision: 1.1.1
build compiler: gcc-10 10.2.1 20200805 [revision dda1e9d08434def88ed86557d08b23251332c5aa]
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -I/usr/include/tirpc -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections


 - [x] NVIDIA container library logs
No logs are created when running the container (see the debugging note after this list).
 - [x] Docker command, image and tag used
Any image run with `--gpus`.
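
A quick way to get library-level debug output directly, since no logs are produced otherwise (a debugging sketch; flags as documented for libnvidia-container, adjust if your version differs):

  sudo nvidia-container-cli -k -d /dev/tty info
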
klueska commented 4 years ago

This is an error from Docker itself, before it ever even tries to invoke the NVIDIA stack. My guess is that you have a mismatch between the version of your docker-cli and your actual docker packages.
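
One quick way to check for such a mismatch (a sketch, not part of the original thread; note that `--gpus` requires Docker 19.03 or newer):

  → docker version --format 'client={{.Client.Version}} server={{.Server.Version}}'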

klueska commented 4 years ago

To check that the nvidia stack is actually working, you can attempt to use the environment variable API instead of the --gpus option (this will require you to install the nvidia-container-runtime package as well though).

docker run --runtime=nvidia --rm -e NVIDIA_VISIBLE_DEVICES=all nvidia/cuda:latest nvidia-smi
klueska commented 4 years ago

This is how the stack fits together: https://github.com/NVIDIA/nvidia-docker/issues/1268#issuecomment-632692949

s4s0l commented 4 years ago

Thanks for the guidelines. IMO the comment linked above should be part of the README.md as-is; it's well worth it.

TL;DR: the sles15.1 repo works on Tumbleweed and fixes my problem.

On Tumbleweed the NVIDIA container tooling comes from the main repo, but there is no nvidia-container-runtime. I took a look at its sources and found that, as far as I could tell, there is nothing special in its RPM spec file that could harm Tumbleweed; the same goes for libnvidia-container and everything else in the nvidia-docker repos. So I just went with the sles15.1 repo and upgraded everything, since the packages in the Tumbleweed repos were a little out of date.

It works.

I still feel my original question remains unresolved: how does Docker "know" that the NVIDIA tooling is installed? At this point it's a purely academic question.

For any SUSE newbie encountering the same problem, below is what I did.

  sudo zypper rm libnvidia-container-static libnvidia-container-devel libnvidia-container-tools libnvidia-container1 nvidia-container-toolkit
  sudo zypper ar https://nvidia.github.io/nvidia-docker/sles15.1/nvidia-docker.repo
  sudo zypper in libnvidia-container1 nvidia-container-runtime

After that:

  → docker run --rm --gpus all nvidia/cuda:latest nvidia-smi
Mon Aug 31 20:20:38 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 450.57       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 166...  Off  | 00000000:65:00.0  On |                  N/A |
|  0%   50C    P8    17W / 120W |    965MiB /  5941MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

The following step is not necessary, as it only registers the nvidia runtime with Docker. The same can also be done by modifying /etc/docker/daemon.json, but I did it this way just for fun.
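
For reference, the daemon.json route is roughly the following (a sketch, assuming nvidia-container-runtime is installed at the path shown; restart Docker afterwards):

  {
    "runtimes": {
      "nvidia": {
        "path": "/usr/bin/nvidia-container-runtime",
        "runtimeArgs": []
      }
    }
  }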

In /usr/lib/systemd/system/docker.service, add `--add-runtime nvidia=/usr/bin/nvidia-container-runtime` to the Docker start command so that it looks like:

  ExecStart=/usr/bin/dockerd --add-runtime nvidia=/usr/bin/nvidia-container-runtime --add-runtime oci=/usr/sbin/docker-runc $DOCKER_NETWORK_OPTIONS $DOCKER_OPTS
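
An alternative that survives package updates is a systemd drop-in (a sketch with the same dockerd options, created via `sudo systemctl edit docker`; the empty ExecStart= line clears the original command):

  [Service]
  ExecStart=
  ExecStart=/usr/bin/dockerd --add-runtime nvidia=/usr/bin/nvidia-container-runtime --add-runtime oci=/usr/sbin/docker-runc $DOCKER_NETWORK_OPTIONS $DOCKER_OPTS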

Then:

  sudo systemctl daemon-reload
  sudo systemctl restart docker
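
To confirm that the nvidia runtime was registered (a quick check, not part of the original steps):

  → docker info --format '{{json .Runtimes}}'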

After that:

  → docker run --runtime=nvidia --rm -e NVIDIA_VISIBLE_DEVICES=all nvidia/cuda:latest nvidia-smi
Mon Aug 31 20:06:38 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 450.57       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 166...  Off  | 00000000:65:00.0  On |                  N/A |
|  0%   50C    P8    16W / 120W |    990MiB /  5941MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
Medoalmasry commented 1 year ago

@s4s0l I can NOT thank you enough. I have been delving down this rabbit hole for 2 days. Thank you