NVIDIA / nvidia-docker

Build and run Docker containers leveraging NVIDIA GPUs
Apache License 2.0

suse tumbleweed & nvidia-container-toolkit & could not select device driver "" #1377

Closed: s4s0l closed this issue 4 years ago

s4s0l commented 4 years ago

1. Issue or feature description

On Tumbleweed (I know it's not supported) I'm unable to run:

→ docker run --rm --gpus all nvidia/cuda:latest nvidia-smi
docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].

Generally I'm not expecting a solution, but I would rather like to understand how all of this is supposed to work together. My current finding is that my Docker is not using /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json at all: I can put any nonsense in there and nothing complains about it. I would like to understand why. Since there is very little documentation on the Docker side about how it uses OCI hooks, I do not know where to look for an explanation. What should pick the hook up, and under what circumstances? Is it Docker itself, or runc, or something else? I see that in some packages on other distros the hooks are placed in different paths, like `/usr/share/containers/docker/...` or `/etc/containers/...`. I tried different versions of runc and some other random things, but after reading the documentation for Docker, the NVIDIA repos, and the OCI specs, I still cannot figure out how this is supposed to work. I would appreciate it if someone could find a moment to write down how the NVIDIA tools are integrated with Docker. How does Docker pick a GPU "driver"? What makes the NVIDIA hook trigger only for containers started with `--gpus`? And so on.
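
(For reference, the hook file in question is a small JSON definition of roughly this shape; illustrative only, not copied verbatim from my system:)

  {
    "version": "1.0.0",
    "hook": {
      "path": "/usr/bin/nvidia-container-toolkit",
      "args": ["nvidia-container-toolkit", "prestart"]
    },
    "when": { "always": true },
    "stages": ["prestart"]
  }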

The driver seems to be running fine as far as I can tell (games, CUDA-based ML, Blender all work). The issues I could find relate to Docker not being restarted after installation of the toolkit, or Docker being installed via snap; neither applies in my case.

2. Steps to reproduce the issue

Install Docker, the NVIDIA drivers, and nvidia-container-toolkit, then run a container with `--gpus`.
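
A minimal reproduction sketch (assuming the Tumbleweed package names from this report and that the NVIDIA driver is already installed):

  sudo zypper in docker nvidia-container-toolkit
  sudo systemctl restart docker
  docker run --rm --gpus all nvidia/cuda:latest nvidia-smi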

3. Information


 - [x] Driver information from `nvidia-smi -a`
[nvidia-smi.txt](https://github.com/NVIDIA/nvidia-docker/files/5149229/nvidia-smi.txt)
 - [x] Docker version from `docker version`
[docker-version.txt](https://github.com/NVIDIA/nvidia-docker/files/5149232/docker-version.txt)
 - [x] NVIDIA packages version from `dpkg -l '*nvidia*'` _or_ `rpm -qa '*nvidia*'`
[zypper-packages.txt](https://github.com/NVIDIA/nvidia-docker/files/5149243/zypper-packages.txt)

→ rpm -qa '*nvidia*'
kernel-firmware-nvidia-20200807-1.2.noarch
libnvidia-container-static-1.1.1-1.3.x86_64
nvidia-container-toolkit-0.0+git.1580519869.60f165a-1.4.x86_64
libnvidia-container-devel-1.1.1-1.3.x86_64
libnvidia-container1-1.1.1-1.3.x86_64
nvidia-gfxG05-kmp-default-450.57_k5.7.9_1-38.2.x86_64
nvidia-glG05-450.57-38.1.x86_64
nvidia-computeG05-450.57-38.1.x86_64
x11-video-nvidiaG05-450.57-38.1.x86_64
libnvidia-container-tools-1.1.1-1.3.x86_64

 - [x] NVIDIA container library version from `nvidia-container-cli -V`

→ nvidia-container-cli -V
version: 1.1.1
build date: 2020-08-25T14:52+00:00
build revision: 1.1.1
build compiler: gcc-10 10.2.1 20200805 [revision dda1e9d08434def88ed86557d08b23251332c5aa]
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -I/usr/include/tirpc -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections


 - [x] NVIDIA container library logs
No logs are created when running the container (see the debugging note after this list).
 - [x] Docker command, image and tag used
Any image run with `--gpus`.
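
A quick way to get library-level debug output directly, since no logs are produced otherwise (a debugging sketch; flags as documented for libnvidia-container, adjust if your version differs):

  sudo nvidia-container-cli -k -d /dev/tty info
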
klueska commented 4 years ago

This is an error from Docker itself, before it ever even tries to invoke the NVIDIA stack. My guess is that you have a mismatch between the version of your docker-cli and your actual docker packages.
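
One quick way to check for such a mismatch (a sketch, not part of the original thread; note that `--gpus` requires Docker 19.03 or newer):

  → docker version --format 'client={{.Client.Version}} server={{.Server.Version}}'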

klueska commented 4 years ago

To check that the nvidia stack is actually working, you can attempt to use the environment variable API instead of the --gpus option (this will require you to install the nvidia-container-runtime package as well though).

docker run --runtime=nvidia --rm -e NVIDIA_VISIBLE_DEVICES=all nvidia/cuda:latest nvidia-smi
klueska commented 4 years ago

This is how the stack fits together: https://github.com/NVIDIA/nvidia-docker/issues/1268#issuecomment-632692949

s4s0l commented 4 years ago

Thanks for the guidelines. IMO the comment linked above should be part of the README.md as-is; it's well worth it.

TL;DR: the sles15.1 repo works on Tumbleweed and fixes my problem.

On Tumbleweed the NVIDIA container tooling comes from the main repo, but there is no nvidia-container-runtime. I took a look at its sources and found that, as far as I could tell, there is nothing special in its RPM spec file that could harm Tumbleweed; the same goes for libnvidia-container and everything else in the nvidia-docker repos. So I just went with the sles15.1 repo and upgraded everything, since the packages in the Tumbleweed repos were a little out of date.

It works.

I still feel my original question remains unresolved: how does Docker "know" that the NVIDIA tooling is installed? At this point it's a purely academic question.

For any SUSE newbie encountering the same problem, below is what I did.

  sudo zypper rm libnvidia-container-static libnvidia-container-devel libnvidia-container-tools libnvidia-container1 nvidia-container-toolkit
  sudo zypper ar https://nvidia.github.io/nvidia-docker/sles15.1/nvidia-docker.repo
  sudo zypper in libnvidia-container1 nvidia-container-runtime

After that:

  → docker run --rm --gpus all nvidia/cuda:latest nvidia-smi
Mon Aug 31 20:20:38 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 450.57       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 166...  Off  | 00000000:65:00.0  On |                  N/A |
|  0%   50C    P8    17W / 120W |    965MiB /  5941MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

The following step is not necessary, as it only registers the nvidia runtime with Docker. The same can also be done by modifying /etc/docker/daemon.json, but I did it this way just for fun.
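
For reference, the daemon.json route is roughly the following (a sketch, assuming nvidia-container-runtime is installed at the path shown; restart Docker afterwards):

  {
    "runtimes": {
      "nvidia": {
        "path": "/usr/bin/nvidia-container-runtime",
        "runtimeArgs": []
      }
    }
  }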

In /usr/lib/systemd/system/docker.service, add `--add-runtime nvidia=/usr/bin/nvidia-container-runtime` to the Docker start command so that it looks like:

  ExecStart=/usr/bin/dockerd --add-runtime nvidia=/usr/bin/nvidia-container-runtime --add-runtime oci=/usr/sbin/docker-runc $DOCKER_NETWORK_OPTIONS $DOCKER_OPTS
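
An alternative that survives package updates is a systemd drop-in (a sketch with the same dockerd options, created via `sudo systemctl edit docker`; the empty ExecStart= line clears the original command):

  [Service]
  ExecStart=
  ExecStart=/usr/bin/dockerd --add-runtime nvidia=/usr/bin/nvidia-container-runtime --add-runtime oci=/usr/sbin/docker-runc $DOCKER_NETWORK_OPTIONS $DOCKER_OPTS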

Then:

  sudo systemctl daemon-reload
  sudo systemctl restart docker
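
To confirm that the nvidia runtime was registered (a quick check, not part of the original steps):

  → docker info --format '{{json .Runtimes}}'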

After that:

  → docker run --runtime=nvidia --rm -e NVIDIA_VISIBLE_DEVICES=all nvidia/cuda:latest nvidia-smi
Mon Aug 31 20:06:38 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57       Driver Version: 450.57       CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 166...  Off  | 00000000:65:00.0  On |                  N/A |
|  0%   50C    P8    16W / 120W |    990MiB /  5941MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
Medoalmasry commented 1 year ago

@s4s0l I can NOT thank you enough. I have been delving down this rabbit hole for 2 days. Thank you