This is an error from docker itself, before it ever even tries to invoke the nvidia stack. My guess is that you have a mismatch between the versions of your docker-cli and docker packages.
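One quick way to check for such a mismatch (a sketch, assuming a standard package install and a running daemon) is to compare the versions the client and the daemon report:

```sh
# Print the client and server (daemon) versions side by side; a client that
# predates 19.03 will not know the --gpus flag at all.
docker version --format 'client={{.Client.Version}} server={{.Server.Version}}'
```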
To check that the nvidia stack is actually working, you can attempt to use the environment-variable API instead of the --gpus option (this will require you to install the nvidia-container-runtime package as well, though):
docker run --runtime=nvidia --rm -e NVIDIA_VISIBLE_DEVICES=all nvidia/cuda:latest nvidia-smi
This is how the stack fits together: https://github.com/NVIDIA/nvidia-docker/issues/1268#issuecomment-632692949
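A quick way to see which runtimes the daemon has actually registered (and therefore whether --runtime=nvidia can work at all) is something like:

```sh
# Lists the OCI runtimes known to the docker daemon; 'nvidia' should show up
# here once nvidia-container-runtime is installed and registered.
docker info --format '{{.Runtimes}}'
```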
Thanks for the guidelines. IMO the comment above should be part of the README.md as-is; it's well worth it.
TL;DR: the sles15.1 repo works on Tumbleweed and fixes my problem.
On Tumbleweed the nvidia container tooling comes from the main repo, but there is no nvidia-container-runtime package. I took a look at its sources and found that, as far as I could tell, there is nothing special in its rpm spec file that could harm Tumbleweed. The same goes for libnvidia-container and everything else in the nvidia-docker repos. So I just went with the sles15.1 repo and upgraded everything, as the packages in the Tumbleweed repos were a little out of date.
It works.
I still feel like my original question remains unresolved: how does docker "know" that the nvidia tooling is installed? At this point it's a purely academic question.
For any SUSE newbie encountering the same problem, below is what I did.
sudo zypper rm libnvidia-container-static libnvidia-container-devel libnvidia-container-tools libnvidia-container1 nvidia-container-toolkit
sudo zypper ar https://nvidia.github.io/nvidia-docker/sles15.1/nvidia-docker.repo
sudo zypper in libnvidia-container1 nvidia-container-runtime
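Optionally, to double-check that the packages now really come from the sles15.1 repo (a sketch, assuming zypper's standard search options):

```sh
# -s shows version and repository, -i restricts the search to installed packages.
zypper se -si 'libnvidia-container*' 'nvidia-container*'
```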
After that:
→ docker run --rm --gpus all nvidia/cuda:latest nvidia-smi
Mon Aug 31 20:20:38 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57 Driver Version: 450.57 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 166... Off | 00000000:65:00.0 On | N/A |
| 0% 50C P8 17W / 120W | 965MiB / 5941MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
This step is not necessary, as it only registers the nvidia runtime with docker. It can also be done by modifying /etc/docker/daemon.json, but I did it this way just for fun.
In /usr/lib/systemd/system/docker.service, add --add-runtime nvidia=/usr/bin/nvidia-container-runtime to the docker start command so it looks like:
ExecStart=/usr/bin/dockerd --add-runtime nvidia=/usr/bin/nvidia-container-runtime --add-runtime oci=/usr/sbin/docker-runc $DOCKER_NETWORK_OPTIONS $DOCKER_OPTS
Then:
sudo systemctl daemon-reload
sudo systemctl restart docker
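For reference, the daemon.json route mentioned above would look roughly like this (a sketch; the runtime path is the default install location, and this overwrites an existing daemon.json, so merge by hand if you already have one):

```sh
# Register the nvidia runtime via /etc/docker/daemon.json instead of
# patching the systemd unit, then restart docker to pick it up.
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
sudo systemctl restart docker
```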
After that:
→ docker run --runtime=nvidia --rm -e NVIDIA_VISIBLE_DEVICES=all nvidia/cuda:latest nvidia-smi
Mon Aug 31 20:06:38 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.57 Driver Version: 450.57 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce GTX 166... Off | 00000000:65:00.0 On | N/A |
| 0% 50C P8 16W / 120W | 990MiB / 5941MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
@s4s0l I can NOT thank you enough. I have been delving down this rabbit hole for 2 days. Thank you
1. Issue or feature description
On Tumbleweed (I know it's not supported) I'm unable to run a container with --gpus.
Generally I'm not expecting a solution, but rather would like to understand how all of this is supposed to work together. My current finding is that my docker is not using /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json at all: I can put any nonsense in there and nothing complains about it. I would like to understand why. As there is very little documentation on the docker side about how it uses OCI hooks, I do not know where to look for an explanation. What is supposed to pick that file up, and under what circumstances? Is it docker itself, or runc, or something else? I see that in some packages on other distros the hooks are placed in different paths, like '/usr/share/containers/docker/...' or '/etc/containers/...'. I tried different versions of runc and some more random things, but after reading the documentation of docker, the nvidia repos, and the OCI specs, I still cannot figure out how it is supposed to work. I would appreciate it if someone could find a moment to write down how the nvidia tools are integrated with docker. How does docker pick the gpu "driver"? What makes the nvidia hook trigger only for containers started with --gpus? And so on.

The driver itself seems to be running fine as far as I can tell (games, CUDA-based ML, blender). The issues I could find relate to docker not being restarted after installation of the toolkit, or to docker being installed via snap; neither is my case.
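A minimal sketch of what to look at on disk when poking at this question (binary names are those shipped by the libnvidia-container / nvidia-container-toolkit packages; treat them as assumptions):

```sh
# Where the nvidia container pieces live on this system.
command -v nvidia-container-cli nvidia-container-toolkit nvidia-container-runtime-hook
# The OCI hook file that docker appears to ignore entirely.
cat /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json
```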
2. Steps to reproduce the issue
Install docker, the nvidia drivers, and nvidia-container-toolkit; run a container with --gpus (as in the sketch below).
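A minimal reproduction command, assuming the cuda image used elsewhere in this issue:

```sh
# This is the kind of --gpus invocation that fails for me on Tumbleweed.
docker run --rm --gpus all nvidia/cuda:latest nvidia-smi
```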
3. Information
- Output of nvidia-container-cli -k -d /dev/tty info: attached as nvidia-container-cli.txt
- uname -a
- dmesg
→ rpm -qa '*nvidia*'
kernel-firmware-nvidia-20200807-1.2.noarch
libnvidia-container-static-1.1.1-1.3.x86_64
nvidia-container-toolkit-0.0+git.1580519869.60f165a-1.4.x86_64
libnvidia-container-devel-1.1.1-1.3.x86_64
libnvidia-container1-1.1.1-1.3.x86_64
nvidia-gfxG05-kmp-default-450.57_k5.7.9_1-38.2.x86_64
nvidia-glG05-450.57-38.1.x86_64
nvidia-computeG05-450.57-38.1.x86_64
x11-video-nvidiaG05-450.57-38.1.x86_64
libnvidia-container-tools-1.1.1-1.3.x86_64
→ nvidia-container-cli -V
version: 1.1.1
build date: 2020-08-25T14:52+00:00
build revision: 1.1.1
build compiler: gcc-10 10.2.1 20200805 [revision dda1e9d08434def88ed86557d08b23251332c5aa]
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -I/usr/include/tirpc -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections