hdwmp123 opened 1 year ago
Hi @hdwmp123. Which version of the NVIDIA Container Toolkit are you using?
Hi @elezar
I am currently facing the same or at least very similar error in a rootless docker setup:
docker run --runtime=nvidia hello-world
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/user/5004/docker/containerd/daemon/io.containerd.runtime.v2.task/moby/41d7c05c72f1d1113886eede371c1cdd745c6e3bf36011c38b40c2f50e547459/log.json: no such file or directory): /usr/bin/nvidia-container-runtime did not terminate successfully: exit status 1: unknown.
ERRO[0000] error waiting for container:
When using runc as the runtime, the hello world container works.
We are using version 1.13. When running /usr/bin/nvidia-container-runtime run hello-world (not sure what this would do), we get the error ERRO[0000] runc run failed: JSON specification file config.json not found. If we create a file config.json in the current working directory with {} as content, we get the error ERRO[0000] runc run failed: process property must not be empty. I don't know exactly what is going wrong here; maybe you have an idea?
@MaxiBoether the nvidia-container-runtime is a shim for runc or another OCI-compliant runtime and does not implement the docker CLI.
Please provide the following:
1) The output of nvidia-smi on the host
2) The debug = lines in the /etc/nvidia-container-runtime/config.toml file
You could also bump the log-level for the runtime to "debug" to produce more logs. Please attach the generated nvidia-container-toolkit.log and nvidia-container-runtime.log files to the issue.
@elezar
Yes, I understand that it does not implement the docker CLI. I only ran this experiment because the output of docker run --runtime=nvidia hello-world (as given above) mentioned exit status 1 of /usr/bin/nvidia-container-runtime, and I wanted to investigate what the problem might be since there was no clear error message.
1)
maxilocal4@sgs-gpu04:~$ nvidia-smi
Wed May 24 14:34:09 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02 Driver Version: 530.30.02 CUDA Version: 12.1 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3090 On | 00000000:23:00.0 Off | N/A |
| 0% 62C P2 247W / 350W| 14231MiB / 24576MiB | 59% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 3090 On | 00000000:41:00.0 Off | N/A |
| 0% 58C P2 243W / 350W| 14357MiB / 24576MiB | 75% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 2 NVIDIA GeForce RTX 3090 On | 00000000:A1:00.0 Off | N/A |
| 0% 62C P2 283W / 350W| 14587MiB / 24576MiB | 71% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 3 NVIDIA GeForce RTX 3090 On | 00000000:C1:00.0 Off | N/A |
| 0% 59C P2 254W / 350W| 14587MiB / 24576MiB | 66% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
maxilocal4@sgs-gpu04:~$ uname -a
Linux sgs-gpu04.ethz.ch 5.15.0-67-generic #74-Ubuntu SMP Wed Feb 22 14:14:39 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
This is the output of nvidia-smi and uname -a on the host machine. I hope that helps.
2) My config.toml looks like this:
disable-require = false
[nvidia-container-cli]
environment = []
ldconfig = "@/sbin/ldconfig.real"
load-kmods = true
no-cgroups = true
[nvidia-container-runtime]
debug = "/local/home/maxilocal4/log/nvidia-container-runtime.log"
# levels => debug, info, warning, error
log-level = "debug"
runtimes = ["runc"]
mode = "auto"
[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
The log files are not created for me, probably because the containers never start in the first place?
Note that this explicitly affects rootless docker. When running docker as root, everything works. Rootless docker with runc
also works.
Okay, we figured out what the problem was: the debug log file was not writable due to filesystem permissions. Maybe it would be helpful to emit a more verbose error message when writing to the log file fails?
@MaxiBoether do you mean that the original error was caused by the log file not being writable, or the fact that the log wasn't being generated?
Update: I have created https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/merge_requests/404 that ignores the error when opening / creating any log files. Would you be able to test these changes and verify that they stop the behaviour you were seeing?