NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

OCI runtime create failed #68

Open hdwmp123 opened 1 year ago

hdwmp123 commented 1 year ago
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/moby/665f824c082a491ade73297c4deaa21441b43ffa827367e3302d5efaa332fade/log.json: no such file or directory): fork/exec /tmp/.X11-unix: permission denied: <nil>: unknown.
ubuntu 20.04
CUDA Version: 11.2
NVIDIA-SMI 460.106.00
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.106.00   Driver Version: 460.106.00   CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 308...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   41C    P0    29W /  N/A |     10MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1255      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      2426      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+
sudo nvidia-docker run -tid -p 8888:8888 \
    --hostname deepfakes-gpu --name deepfakes-gpu \
    -v /home/administrator/data/deepfakes:/root/faceswap \
    -v /tmp/.X11-unix:/tmp/.X11-unix \
    -e DISPLAY=unix$DISPLAY \
    -e AUDIO_GID=`getent group audio | cut -d: -f3` \
    -e VIDEO_GID=`getent group video | cut -d: -f3` \
    -e GID=`id -g` \
    -e UID=`id -u` \
    deepfakes-gpu
elezar commented 1 year ago

Hi @hdwmp123. Which version of the NVIDIA Container Toolkit are you using?
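
For reference, one way to check the installed version on a Debian-based host (assuming the packages came from apt) is:

dpkg -l | grep nvidia-container        # lists the installed nvidia-container-* package versions
nvidia-container-cli --version         # prints the libnvidia-container CLI version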

MaxiBoether commented 1 year ago

Hi @elezar

I am currently facing the same, or at least a very similar, error in a rootless docker setup:

docker run --runtime=nvidia hello-world
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/user/5004/docker/containerd/daemon/io.containerd.runtime.v2.task/moby/41d7c05c72f1d1113886eede371c1cdd745c6e3bf36011c38b40c2f50e547459/log.json: no such file or directory): /usr/bin/nvidia-container-runtime did not terminate successfully: exit status 1: unknown.
ERRO[0000] error waiting for container:

When using runc as the runtime, the hello world container works.

We are using version 1.13. When running /usr/bin/nvidia-container-runtime run hello-world (not sure what this would do), we get the error ERRO[0000] runc run failed: JSON specification file config.json not found. If we create a file config.json in the current working directory with {} as content, we get the error ERRO[0000] runc run failed: process property must not be empty. I don't know exactly what is going wrong here; maybe you have an idea?
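
For what it's worth, I assume those config.json errors just mean that nvidia-container-runtime, like runc, expects to be started inside an OCI bundle directory rather than with an image name. A rough sketch of such an invocation (the container name test and the use of hello-world as the rootfs source are arbitrary choices here) would be:

mkdir -p bundle/rootfs && cd bundle
docker export $(docker create hello-world) | tar -C rootfs -xf -   # populate the rootfs from an image
runc spec                                                           # generate a default config.json, including the process section
/usr/bin/nvidia-container-runtime run test                          # run the bundle through the shim

So that experiment probably does not tell us much about the rootless problem itself.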

elezar commented 1 year ago

@MaxiBoether the nvidia-container-runtime is a shim for runc or another OCI-compliant runtime and does not implement the docker CLI.
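
For Docker it is normally registered via a runtimes entry in daemon.json, roughly as follows (for rootless Docker the file typically lives at ~/.config/docker/daemon.json rather than /etc/docker/daemon.json):

{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}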

Please provide the following:

  1. More information about your platform, including the output from nvidia-smi on the host
  2. Please enable debug logging by uncommenting / modifying the #debug = lines in the /etc/nvidia-container-runtime/config.toml file (see the snippet below). You could also bump the log-level for the runtime to "debug" to produce more logs. Please attach the generated nvidia-container-toolkit.log and nvidia-container-runtime.log files to the issue.
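
For reference, the relevant parts of /etc/nvidia-container-runtime/config.toml would look roughly like this once debug logging is enabled (the log paths here are just examples):

[nvidia-container-cli]
debug = "/var/log/nvidia-container-toolkit.log"

[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"
log-level = "debug"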
MaxiBoether commented 1 year ago

@elezar

Yes, I understand that it does not implement the docker CLI. I just ran this experiment because the output of docker run --runtime=nvidia hello-world (as given above) mentioned exit status 1 from /usr/bin/nvidia-container-runtime, and I wanted to investigate what the problem might be since there was no clear error message.

1)

maxilocal4@sgs-gpu04:~$ nvidia-smi
Wed May 24 14:34:09 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090         On | 00000000:23:00.0 Off |                  N/A |
|  0%   62C    P2              247W / 350W|  14231MiB / 24576MiB |     59%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090         On | 00000000:41:00.0 Off |                  N/A |
|  0%   58C    P2              243W / 350W|  14357MiB / 24576MiB |     75%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090         On | 00000000:A1:00.0 Off |                  N/A |
|  0%   62C    P2              283W / 350W|  14587MiB / 24576MiB |     71%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce RTX 3090         On | 00000000:C1:00.0 Off |                  N/A |
|  0%   59C    P2              254W / 350W|  14587MiB / 24576MiB |     66%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

maxilocal4@sgs-gpu04:~$ uname -a
Linux sgs-gpu04.ethz.ch 5.15.0-67-generic #74-Ubuntu SMP Wed Feb 22 14:14:39 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

This is the output of nvidia-smi and uname -a on the host machine. I hope that helps.

2) My config.toml looks like this:


disable-require = false

[nvidia-container-cli]
environment = []
ldconfig = "@/sbin/ldconfig.real"
load-kmods = true
no-cgroups = true

[nvidia-container-runtime]
debug = "/local/home/maxilocal4/log/nvidia-container-runtime.log"
# levels => debug, info, warning, error
log-level = "debug"

runtimes = ["runc"]
mode = "auto"

[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

The log files do not get created for me, probably because the containers do not start in the first place?

Note that this specifically affects rootless docker. When running docker as root, everything works. Rootless docker with runc also works.

MaxiBoether commented 1 year ago

Okay, we figured out what the problem was. The debug log file was not writable due to filesystem permissions. Maybe it would be cool to add a more verbose error message if writing to the log file fails?
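
For anyone hitting the same issue, a quick sanity check is to confirm that the configured log path is writable by the user the rootless daemon runs as, e.g. (using the path from the config.toml above):

touch /local/home/maxilocal4/log/nvidia-container-runtime.log && echo writable
ls -ld /local/home/maxilocal4/log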

elezar commented 1 year ago

@MaxiBoether do you mean that the original error was caused by the log file not being writable, or the fact that the log wasn't being generated?

Update: I have created https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/merge_requests/404 that ignores the error when opening / creating any log files. Would you be able to test these changes and verify that they stop the behaviour you were seeing?