NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

failed to create shim task #252

Golchoubian opened this issue 2 years ago

Golchoubian commented 2 years ago

I installed nvidia-docker2 following the instructions. When I run the following command, I get the expected output shown below:

sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.29.05    Driver Version: 495.29.05    CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:0B:00.0  On |                  N/A |
| 24%   31C    P8    13W / 250W |    222MiB / 11011MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

However, running the above command without "sudo" results in the following error:

$ docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
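
(For what it's worth, the NVML library does appear to be present on the host; a quick ldcache check along the following lines — just a sketch, exact paths will differ between systems — can confirm that, as the driver info below also shows.)

$ ldconfig -p | grep libnvidia-ml   # expect an entry pointing at /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1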

Here is some additional information regarding my issue:

 $ nvidia-container-cli -k -d /dev/tty info

-- WARNING, the following logs are for debugging purposes only --

I0620 19:28:50.255712 7268 nvc.c:376] initializing library context (version=1.10.0, build=395fd41701117121f1fd04ada01e1d7e006a37ae)
I0620 19:28:50.255761 7268 nvc.c:350] using root /
I0620 19:28:50.255768 7268 nvc.c:351] using ldcache /etc/ld.so.cache
I0620 19:28:50.255776 7268 nvc.c:352] using unprivileged user 1000:1000
I0620 19:28:50.255801 7268 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0620 19:28:50.255949 7268 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W0620 19:28:50.257621 7269 nvc.c:273] failed to set inheritable capabilities
W0620 19:28:50.257682 7269 nvc.c:274] skipping kernel modules load due to failure
I0620 19:28:50.258008 7270 rpc.c:71] starting driver rpc service
I0620 19:28:50.261063 7271 rpc.c:71] starting nvcgo rpc service
I0620 19:28:50.262220 7268 nvc_info.c:766] requesting driver information with ''
I0620 19:28:50.264525 7268 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.495.29.05
I0620 19:28:50.264601 7268 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.495.29.05
I0620 19:28:50.264647 7268 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.495.29.05
I0620 19:28:50.264693 7268 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.495.29.05
I0620 19:28:50.264758 7268 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.495.29.05
I0620 19:28:50.264821 7268 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.495.29.05
I0620 19:28:50.264869 7268 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.495.29.05
I0620 19:28:50.264914 7268 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.495.29.05
I0620 19:28:50.264979 7268 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.495.29.05
I0620 19:28:50.265021 7268 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.495.29.05
I0620 19:28:50.265061 7268 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.495.29.05
I0620 19:28:50.265104 7268 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.495.29.05
I0620 19:28:50.265189 7268 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.495.29.05
I0620 19:28:50.265255 7268 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.495.29.05
I0620 19:28:50.265301 7268 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.495.29.05
I0620 19:28:50.265350 7268 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.495.29.05
I0620 19:28:50.265429 7268 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.495.29.05
I0620 19:28:50.265501 7268 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.495.29.05
I0620 19:28:50.265975 7268 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.495.29.05
I0620 19:28:50.266302 7268 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.495.29.05
I0620 19:28:50.266350 7268 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.495.29.05
I0620 19:28:50.266394 7268 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.495.29.05
I0620 19:28:50.266444 7268 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.495.29.05
W0620 19:28:50.266522 7268 nvc_info.c:399] missing library libnvidia-nscq.so
W0620 19:28:50.266530 7268 nvc_info.c:399] missing library libcudadebugger.so
W0620 19:28:50.266542 7268 nvc_info.c:399] missing library libnvidia-fatbinaryloader.so
W0620 19:28:50.266551 7268 nvc_info.c:399] missing library libnvidia-pkcs11.so
W0620 19:28:50.266558 7268 nvc_info.c:399] missing library libvdpau_nvidia.so
W0620 19:28:50.266573 7268 nvc_info.c:399] missing library libnvidia-ifr.so
W0620 19:28:50.266586 7268 nvc_info.c:399] missing library libnvidia-cbl.so
W0620 19:28:50.266593 7268 nvc_info.c:403] missing compat32 library libnvidia-ml.so
W0620 19:28:50.266609 7268 nvc_info.c:403] missing compat32 library libnvidia-cfg.so
W0620 19:28:50.266616 7268 nvc_info.c:403] missing compat32 library libnvidia-nscq.so
W0620 19:28:50.266627 7268 nvc_info.c:403] missing compat32 library libcuda.so
W0620 19:28:50.266634 7268 nvc_info.c:403] missing compat32 library libcudadebugger.so
W0620 19:28:50.266641 7268 nvc_info.c:403] missing compat32 library libnvidia-opencl.so
W0620 19:28:50.266648 7268 nvc_info.c:403] missing compat32 library libnvidia-ptxjitcompiler.so
W0620 19:28:50.266662 7268 nvc_info.c:403] missing compat32 library libnvidia-fatbinaryloader.so
W0620 19:28:50.266673 7268 nvc_info.c:403] missing compat32 library libnvidia-allocator.so
W0620 19:28:50.266684 7268 nvc_info.c:403] missing compat32 library libnvidia-compiler.so
W0620 19:28:50.266691 7268 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so
W0620 19:28:50.266705 7268 nvc_info.c:403] missing compat32 library libnvidia-ngx.so
W0620 19:28:50.266716 7268 nvc_info.c:403] missing compat32 library libvdpau_nvidia.so
W0620 19:28:50.266724 7268 nvc_info.c:403] missing compat32 library libnvidia-encode.so
W0620 19:28:50.266731 7268 nvc_info.c:403] missing compat32 library libnvidia-opticalflow.so
W0620 19:28:50.266749 7268 nvc_info.c:403] missing compat32 library libnvcuvid.so
W0620 19:28:50.266765 7268 nvc_info.c:403] missing compat32 library libnvidia-eglcore.so
W0620 19:28:50.266775 7268 nvc_info.c:403] missing compat32 library libnvidia-glcore.so
W0620 19:28:50.266784 7268 nvc_info.c:403] missing compat32 library libnvidia-tls.so
W0620 19:28:50.266791 7268 nvc_info.c:403] missing compat32 library libnvidia-glsi.so
W0620 19:28:50.266802 7268 nvc_info.c:403] missing compat32 library libnvidia-fbc.so
W0620 19:28:50.266811 7268 nvc_info.c:403] missing compat32 library libnvidia-ifr.so
W0620 19:28:50.266821 7268 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so
W0620 19:28:50.266833 7268 nvc_info.c:403] missing compat32 library libnvoptix.so
W0620 19:28:50.266845 7268 nvc_info.c:403] missing compat32 library libGLX_nvidia.so
W0620 19:28:50.266852 7268 nvc_info.c:403] missing compat32 library libEGL_nvidia.so
W0620 19:28:50.266860 7268 nvc_info.c:403] missing compat32 library libGLESv2_nvidia.so
W0620 19:28:50.266870 7268 nvc_info.c:403] missing compat32 library libGLESv1_CM_nvidia.so
W0620 19:28:50.266883 7268 nvc_info.c:403] missing compat32 library libnvidia-glvkspirv.so
W0620 19:28:50.266899 7268 nvc_info.c:403] missing compat32 library libnvidia-cbl.so
I0620 19:28:50.267504 7268 nvc_info.c:299] selecting /usr/bin/nvidia-smi
I0620 19:28:50.267531 7268 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump
I0620 19:28:50.267554 7268 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced
I0620 19:28:50.267594 7268 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control
I0620 19:28:50.267620 7268 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-server
W0620 19:28:50.267715 7268 nvc_info.c:425] missing binary nv-fabricmanager
I0620 19:28:50.267757 7268 nvc_info.c:343] listing firmware path /usr/lib/firmware/nvidia/495.29.05/gsp.bin
I0620 19:28:50.267793 7268 nvc_info.c:529] listing device /dev/nvidiactl
I0620 19:28:50.267800 7268 nvc_info.c:529] listing device /dev/nvidia-uvm
I0620 19:28:50.267814 7268 nvc_info.c:529] listing device /dev/nvidia-uvm-tools
I0620 19:28:50.267827 7268 nvc_info.c:529] listing device /dev/nvidia-modeset
I0620 19:28:50.267861 7268 nvc_info.c:343] listing ipc path /run/nvidia-persistenced/socket
W0620 19:28:50.267888 7268 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket
W0620 19:28:50.267911 7268 nvc_info.c:349] missing ipc path /tmp/nvidia-mps
I0620 19:28:50.267918 7268 nvc_info.c:822] requesting device information with ''
I0620 19:28:50.273834 7268 nvc_info.c:713] listing device /dev/nvidia0 (GPU-9ee0d68c-33fa-4d2e-4654-f463ee6394b8 at 00000000:0b:00.0)
NVRM version:   495.29.05
CUDA version:   11.5

Device Index:   0
Device Minor:   0
Model:          NVIDIA GeForce RTX 2080 Ti
Brand:          GeForce
GPU UUID:       GPU-9ee0d68c-33fa-4d2e-4654-f463ee6394b8
Bus Location:   00000000:0b:00.0
Architecture:   7.5
I0620 19:28:50.273874 7268 nvc.c:434] shutting down library context
I0620 19:28:50.273939 7271 rpc.c:95] terminating nvcgo rpc service
I0620 19:28:50.274644 7268 rpc.c:135] nvcgo rpc service terminated successfully
I0620 19:28:50.275479 7270 rpc.c:95] terminating driver rpc service
I0620 19:28:50.275590 7268 rpc.c:135] driver rpc service terminated successfully

Can you please instruct me on how I can solve this?

elezar commented 2 years ago

@Golchoubian could you enable debug logging in the NVIDIA Container CLI by uncommenting the line:

#debug = "/var/log/nvidia-container-toolkit.log"

in /etc/nvidia-container-runtime/config.toml, repeat the failed run, and then attach the resulting /var/log/nvidia-container-toolkit.log here? From the error message it seems that the ldcache is not being updated correctly in the container.
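
(As a concrete sketch of that — assuming the stock config file shipped by the packages, so adjust if your file differs — enabling the debug log and reproducing would look something like this:)

$ sudo sed -i 's|^#debug = "/var/log/nvidia-container-toolkit.log"|debug = "/var/log/nvidia-container-toolkit.log"|' /etc/nvidia-container-runtime/config.toml
$ docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
$ cat /var/log/nvidia-container-toolkit.log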

We could also run the following to confirm:

docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 bash -c "ldconfig; nvidia-smi"

Could you provide more information on your host system and whether something might be preventing /sbin/ldconfig from being run by the NVIDIA Container Runtime Hook?
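
(A quick way to check the latter — a sketch assuming a default Debian/Ubuntu package install, where the toolkit is normally configured to call the host's /sbin/ldconfig.real — would be:)

$ grep ldconfig /etc/nvidia-container-runtime/config.toml   # on Ubuntu this is typically ldconfig = "@/sbin/ldconfig.real"
$ ls -l /sbin/ldconfig /sbin/ldconfig.real                  # ldconfig is normally a wrapper around ldconfig.real on Ubuntu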

Golchoubian commented 2 years ago

@elezar Despite uncommenting the debug line you mentioned, no nvidia-container-toolkit.log is created for the failed run unless I use sudo again, in which case it produces the attached file: nvidia-container-toolkit.log

When I run the second command you shared, I get the same error:

$ docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 bash -c "ldconfig; nvidia-smi"
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.

Here is more information on my system:

Linux mahsa 5.13.0-48-generic #54~20.04.1-Ubuntu SMP Thu Jun 2 23:37:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

reppertj commented 1 year ago

@elezar

I don't quite have the full story, as I'm unfamiliar with the details of the NVIDIA Container Runtime's architecture, but I was seeing the same behavior (I could run nvidia-smi in containers with sudo but not without, with the same "load library failed: libnvidia-ml.so.1" error) and can add the following, which applied at least in my case.

Importantly, I had installed Docker Desktop (on Ubuntu 22.04), which set the docker CLI's current context to use unix:///home/$USER/.docker/desktop/docker.sock as the docker endpoint:

❯ docker context inspect $(docker context show)
[
    {
        "Name": "desktop-linux",
        "Metadata": {},
        "Endpoints": {
            "docker": {
                "Host": "unix:///home/$USER/.docker/desktop/docker.sock",
                "SkipTLSVerify": false
            }
        },
        "TLSMaterial": {},
        "Storage": {
            "MetadataPath": "/home/$USER/.docker/contexts/meta/fe9c6bd7a66301f49ca9b6a70b217107cd1284598bfc254700c989b916da791e",
            "TLSPath": "/home/$USER/.docker/contexts/tls/fe9c6bd7a66301f49ca9b6a70b217107cd1284598bfc254700c989b916da791e"
        }
    }
]

After switching back to the default context:

docker context use default

This set the docker endpoint back to unix:///var/run/docker.sock, and I was able to run:

docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

as expected.

So it seems that the Docker Desktop docker host is somehow interfering here, at least in my case.
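
(If anyone else runs into this, a quick way to check which endpoint the docker CLI is currently pointed at — just the standard context subcommands, nothing specific to my setup — is:)

$ docker context show   # prints the name of the active context
$ docker context ls     # the active context is marked with an asterisk, alongside its endpoint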

Happy to provide more information here in case it helps.