NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs

nvidia-smi failed to solve error when running a simple docker build #218

Open vaskokj opened 1 year ago

vaskokj commented 1 year ago

1. Issue or feature description

Getting the following error when running a docker build.

ERROR: failed to solve: process "/bin/sh -c nvidia-smi" did not complete successfully: exit code: 127

2. Steps to reproduce the issue

sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
            sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
            sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
sudo docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
cat <<EOT >> Dockerfile
FROM nvidia/cuda:11.6.2-base-ubuntu20.04
RUN nvidia-smi
EOT

docker build -t nvidiadockertest -f Dockerfile .

This will generate:

docker build -t testdocker -f Dockerfile .
[+] Building 0.4s (5/5) FINISHED                                                                           
 => [internal] load build definition from Dockerfile                                                  0.0s
 => => transferring dockerfile: 94B                                                                   0.0s
 => [internal] load .dockerignore                                                                     0.0s
 => => transferring context: 2B                                                                       0.0s
 => [internal] load metadata for docker.io/nvidia/cuda:11.6.2-base-ubuntu20.04                        0.0s
 => CACHED [1/2] FROM docker.io/nvidia/cuda:11.6.2-base-ubuntu20.04                                   0.0s
 => ERROR [2/2] RUN nvidia-smi                                                                        0.4s
------                                                                                                     
 > [2/2] RUN nvidia-smi:
#0 0.353 /bin/sh: 1: nvidia-smi: not found
------
Dockerfile:3
--------------------
   1 |     FROM nvidia/cuda:11.6.2-base-ubuntu20.04
   2 |     
   3 | >>> RUN nvidia-smi
   4 |     
--------------------
ERROR: failed to solve: process "/bin/sh -c nvidia-smi" did not complete successfully: exit code: 127

3. Information to attach (optional if deemed irrelevant)

I0421 15:13:46.930494 37148 nvc.c:376] initializing library context (version=1.13.0, build=20823911e978a50b33823a5783f92b6e345b241a)
I0421 15:13:46.930619 37148 nvc.c:350] using root /
I0421 15:13:46.930641 37148 nvc.c:351] using ldcache /etc/ld.so.cache
I0421 15:13:46.930662 37148 nvc.c:352] using unprivileged user 2734715:2734715
I0421 15:13:46.930716 37148 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0421 15:13:46.931319 37148 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W0421 15:13:46.935919 37149 nvc.c:273] failed to set inheritable capabilities
W0421 15:13:46.936063 37149 nvc.c:274] skipping kernel modules load due to failure
I0421 15:13:46.936659 37150 rpc.c:71] starting driver rpc service
I0421 15:13:46.955269 37151 rpc.c:71] starting nvcgo rpc service
I0421 15:13:46.957813 37148 nvc_info.c:796] requesting driver information with ''
I0421 15:13:46.960388 37148 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.525.105.17
I0421 15:13:46.960499 37148 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.525.105.17
I0421 15:13:46.960593 37148 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.525.105.17
I0421 15:13:46.960672 37148 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.525.105.17
I0421 15:13:46.960782 37148 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.525.105.17
I0421 15:13:46.960892 37148 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.525.105.17
I0421 15:13:46.960965 37148 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.525.105.17
I0421 15:13:46.961077 37148 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.525.105.17
I0421 15:13:46.961149 37148 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.525.105.17
I0421 15:13:46.961260 37148 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.525.105.17
I0421 15:13:46.961333 37148 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.525.105.17
I0421 15:13:46.961403 37148 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.525.105.17
I0421 15:13:46.961495 37148 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.525.105.17
I0421 15:13:46.961605 37148 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.525.105.17
I0421 15:13:46.961716 37148 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.525.105.17
I0421 15:13:46.961793 37148 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.525.105.17
I0421 15:13:46.961868 37148 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.525.105.17
I0421 15:13:46.961981 37148 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.525.105.17
I0421 15:13:46.962094 37148 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.525.105.17
I0421 15:13:46.962835 37148 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libcudadebugger.so.525.105.17
I0421 15:13:46.962905 37148 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.525.105.17
I0421 15:13:46.963302 37148 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.525.105.17
I0421 15:13:46.963381 37148 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.525.105.17
I0421 15:13:46.963458 37148 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.525.105.17
I0421 15:13:46.963537 37148 nvc_info.c:174] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.525.105.17
I0421 15:13:46.963672 37148 nvc_info.c:174] selecting /usr/lib/i386-linux-gnu/libnvidia-tls.so.525.105.17
I0421 15:13:46.963746 37148 nvc_info.c:174] selecting /usr/lib/i386-linux-gnu/libnvidia-ptxjitcompiler.so.525.105.17
I0421 15:13:46.963853 37148 nvc_info.c:174] selecting /usr/lib/i386-linux-gnu/libnvidia-opticalflow.so.525.105.17
I0421 15:13:46.963960 37148 nvc_info.c:174] selecting /usr/lib/i386-linux-gnu/libnvidia-opencl.so.525.105.17
I0421 15:13:46.964034 37148 nvc_info.c:174] selecting /usr/lib/i386-linux-gnu/libnvidia-nvvm.so.525.105.17
I0421 15:13:46.964144 37148 nvc_info.c:174] selecting /usr/lib/i386-linux-gnu/libnvidia-ml.so.525.105.17
I0421 15:13:46.964249 37148 nvc_info.c:174] selecting /usr/lib/i386-linux-gnu/libnvidia-glvkspirv.so.525.105.17
I0421 15:13:46.964319 37148 nvc_info.c:174] selecting /usr/lib/i386-linux-gnu/libnvidia-glsi.so.525.105.17
I0421 15:13:46.964388 37148 nvc_info.c:174] selecting /usr/lib/i386-linux-gnu/libnvidia-glcore.so.525.105.17
I0421 15:13:46.964460 37148 nvc_info.c:174] selecting /usr/lib/i386-linux-gnu/libnvidia-fbc.so.525.105.17
I0421 15:13:46.964579 37148 nvc_info.c:174] selecting /usr/lib/i386-linux-gnu/libnvidia-encode.so.525.105.17
I0421 15:13:46.964688 37148 nvc_info.c:174] selecting /usr/lib/i386-linux-gnu/libnvidia-eglcore.so.525.105.17
I0421 15:13:46.964758 37148 nvc_info.c:174] selecting /usr/lib/i386-linux-gnu/libnvidia-compiler.so.525.105.17
I0421 15:13:46.964830 37148 nvc_info.c:174] selecting /usr/lib/i386-linux-gnu/libnvcuvid.so.525.105.17
I0421 15:13:46.964973 37148 nvc_info.c:174] selecting /usr/lib/i386-linux-gnu/libcuda.so.525.105.17
I0421 15:13:46.965112 37148 nvc_info.c:174] selecting /usr/lib/i386-linux-gnu/libGLX_nvidia.so.525.105.17
I0421 15:13:46.965188 37148 nvc_info.c:174] selecting /usr/lib/i386-linux-gnu/libGLESv2_nvidia.so.525.105.17
I0421 15:13:46.965262 37148 nvc_info.c:174] selecting /usr/lib/i386-linux-gnu/libGLESv1_CM_nvidia.so.525.105.17
I0421 15:13:46.965338 37148 nvc_info.c:174] selecting /usr/lib/i386-linux-gnu/libEGL_nvidia.so.525.105.17
W0421 15:13:46.965378 37148 nvc_info.c:400] missing library libnvidia-nscq.so
W0421 15:13:46.965392 37148 nvc_info.c:400] missing library libnvidia-fatbinaryloader.so
W0421 15:13:46.965406 37148 nvc_info.c:400] missing library libnvidia-pkcs11.so
W0421 15:13:46.965417 37148 nvc_info.c:400] missing library libvdpau_nvidia.so
W0421 15:13:46.965432 37148 nvc_info.c:400] missing library libnvidia-ifr.so
W0421 15:13:46.965446 37148 nvc_info.c:400] missing library libnvidia-cbl.so
W0421 15:13:46.965458 37148 nvc_info.c:404] missing compat32 library libnvidia-cfg.so
W0421 15:13:46.965472 37148 nvc_info.c:404] missing compat32 library libnvidia-nscq.so
W0421 15:13:46.965486 37148 nvc_info.c:404] missing compat32 library libcudadebugger.so
W0421 15:13:46.965497 37148 nvc_info.c:404] missing compat32 library libnvidia-fatbinaryloader.so
W0421 15:13:46.965510 37148 nvc_info.c:404] missing compat32 library libnvidia-allocator.so
W0421 15:13:46.965524 37148 nvc_info.c:404] missing compat32 library libnvidia-pkcs11.so
W0421 15:13:46.965543 37148 nvc_info.c:404] missing compat32 library libnvidia-ngx.so
W0421 15:13:46.965556 37148 nvc_info.c:404] missing compat32 library libvdpau_nvidia.so
W0421 15:13:46.965570 37148 nvc_info.c:404] missing compat32 library libnvidia-ifr.so
W0421 15:13:46.965585 37148 nvc_info.c:404] missing compat32 library libnvidia-rtcore.so
W0421 15:13:46.965597 37148 nvc_info.c:404] missing compat32 library libnvoptix.so
W0421 15:13:46.965612 37148 nvc_info.c:404] missing compat32 library libnvidia-cbl.so
I0421 15:13:46.966432 37148 nvc_info.c:300] selecting /usr/bin/nvidia-smi
I0421 15:13:46.966476 37148 nvc_info.c:300] selecting /usr/bin/nvidia-debugdump
I0421 15:13:46.966517 37148 nvc_info.c:300] selecting /usr/bin/nvidia-persistenced
I0421 15:13:46.966585 37148 nvc_info.c:300] selecting /usr/bin/nvidia-cuda-mps-control
I0421 15:13:46.966628 37148 nvc_info.c:300] selecting /usr/bin/nvidia-cuda-mps-server
W0421 15:13:46.966789 37148 nvc_info.c:426] missing binary nv-fabricmanager
I0421 15:13:46.966875 37148 nvc_info.c:486] listing firmware path /lib/firmware/nvidia/525.105.17/gsp_ad10x.bin
I0421 15:13:46.966888 37148 nvc_info.c:486] listing firmware path /lib/firmware/nvidia/525.105.17/gsp_tu10x.bin
I0421 15:13:46.966944 37148 nvc_info.c:559] listing device /dev/nvidiactl
I0421 15:13:46.966957 37148 nvc_info.c:559] listing device /dev/nvidia-uvm
I0421 15:13:46.966970 37148 nvc_info.c:559] listing device /dev/nvidia-uvm-tools
I0421 15:13:46.966983 37148 nvc_info.c:559] listing device /dev/nvidia-modeset
I0421 15:13:46.967043 37148 nvc_info.c:344] listing ipc path /run/nvidia-persistenced/socket
W0421 15:13:46.967090 37148 nvc_info.c:350] missing ipc path /var/run/nvidia-fabricmanager/socket
W0421 15:13:46.967125 37148 nvc_info.c:350] missing ipc path /tmp/nvidia-mps
I0421 15:13:46.967139 37148 nvc_info.c:852] requesting device information with ''
I0421 15:13:46.973925 37148 nvc_info.c:743] listing device /dev/nvidia0 (GPU-fa0d42b9-4cc2-19db-3048-663761d3bc94 at 00000000:b3:00.0)
NVRM version:   525.105.17
CUDA version:   12.0

Device Index:   0
Device Minor:   0
Model:          NVIDIA GeForce GTX 1080
Brand:          GeForce
GPU UUID:       GPU-fa0d42b9-4cc2-19db-3048-663761d3bc94
Bus Location:   00000000:b3:00.0
Architecture:   6.1
I0421 15:13:46.974057 37148 nvc.c:434] shutting down library context
I0421 15:13:46.974152 37151 rpc.c:95] terminating nvcgo rpc service
I0421 15:13:46.975123 37148 rpc.c:135] nvcgo rpc service terminated successfully
I0421 15:13:46.977843 37150 rpc.c:95] terminating driver rpc service
I0421 15:13:46.978078 37148 rpc.c:135] driver rpc service terminated successfully

Linux pc4 5.15.0-69-generic #76~20.04.1-Ubuntu SMP Mon Mar 20 15:54:19 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Server: Docker Engine - Community
 Engine:
  Version:          23.0.4
  API version:      1.42 (minimum version 1.12)
  Go version:       go1.19.8
  Git commit:       cbce331
  Built:            Fri Apr 14 10:32:23 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.20
  GitCommit:        2806fc1057397dbaeefbea0e4e17bddfbd388f38
 nvidia:
  Version:          1.1.5
  GitCommit:        v1.1.5-0-gf19387a
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

 - [ ] NVIDIA container library version from `nvidia-container-cli -V`

cli-version: 1.13.0
lib-version: 1.13.0
build date: 2023-03-31T13:12+00:00
build revision: 20823911e978a50b33823a5783f92b6e345b241a
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections


A workaround I have found is to downgrade Docker to 20.10:

sudo apt-get purge docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
VERSION_STRING=5:20.10.24~3-0~ubuntu-focal
sudo apt-get install docker-ce=$VERSION_STRING docker-ce-cli=$VERSION_STRING containerd.io docker-buildx-plugin docker-compose-plugin


Then edit `/etc/docker/daemon.json` and add `"default-runtime": "nvidia",`:

$ cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}

Restart docker service:

`sudo systemctl restart docker`
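
As a quick check (not part of the original steps, but useful to confirm the daemon picked up the change), the default runtime can be inspected; it should report nvidia after the restart:

# Should print "nvidia" once the daemon has restarted with the new config
docker info --format '{{.DefaultRuntime}}'

# Or list every registered runtime together with the default
docker info | grep -i runtime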

At that point it works as expected:

$ docker build -t testdocker -f Dockerfile .
Sending build context to Docker daemon  4.096kB
Step 1/2 : FROM nvidia/cuda:11.6.2-base-ubuntu20.04
 ---> 1aa90fd39a71
Step 2/2 : RUN nvidia-smi
 ---> Running in 92d37956523c
Fri Apr 21 15:32:14 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:B3:00.0  On |                  N/A |
| 27%   33C    P0    42W / 180W |    462MiB /  8192MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
Removing intermediate container 92d37956523c
 ---> 0956df366f8c
Successfully built 0956df366f8c
Successfully tagged testdocker:latest

elezar commented 1 year ago

The NVIDIA Container Toolkit and its components are intended to make GPUs available when running containers, not when building them.

It does work to set the default runtime, but injecting the NVIDIA GPU driver libraries into the container at this stage breaks the portability of the container and also leaves many (.so.1) files in the image that may cause issues when it is run.

Is there a specific reason that you're trying to run GPU code when building a container?
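
To illustrate the distinction, here is a minimal sketch (the gpu-check image tag is just a placeholder): keep the Dockerfile free of GPU calls and run the sanity check only at run time, where the NVIDIA runtime injects the driver libraries:

# Dockerfile: nothing here needs GPU access at build time
FROM nvidia/cuda:11.6.2-base-ubuntu20.04
# Defer the sanity check to container start
CMD ["nvidia-smi"]

docker build -t gpu-check .
docker run --rm --runtime=nvidia --gpus all gpu-check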

vaskokj commented 1 year ago

> The NVIDIA Container Toolkit and its components are intended to make GPUs available when running containers, not when building them.
>
> It does work to set the default runtime, but injecting the NVIDIA GPU driver libraries into the container at this stage breaks the portability of the container and also leaves many (.so.1) files in the image that may cause issues when it is run.
>
> Is there a specific reason that you're trying to run GPU code when building a container?

So we aren't really building anything GPU-dependent; we just ran into builds failing that had "worked in the past".

We have a build server that got updated to the latest Docker. People were running their docker builds, and one of them had a RUN nvidia-smi step purely as a sanity check to validate that the NVIDIA libraries / nvidia-docker setup was present and working (I don't believe they are actually compiling anything in this case), but moving to the later Docker version broke that build.

It was a case of "we upgraded to a later version of Docker and now our Dockerfile no longer works".

A few other people have been hit unexpectedly by this: https://forums.developer.nvidia.com/t/nvidia-driver-is-not-available-on-latest-docker/246265 https://stackoverflow.com/a/75629058/3314194

vaskokj commented 1 year ago

> The NVIDIA Container Toolkit and its components are intended to make GPUs available when running containers, not when building them.
>
> It does work to set the default runtime, but injecting the NVIDIA GPU driver libraries into the container at this stage breaks the portability of the container and also leaves many (.so.1) files in the image that may cause issues when it is run.
>
> Is there a specific reason that you're trying to run GPU code when building a container?

OK, so we did run into a use case: we are using our own containers and need to build Apex (https://github.com/NVIDIA/apex) inside the container. What are our options?
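
For reference, one option being considered here is sketched below. It is untested and assumes that Apex, like other PyTorch CUDA extensions, can be compiled without a GPU visible at build time by pinning the target architectures via TORCH_CUDA_ARCH_LIST; the base image, torch version, and Apex build flags are placeholders that would need to match the actual environment:

# Sketch only: versions and flags are illustrative, not what our build pins
FROM nvidia/cuda:11.6.2-devel-ubuntu20.04
# The -devel image ships nvcc, which the -base image used above does not
RUN apt-get update && apt-get install -y git python3-pip && rm -rf /var/lib/apt/lists/*
# torch must be importable before Apex builds; pick a wheel matching the image's CUDA version
RUN pip3 install torch
# Pin compute capabilities so setup.py does not try to query a physical GPU
ENV TORCH_CUDA_ARCH_LIST="6.1;7.0;8.0"
# Extension flags (--cpp_ext/--cuda_ext) differ between Apex versions; see the Apex README
RUN git clone https://github.com/NVIDIA/apex /opt/apex && \
    cd /opt/apex && \
    pip3 install -v --no-build-isolation ./

With something along these lines, the GPU (and nvidia-smi) is only required when the resulting image is run, not while it is built.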

youki-sada commented 1 year ago

I am having the same issue. It only occurs with Docker >= 23.0. Is there any concrete solution for the latest Docker?