Open vaskokj opened 1 year ago
The NVIDIA Container Toolkit and its components are intended to make GPUs available when running containers, not when building them.
Setting the default runtime does work, but injecting the NVIDIA GPU driver libraries into the container at this stage breaks the portability of the image and also leaves many driver (.so.1) files in the container that may cause issues when it is run.
Is there a specific reason that you're trying to run GPU code when building a container?
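To illustrate the portability point, the following sketch (the image name and library paths are illustrative assumptions, not taken from this thread) shows one way to check whether driver libraries were baked into an image that was built while the nvidia runtime was the default:

# Hypothetical image built while "default-runtime": "nvidia" was set.
docker build -t myimage .

# Run it with the plain runc runtime so nothing is injected at run time,
# then look for driver libraries that were copied in during the build.
docker run --rm --runtime=runc myimage \
    sh -c 'find /usr/lib -name "libcuda.so*" -o -name "libnvidia-ml.so*"'

Any matches are libraries that were captured at build time and now ship with the image.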
We aren't really building anything GPU-specific; we just ran into builds failing that had "worked in the past".
We have a build server, and it got updated to the latest Docker. People were running their docker builds, and one of them had a RUN nvidia-smi
just as a sanity check to validate that the NVIDIA libraries / nvidia-docker setup was present and working properly (I don't believe they are actually compiling anything in this case), but moving to the later Docker version broke that build.
It was a case of "we upgraded to a later version of Docker and now our Dockerfile no longer works".
A few other people have been hit unexpectedly by this: https://forums.developer.nvidia.com/t/nvidia-driver-is-not-available-on-latest-docker/246265 https://stackoverflow.com/a/75629058/3314194
The NVIDIA Container Toolkit and its components are intended to make GPUs available when running containers, not when building them.
Setting the default runtime does work, but injecting the NVIDIA GPU driver libraries into the container at this stage breaks the portability of the image and also leaves many driver (.so.1) files in the container that may cause issues when it is run.
Is there a specific reason that you're trying to run GPU code when building a container?
OK, so we ran into a use case: we are using our own containers and have to build Apex (https://github.com/NVIDIA/apex) inside the container. What are our options?
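One option that is often suggested for this kind of case is to build Apex against the CUDA toolkit in a -devel base image and set TORCH_CUDA_ARCH_LIST explicitly, so the extension build does not need to probe a physical GPU at docker build time. The sketch below is only an assumption-laden example, not something confirmed in this thread: the base image tag, PyTorch wheel, architecture list, and install flags are illustrative and should be checked against the Apex README.

# Sketch only: tags, versions and architectures are assumptions.
FROM nvidia/cuda:11.6.2-devel-ubuntu20.04

RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip git && rm -rf /var/lib/apt/lists/*

# A PyTorch build matching the CUDA toolkit in the base image.
RUN pip3 install torch --index-url https://download.pytorch.org/whl/cu116

# Listing target architectures up front avoids querying a GPU during the build.
ENV TORCH_CUDA_ARCH_LIST="7.0;7.5;8.0;8.6"

RUN git clone https://github.com/NVIDIA/apex /opt/apex && \
    cd /opt/apex && \
    pip3 install -v --no-build-isolation \
        --config-settings "--build-option=--cpp_ext" \
        --config-settings "--build-option=--cuda_ext" ./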
I am having the same issue. It only occurs on Docker >= 23.0. Is there any concrete solution on the latest Docker?
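For context (this summarizes the linked reports and a commonly suggested workaround, not an official fix): Docker 23.0 switched the default builder to BuildKit, and BuildKit does not honour the nvidia default-runtime configured in /etc/docker/daemon.json, so GPU commands in a Dockerfile stop working even though docker run is unaffected. Falling back to the legacy builder for the affected build keeps the old behaviour:

# Workaround sketch: disable BuildKit for this one build so the
# "default-runtime": "nvidia" setting in /etc/docker/daemon.json applies again.
DOCKER_BUILDKIT=0 docker build -t nvidiadockertest -f Dockerfile .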
1. Issue or feature description
Getting the following error when running a docker build.
ERROR: failed to solve: process "/bin/sh -c nvidia-smi" did not complete successfully: exit code: 127
2. Steps to reproduce the issue
docker build -t nvidiadockertest -f Dockerfile .
This will generate the error shown above.
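The Dockerfile itself is not reproduced in this report, but from the successful build log further below it appears to be the minimal two-line file:

FROM nvidia/cuda:11.6.2-base-ubuntu20.04
RUN nvidia-smi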
3. Information to attach (optional if deemed irrelevant)
nvidia-container-cli -k -d /dev/tty info
uname -a
Linux pc4 5.15.0-69-generic #76~20.04.1-Ubuntu SMP Mon Mar 20 15:54:19 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
docker version
Server: Docker Engine - Community
 Engine:
  Version:          23.0.4
  API version:      1.42 (minimum version 1.12)
  Go version:       go1.19.8
  Git commit:       cbce331
  Built:            Fri Apr 14 10:32:23 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.20
  GitCommit:        2806fc1057397dbaeefbea0e4e17bddfbd388f38
 nvidia:
  Version:          1.1.5
  GitCommit:        v1.1.5-0-gf19387a
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
nvidia-container-cli -V
cli-version: 1.13.0
lib-version: 1.13.0
build date: 2023-03-31T13:12+00:00
build revision: 20823911e978a50b33823a5783f92b6e345b241a
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
Downgrading Docker to 20.10.24 restores the old behaviour:

sudo apt-get purge docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
VERSION_STRING=5:20.10.24~3-0~ubuntu-focal
sudo apt-get install docker-ce=$VERSION_STRING docker-ce-cli=$VERSION_STRING containerd.io docker-buildx-plugin docker-compose-plugin
$ cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
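As a quick sanity check (a hedged example; the exact docker info output format varies between versions), the configured default runtime can be confirmed after restarting the daemon:

# Restart the daemon so daemon.json changes are picked up, then check the
# runtime list and the configured default runtime.
sudo systemctl restart docker
docker info | grep -i runtime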
$ docker build -t testdocker -f Dockerfile .
Sending build context to Docker daemon  4.096kB
Step 1/2 : FROM nvidia/cuda:11.6.2-base-ubuntu20.04
 ---> 1aa90fd39a71
Step 2/2 : RUN nvidia-smi
 ---> Running in 92d37956523c
Fri Apr 21 15:32:14 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:B3:00.0  On |                  N/A |
| 27%   33C    P0    42W / 180W |    462MiB /  8192MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
Removing intermediate container 92d37956523c
 ---> 0956df366f8c
Successfully built 0956df366f8c
Successfully tagged testdocker:latest