NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

nvidia-smi is not mounted into container #672

Open · gfrankliu opened this issue 2 weeks ago

gfrankliu commented 2 weeks ago

I am using a Debian 12 VM in GCP with a GPU attached.

gfrankliu-t4-ws ➜  ~ export PATH=/var/lib/nvidia/bin:$PATH
gfrankliu-t4-ws ➜  ~ export LD_LIBRARY_PATH=/var/lib/nvidia/lib64
gfrankliu-t4-ws ➜  ~ which nvidia-smi
/var/lib/nvidia/bin/nvidia-smi
gfrankliu-t4-ws ➜  ~ nvidia-smi
Thu Aug 29 16:49:46 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P8              10W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

I have Docker on the VM (the default installation from Debian 12) and installed the NVIDIA Container Toolkit:

    curl -sL https://nvidia.github.io/libnvidia-container/gpgkey -o /etc/apt/trusted.gpg.d/libnvidia-container.asc && \
    curl -sL https://nvidia.github.io/libnvidia-container/debian11/libnvidia-container.list \
    -o /etc/apt/sources.list.d/nvidia-container-toolkit.list && \
    apt-get update && apt-get install -qy --no-install-recommends \
    nvidia-docker2 nvidia-container-runtime
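
For reference, nvidia-docker2 and nvidia-container-runtime are legacy meta-packages, and the list above points at the debian11 repository from a Debian 12 host. NVIDIA's current install guide instead uses the generic stable deb repository and a single nvidia-container-toolkit package; a sketch of those documented steps, which may need adapting to this image:

    curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
        sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
    curl -sL https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
        sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
        sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
    sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
    # Register the nvidia runtime with Docker and restart the daemon
    sudo nvidia-ctk runtime configure --runtime=docker
    sudo systemctl restart docker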

When I test Docker, the container can't find nvidia-smi:

gfrankliu-t4-ws ➜  ~ docker run --privileged --rm --gpus all -it nvcr.io/nvidia/cuda nvidia-smi 
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "nvidia-smi": executable file not found in $PATH: unknown.
gfrankliu-t4-ws ➜  ~ 
gfrankliu-t4-ws ➜  ~ nvidia-container-cli -k -d /dev/tty info

-- WARNING, the following logs are for debugging purposes only --

I0829 16:57:45.349604 1151 nvc.c:393] initializing library context (version=1.16.1, build=4c2494f16573b585788a42e9c7bee76ecd48c73d)
I0829 16:57:45.350042 1151 nvc.c:364] using root /
I0829 16:57:45.350057 1151 nvc.c:365] using ldcache /etc/ld.so.cache
I0829 16:57:45.350067 1151 nvc.c:366] using unprivileged user 1001:1001
I0829 16:57:45.350117 1151 nvc.c:410] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0829 16:57:45.352081 1151 nvc.c:412] dxcore initialization failed, continuing assuming a non-WSL environment
W0829 16:57:45.353348 1152 nvc.c:273] failed to set inheritable capabilities
W0829 16:57:45.353412 1152 nvc.c:274] skipping kernel modules load due to failure
I0829 16:57:45.353835 1153 rpc.c:71] starting driver rpc service
I0829 16:57:45.366576 1154 rpc.c:71] starting nvcgo rpc service
I0829 16:57:45.368607 1151 nvc_info.c:797] requesting driver information with ''
I0829 16:57:45.370286 1151 nvc_info.c:175] selecting /var/lib/nvidia/lib64/vdpau/libvdpau_nvidia.so.535.183.01
I0829 16:57:45.370531 1151 nvc_info.c:175] selecting /var/lib/nvidia/lib64/libnvoptix.so.535.183.01
I0829 16:57:45.370647 1151 nvc_info.c:175] selecting /var/lib/nvidia/lib64/libnvidia-tls.so.535.183.01
I0829 16:57:45.370735 1151 nvc_info.c:175] selecting /var/lib/nvidia/lib64/libnvidia-rtcore.so.535.183.01
I0829 16:57:45.370850 1151 nvc_info.c:175] selecting /var/lib/nvidia/lib64/libnvidia-ptxjitcompiler.so.535.183.01
I0829 16:57:45.370980 1151 nvc_info.c:175] selecting /var/lib/nvidia/lib64/libnvidia-pkcs11.so.535.183.01
I0829 16:57:45.371066 1151 nvc_info.c:175] selecting /var/lib/nvidia/lib64/libnvidia-pkcs11-openssl3.so.535.183.01
I0829 16:57:45.371169 1151 nvc_info.c:175] selecting /var/lib/nvidia/lib64/libnvidia-opticalflow.so.535.183.01
I0829 16:57:45.371347 1151 nvc_info.c:175] selecting /var/lib/nvidia/lib64/libnvidia-opencl.so.535.183.01
I0829 16:57:45.371460 1151 nvc_info.c:175] selecting /var/lib/nvidia/lib64/libnvidia-nvvm.so.535.183.01
I0829 16:57:45.371617 1151 nvc_info.c:175] selecting /var/lib/nvidia/lib64/libnvidia-ngx.so.535.183.01
I0829 16:57:45.371702 1151 nvc_info.c:175] selecting /var/lib/nvidia/lib64/libnvidia-ml.so.535.183.01
I0829 16:57:45.371854 1151 nvc_info.c:175] selecting /var/lib/nvidia/lib64/libnvidia-glvkspirv.so.535.183.01
I0829 16:57:45.371950 1151 nvc_info.c:175] selecting /var/lib/nvidia/lib64/libnvidia-glsi.so.535.183.01
I0829 16:57:45.372049 1151 nvc_info.c:175] selecting /var/lib/nvidia/lib64/libnvidia-glcore.so.535.183.01
I0829 16:57:45.372180 1151 nvc_info.c:175] selecting /var/lib/nvidia/lib64/libnvidia-fbc.so.535.183.01
I0829 16:57:45.372336 1151 nvc_info.c:175] selecting /var/lib/nvidia/lib64/libnvidia-encode.so.535.183.01
I0829 16:57:45.372501 1151 nvc_info.c:175] selecting /var/lib/nvidia/lib64/libnvidia-eglcore.so.535.183.01
I0829 16:57:45.372605 1151 nvc_info.c:175] selecting /var/lib/nvidia/lib64/libnvidia-cfg.so.535.183.01
I0829 16:57:45.372740 1151 nvc_info.c:175] selecting /var/lib/nvidia/lib64/libnvidia-allocator.so.535.183.01
I0829 16:57:45.372867 1151 nvc_info.c:175] selecting /var/lib/nvidia/lib64/libnvcuvid.so.535.183.01
I0829 16:57:45.373173 1151 nvc_info.c:175] selecting /var/lib/nvidia/lib64/libcudadebugger.so.535.183.01
I0829 16:57:45.373262 1151 nvc_info.c:175] selecting /var/lib/nvidia/lib64/libcuda.so.535.183.01
I0829 16:57:45.373529 1151 nvc_info.c:175] selecting /var/lib/nvidia/lib64/libGLX_nvidia.so.535.183.01
I0829 16:57:45.373651 1151 nvc_info.c:175] selecting /var/lib/nvidia/lib64/libGLESv2_nvidia.so.535.183.01
I0829 16:57:45.373728 1151 nvc_info.c:175] selecting /var/lib/nvidia/lib64/libGLESv1_CM_nvidia.so.535.183.01
I0829 16:57:45.373830 1151 nvc_info.c:175] selecting /var/lib/nvidia/lib64/libEGL_nvidia.so.535.183.01
W0829 16:57:45.373908 1151 nvc_info.c:401] missing library libnvidia-nscq.so
W0829 16:57:45.373931 1151 nvc_info.c:401] missing library libnvidia-gpucomp.so
W0829 16:57:45.373946 1151 nvc_info.c:401] missing library libnvidia-fatbinaryloader.so
W0829 16:57:45.373963 1151 nvc_info.c:401] missing library libnvidia-compiler.so
W0829 16:57:45.373977 1151 nvc_info.c:401] missing library libnvidia-ifr.so
W0829 16:57:45.373992 1151 nvc_info.c:401] missing library libnvidia-cbl.so
W0829 16:57:45.374003 1151 nvc_info.c:405] missing compat32 library libnvidia-ml.so
W0829 16:57:45.374020 1151 nvc_info.c:405] missing compat32 library libnvidia-cfg.so
W0829 16:57:45.374035 1151 nvc_info.c:405] missing compat32 library libnvidia-nscq.so
W0829 16:57:45.374052 1151 nvc_info.c:405] missing compat32 library libcuda.so
W0829 16:57:45.374067 1151 nvc_info.c:405] missing compat32 library libcudadebugger.so
W0829 16:57:45.374084 1151 nvc_info.c:405] missing compat32 library libnvidia-opencl.so
W0829 16:57:45.374102 1151 nvc_info.c:405] missing compat32 library libnvidia-gpucomp.so
W0829 16:57:45.374122 1151 nvc_info.c:405] missing compat32 library libnvidia-ptxjitcompiler.so
W0829 16:57:45.374142 1151 nvc_info.c:405] missing compat32 library libnvidia-fatbinaryloader.so
W0829 16:57:45.374159 1151 nvc_info.c:405] missing compat32 library libnvidia-allocator.so
W0829 16:57:45.374178 1151 nvc_info.c:405] missing compat32 library libnvidia-compiler.so
W0829 16:57:45.374195 1151 nvc_info.c:405] missing compat32 library libnvidia-pkcs11.so
W0829 16:57:45.374207 1151 nvc_info.c:405] missing compat32 library libnvidia-pkcs11-openssl3.so
W0829 16:57:45.374222 1151 nvc_info.c:405] missing compat32 library libnvidia-nvvm.so
W0829 16:57:45.374233 1151 nvc_info.c:405] missing compat32 library libnvidia-ngx.so
W0829 16:57:45.374250 1151 nvc_info.c:405] missing compat32 library libvdpau_nvidia.so
W0829 16:57:45.374266 1151 nvc_info.c:405] missing compat32 library libnvidia-encode.so
W0829 16:57:45.374281 1151 nvc_info.c:405] missing compat32 library libnvidia-opticalflow.so
W0829 16:57:45.374295 1151 nvc_info.c:405] missing compat32 library libnvcuvid.so
W0829 16:57:45.374309 1151 nvc_info.c:405] missing compat32 library libnvidia-eglcore.so
W0829 16:57:45.374322 1151 nvc_info.c:405] missing compat32 library libnvidia-glcore.so
W0829 16:57:45.374335 1151 nvc_info.c:405] missing compat32 library libnvidia-tls.so
W0829 16:57:45.374344 1151 nvc_info.c:405] missing compat32 library libnvidia-glsi.so
W0829 16:57:45.374353 1151 nvc_info.c:405] missing compat32 library libnvidia-fbc.so
W0829 16:57:45.374379 1151 nvc_info.c:405] missing compat32 library libnvidia-ifr.so
W0829 16:57:45.374394 1151 nvc_info.c:405] missing compat32 library libnvidia-rtcore.so
W0829 16:57:45.374409 1151 nvc_info.c:405] missing compat32 library libnvoptix.so
W0829 16:57:45.374421 1151 nvc_info.c:405] missing compat32 library libGLX_nvidia.so
W0829 16:57:45.374438 1151 nvc_info.c:405] missing compat32 library libEGL_nvidia.so
W0829 16:57:45.374454 1151 nvc_info.c:405] missing compat32 library libGLESv2_nvidia.so
W0829 16:57:45.374478 1151 nvc_info.c:405] missing compat32 library libGLESv1_CM_nvidia.so
W0829 16:57:45.374494 1151 nvc_info.c:405] missing compat32 library libnvidia-glvkspirv.so
W0829 16:57:45.374508 1151 nvc_info.c:405] missing compat32 library libnvidia-cbl.so
I0829 16:57:45.374582 1151 nvc_info.c:301] selecting /var/lib/nvidia/bin/nvidia-smi
I0829 16:57:45.374652 1151 nvc_info.c:301] selecting /var/lib/nvidia/bin/nvidia-debugdump
I0829 16:57:45.374718 1151 nvc_info.c:301] selecting /var/lib/nvidia/bin/nvidia-persistenced
I0829 16:57:45.374834 1151 nvc_info.c:301] selecting /var/lib/nvidia/bin/nvidia-cuda-mps-control
I0829 16:57:45.374896 1151 nvc_info.c:301] selecting /var/lib/nvidia/bin/nvidia-cuda-mps-server
W0829 16:57:45.375410 1151 nvc_info.c:427] missing binary nv-fabricmanager
W0829 16:57:45.375509 1151 nvc_info.c:470] missing firmware path /usr/lib/firmware/nvidia/535.183.01/gsp*.bin
I0829 16:57:45.375569 1151 nvc_info.c:560] listing device /dev/nvidiactl
I0829 16:57:45.375581 1151 nvc_info.c:560] listing device /dev/nvidia-uvm
I0829 16:57:45.375598 1151 nvc_info.c:560] listing device /dev/nvidia-uvm-tools
I0829 16:57:45.375612 1151 nvc_info.c:560] listing device /dev/nvidia-modeset
W0829 16:57:45.375759 1151 nvc_info.c:351] missing ipc path /var/run/nvidia-persistenced/socket
W0829 16:57:45.375927 1151 nvc_info.c:351] missing ipc path /var/run/nvidia-fabricmanager/socket
W0829 16:57:45.375977 1151 nvc_info.c:351] missing ipc path /tmp/nvidia-mps
I0829 16:57:45.375999 1151 nvc_info.c:853] requesting device information with ''
I0829 16:57:45.387354 1151 nvc_info.c:744] listing device /dev/nvidia0 (GPU-46b9201e-aeac-3e52-2772-4ffd67561693 at 00000000:00:04.0)
NVRM version:   535.183.01
CUDA version:   12.2

Device Index:   0
Device Minor:   0
Model:          Tesla T4
Brand:          Nvidia
GPU UUID:       GPU-46b9201e-aeac-3e52-2772-4ffd67561693
Bus Location:   00000000:00:04.0
Architecture:   7.5
I0829 16:57:45.387443 1151 nvc.c:452] shutting down library context
I0829 16:57:45.387530 1154 rpc.c:95] terminating nvcgo rpc service
I0829 16:57:45.388109 1151 rpc.c:135] nvcgo rpc service terminated successfully
I0829 16:57:45.393061 1153 rpc.c:95] terminating driver rpc service
I0829 16:57:45.393287 1151 rpc.c:135] driver rpc service terminated successfully
gfrankliu-t4-ws ➜  ~ 

It seems the NVIDIA Container Toolkit doesn't handle the case where the cloud image installs the NVIDIA driver under /var/lib/nvidia. If I manually volume-mount /var/lib/nvidia from the host into the container, it works:

gfrankliu-t4-ws ➜  ~ docker run --privileged --rm --gpus all -it -v /var/lib/nvidia:/usr/local/nvidia nvcr.io/nvidia/cuda nvidia-smi 
Thu Aug 29 17:00:18 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P8               9W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
gfrankliu-t4-ws ➜  ~ 
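
The mount target matters here: the nvcr.io/nvidia/cuda images set PATH and LD_LIBRARY_PATH to include /usr/local/nvidia/bin and /usr/local/nvidia/lib64, so a driver tree mounted at /usr/local/nvidia is picked up without any toolkit involvement. A quick way to check those image defaults:

    # Print the search paths baked into the CUDA base image
    docker run --rm nvcr.io/nvidia/cuda printenv PATH LD_LIBRARY_PATH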

How can I tell nvidia-container-toolkit to mount nvidia-smi into the container automatically? The toolkit only seems to pick up nvidia-smi from /usr/bin; for example, if I manually copy the binaries from /var/lib/nvidia/bin to /usr/bin, nvidia-smi is then mounted into the container:

gfrankliu-t4-ws ➜  ~ sudo cp -a /var/lib/nvidia/bin/* /usr/bin
gfrankliu-t4-ws ➜  ~ docker run --privileged --rm --gpus all -it nvcr.io/nvidia/cuda nvidia-smi
Thu Aug 29 17:03:15 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   37C    P8               9W /  70W |      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
gfrankliu-t4-ws ➜  ~ 
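
Worth noting: the nvidia-container-cli debug run above only finds /var/lib/nvidia/bin/nvidia-smi because the interactive shell exports PATH=/var/lib/nvidia/bin:$PATH; the hook launched by dockerd does not inherit that PATH, which is presumably why nothing gets injected. One documented way to point the toolkit at a driver installed outside the standard locations is CDI; a sketch, assuming toolkit 1.16's nvidia-ctk and untested on this particular GCP image:

    # Generate a CDI spec from the non-standard driver root
    sudo nvidia-ctk cdi generate --driver-root=/var/lib/nvidia --output=/etc/cdi/nvidia.yaml
    # Switch the nvidia runtime to CDI mode
    sudo nvidia-ctk config --in-place --set nvidia-container-runtime.mode=cdi
    # Request the GPU by its CDI name (resolved via /etc/cdi/nvidia.yaml)
    docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=nvidia.com/gpu=all \
        nvcr.io/nvidia/cuda nvidia-smi

The root option under [nvidia-container-cli] in /etc/nvidia-container-runtime/config.toml may also be relevant for relocated driver installs, though it expects a full driver root filesystem rather than a bin/lib64 layout.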
gfrankliu commented 2 weeks ago
# Docker as shipped with Debian 12
gfrankliu-t4-ws ➜  ~ docker version                
Client:
 Version:           20.10.24+dfsg1
 API version:       1.41
 Go version:        go1.19.8
 Git commit:        297e128
 Built:             Thu May 18 08:38:34 2023
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.24+dfsg1
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.19.8
  Git commit:       5d6db84
  Built:            Thu May 18 08:38:34 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.20~ds1
  GitCommit:        1.6.20~ds1-1+b1
 runc:
  Version:          1.1.5+ds1
  GitCommit:        1.1.5+ds1-1+deb12u1
 docker-init:
  Version:          0.19.0
  GitCommit:        
gfrankliu-t4-ws ➜  ~ nvidia-container-cli --version
cli-version: 1.16.1
lib-version: 1.16.1
build date: 2024-07-23T14:57+00:00
build revision: 4c2494f16573b585788a42e9c7bee76ecd48c73d
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
gfrankliu-t4-ws ➜  ~