NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

Cannot access GPUs from inside a container after rebooting a host machine #32

Open · Daig-O opened this issue 2 years ago

Daig-O commented 2 years ago

Hello, I have a GPU-enabled on-prem containerd server with the following software:

- OS: Red Hat Enterprise Linux release 8.6
- NVIDIA driver: 515.65.01
- nvidia-container-toolkit: 1.10.0-1.x86_64
- containerd: v1.6.2

After setting up the cluster, I successfully executed the nvidia-smi command from a sample container using ctr:

# ctr run -t --rm --gpus 0 nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2 sample nvidia-smi
Mon Sep  5 05:21:44 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
...

However, after rebooting the cluster, the same command fails (containerd still recognizes GPU #0, since the run command itself doesn't error out):

# ctr run -t --rm --gpus 0 nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2 sample nvidia-smi
No devices were found

Then, if I execute nvidia-smi on the host machine, nvidia-smi from the container starts working again for some reason:

# nvidia-smi
...
# ctr run -t --rm --gpus 0 nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2 sample nvidia-smi
Mon Sep  5 05:26:30 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+

After rebooting the host, the container can't access the GPU until I execute nvidia-smi on the host.

How do I make the GPU available immediately from the container after a reboot? Am I doing something wrong with the settings?

elezar commented 2 years ago

@Daig-O is your system set up to create the NVIDIA device nodes (/dev/nvidia*) on startup? Note that unless explicitly configured not to, running nvidia-smi creates the required device nodes, which would explain why the container only works after you run it on the host.
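
As a quick sanity check after a reboot, you could list the device nodes, and one common way to have the driver initialized (and the nodes created) at boot is to enable the persistence daemon. This is only a sketch; the nvidia-persistenced systemd unit is normally shipped with the driver packages, but whether it is present and what it is named depends on how the driver was installed:

# check which device nodes exist right after a reboot
# ls -l /dev/nvidia*
# enable the persistence daemon so the driver is initialized at boot (if the unit is available)
# systemctl enable --now nvidia-persistenced
# systemctl status nvidia-persistenced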

The nvidia-container-cli which is invoked from the nvidia-container-runtime-hook is supposed to load the kernel modules and ensure that the device nodes are created, but there may be something that is preventing this from happening.

Could you confirm that the device nodes don't exist after rebooting? Then, enable debug logging for the nvidia-container-cli by commenting out the #debug = line in the file /etc/nvidia-container-runtime.toml and include the contents of the file /var/log/nvidia-container-toolkit.log for an unsuccessful attempt.
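
For reference, enabling debug logging amounts to uncommenting the debug lines in /etc/nvidia-container-runtime/config.toml so that they look roughly like the following (these are the default paths shipped with the toolkit; your file may differ slightly):

[nvidia-container-cli]
debug = "/var/log/nvidia-container-toolkit.log"

[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"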

Daig-O commented 2 years ago

@elezar I can see some device nodes under /dev (even when nvidia-smi fails from inside a container).

I uncommented the two debug lines in /etc/nvidia-container-runtime/config.toml and rebooted the server; afterwards I saw the following logs:

-- WARNING, the following logs are for debugging purposes only --

I0915 04:17:37.124058 17003 nvc.c:376] initializing library context (version=1.10.0, build=395fd41701117121f1fd04ada01e1d7e006a37ae)
I0915 04:17:37.124102 17003 nvc.c:350] using root /
I0915 04:17:37.124106 17003 nvc.c:351] using ldcache /etc/ld.so.cache
I0915 04:17:37.124109 17003 nvc.c:352] using unprivileged user 65534:65534
I0915 04:17:37.124122 17003 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0915 04:17:37.124269 17003 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W0915 04:17:37.124843 17003 nvc.c:258] failed to detect NVIDIA devices
I0915 04:17:37.124999 17019 nvc.c:278] loading kernel module nvidia
I0915 04:17:37.125147 17019 nvc.c:282] running mknod for /dev/nvidiactl
I0915 04:17:37.125199 17019 nvc.c:290] running mknod for all nvcaps in /dev/nvidia-caps
I0915 04:17:37.129383 17019 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
I0915 04:17:37.129565 17019 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
I0915 04:17:37.130848 17019 nvc.c:296] loading kernel module nvidia_uvm
I0915 04:17:37.164064 17019 nvc.c:300] running mknod for /dev/nvidia-uvm
I0915 04:17:37.164174 17019 nvc.c:305] loading kernel module nvidia_modeset
I0915 04:17:37.164233 17019 nvc.c:309] running mknod for /dev/nvidia-modeset
I0915 04:17:37.164568 17096 rpc.c:71] starting driver rpc service
I0915 04:17:37.174887 17100 rpc.c:71] starting nvcgo rpc service
I0915 04:17:37.176991 17003 nvc_container.c:240] configuring container with 'utility supervised'
I0915 04:17:37.178217 17003 nvc_container.c:262] setting pid to 16971
I0915 04:17:37.178230 17003 nvc_container.c:263] setting rootfs to /run/containerd/io.containerd.runtime.v2.task/k8s.io/e195c985cba3594ed3939bb303f1eb5612e1557d17ccb4252ba5be397232cb7c/rootfs
I0915 04:17:37.178239 17003 nvc_container.c:264] setting owner to 0:0
I0915 04:17:37.178248 17003 nvc_container.c:265] setting bins directory to /usr/bin
I0915 04:17:37.178251 17003 nvc_container.c:266] setting libs directory to /usr/lib/x86_64-linux-gnu
I0915 04:17:37.178256 17003 nvc_container.c:267] setting libs32 directory to /usr/lib/i386-linux-gnu
I0915 04:17:37.178259 17003 nvc_container.c:268] setting cudart directory to /usr/local/cuda
I0915 04:17:37.178266 17003 nvc_container.c:269] setting ldconfig to @/sbin/ldconfig (host relative)
I0915 04:17:37.178271 17003 nvc_container.c:270] setting mount namespace to /proc/16971/ns/mnt
I0915 04:17:37.178274 17003 nvc_container.c:272] detected cgroupv1
I0915 04:17:37.178281 17003 nvc_container.c:273] setting devices cgroup to /sys/fs/cgroup/devices/system.slice/containerd.service/kubepods-besteffort-podf65510fa_2571_4adc_9856_071d1f405e6c.slice:cri-containerd:e195c985cba3594ed3939bb303f1eb5612e1557d17ccb4252ba5be397232cb7c
I0915 04:17:37.178289 17003 nvc_info.c:766] requesting driver information with ''
I0915 04:17:37.186148 17003 nvc_info.c:173] selecting /usr/lib64/vdpau/libvdpau_nvidia.so.515.65.01
I0915 04:17:37.187800 17003 nvc_info.c:173] selecting /usr/lib64/libnvoptix.so.515.65.01
I0915 04:17:37.188023 17003 nvc_info.c:173] selecting /usr/lib64/libnvidia-tls.so.515.65.01
I0915 04:17:37.189143 17003 nvc_info.c:173] selecting /usr/lib64/libnvidia-rtcore.so.515.65.01
I0915 04:17:37.190180 17003 nvc_info.c:173] selecting /usr/lib64/libnvidia-ptxjitcompiler.so.515.65.01
I0915 04:17:37.190388 17003 nvc_info.c:173] selecting /usr/lib64/libnvidia-opticalflow.so.515.65.01
I0915 04:17:37.191414 17003 nvc_info.c:173] selecting /usr/lib64/libnvidia-opencl.so.515.65.01
I0915 04:17:37.192489 17003 nvc_info.c:173] selecting /usr/lib64/libnvidia-ngx.so.515.65.01
I0915 04:17:37.192525 17003 nvc_info.c:173] selecting /usr/lib64/libnvidia-ml.so.515.65.01
I0915 04:17:37.193555 17003 nvc_info.c:173] selecting /usr/lib64/libnvidia-glvkspirv.so.515.65.01
I0915 04:17:37.193996 17003 nvc_info.c:173] selecting /usr/lib64/libnvidia-glsi.so.515.65.01
I0915 04:17:37.195026 17003 nvc_info.c:173] selecting /usr/lib64/libnvidia-glcore.so.515.65.01
I0915 04:17:37.195265 17003 nvc_info.c:173] selecting /usr/lib64/libnvidia-fbc.so.515.65.01
I0915 04:17:37.195485 17003 nvc_info.c:173] selecting /usr/lib64/libnvidia-encode.so.515.65.01
I0915 04:17:37.196497 17003 nvc_info.c:173] selecting /usr/lib64/libnvidia-eglcore.so.515.65.01
I0915 04:17:37.197507 17003 nvc_info.c:173] selecting /usr/lib64/libnvidia-compiler.so.515.65.01
I0915 04:17:37.197755 17003 nvc_info.c:173] selecting /usr/lib64/libnvidia-cfg.so.515.65.01
I0915 04:17:37.198050 17003 nvc_info.c:173] selecting /usr/lib64/libnvidia-allocator.so.515.65.01
I0915 04:17:37.199070 17003 nvc_info.c:173] selecting /usr/lib64/libnvcuvid.so.515.65.01
I0915 04:17:37.199172 17003 nvc_info.c:173] selecting /usr/lib64/libcuda.so.515.65.01
I0915 04:17:37.199824 17003 nvc_info.c:173] selecting /usr/lib64/libGLX_nvidia.so.515.65.01
I0915 04:17:37.200064 17003 nvc_info.c:173] selecting /usr/lib64/libGLESv2_nvidia.so.515.65.01
I0915 04:17:37.200254 17003 nvc_info.c:173] selecting /usr/lib64/libGLESv1_CM_nvidia.so.515.65.01
I0915 04:17:37.200913 17003 nvc_info.c:173] selecting /usr/lib64/libEGL_nvidia.so.515.65.01
W0915 04:17:37.200934 17003 nvc_info.c:399] missing library libnvidia-nscq.so
W0915 04:17:37.200940 17003 nvc_info.c:399] missing library libcudadebugger.so
W0915 04:17:37.200945 17003 nvc_info.c:399] missing library libnvidia-fatbinaryloader.so
W0915 04:17:37.200949 17003 nvc_info.c:399] missing library libnvidia-pkcs11.so
W0915 04:17:37.200951 17003 nvc_info.c:399] missing library libnvidia-ifr.so
W0915 04:17:37.200955 17003 nvc_info.c:399] missing library libnvidia-cbl.so
W0915 04:17:37.200960 17003 nvc_info.c:403] missing compat32 library libnvidia-ml.so
W0915 04:17:37.200963 17003 nvc_info.c:403] missing compat32 library libnvidia-cfg.so
W0915 04:17:37.200966 17003 nvc_info.c:403] missing compat32 library libnvidia-nscq.so
W0915 04:17:37.200969 17003 nvc_info.c:403] missing compat32 library libcuda.so
W0915 04:17:37.200972 17003 nvc_info.c:403] missing compat32 library libcudadebugger.so
W0915 04:17:37.200977 17003 nvc_info.c:403] missing compat32 library libnvidia-opencl.so
W0915 04:17:37.200980 17003 nvc_info.c:403] missing compat32 library libnvidia-ptxjitcompiler.so
W0915 04:17:37.200983 17003 nvc_info.c:403] missing compat32 library libnvidia-fatbinaryloader.so
W0915 04:17:37.200986 17003 nvc_info.c:403] missing compat32 library libnvidia-allocator.so
W0915 04:17:37.200989 17003 nvc_info.c:403] missing compat32 library libnvidia-compiler.so
W0915 04:17:37.200992 17003 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so
W0915 04:17:37.200997 17003 nvc_info.c:403] missing compat32 library libnvidia-ngx.so
W0915 04:17:37.201001 17003 nvc_info.c:403] missing compat32 library libvdpau_nvidia.so
W0915 04:17:37.201004 17003 nvc_info.c:403] missing compat32 library libnvidia-encode.so
W0915 04:17:37.201007 17003 nvc_info.c:403] missing compat32 library libnvidia-opticalflow.so
W0915 04:17:37.201010 17003 nvc_info.c:403] missing compat32 library libnvcuvid.so
W0915 04:17:37.201015 17003 nvc_info.c:403] missing compat32 library libnvidia-eglcore.so
W0915 04:17:37.201022 17003 nvc_info.c:403] missing compat32 library libnvidia-glcore.so
W0915 04:17:37.201025 17003 nvc_info.c:403] missing compat32 library libnvidia-tls.so
W0915 04:17:37.201029 17003 nvc_info.c:403] missing compat32 library libnvidia-glsi.so
W0915 04:17:37.201032 17003 nvc_info.c:403] missing compat32 library libnvidia-fbc.so
W0915 04:17:37.201039 17003 nvc_info.c:403] missing compat32 library libnvidia-ifr.so
W0915 04:17:37.201042 17003 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so
W0915 04:17:37.201052 17003 nvc_info.c:403] missing compat32 library libnvoptix.so
W0915 04:17:37.201055 17003 nvc_info.c:403] missing compat32 library libGLX_nvidia.so
W0915 04:17:37.201059 17003 nvc_info.c:403] missing compat32 library libEGL_nvidia.so
W0915 04:17:37.201063 17003 nvc_info.c:403] missing compat32 library libGLESv2_nvidia.so
W0915 04:17:37.201067 17003 nvc_info.c:403] missing compat32 library libGLESv1_CM_nvidia.so
W0915 04:17:37.201071 17003 nvc_info.c:403] missing compat32 library libnvidia-glvkspirv.so
W0915 04:17:37.201080 17003 nvc_info.c:403] missing compat32 library libnvidia-cbl.so
I0915 04:17:37.201431 17003 nvc_info.c:299] selecting /usr/bin/nvidia-smi
I0915 04:17:37.201468 17003 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump
I0915 04:17:37.201485 17003 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced
I0915 04:17:37.201510 17003 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control
I0915 04:17:37.201527 17003 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-server
W0915 04:17:37.201555 17003 nvc_info.c:425] missing binary nv-fabricmanager
I0915 04:17:37.214300 17003 nvc_info.c:343] listing firmware path /usr/lib/firmware/nvidia/515.65.01/gsp.bin
I0915 04:17:37.214344 17003 nvc_info.c:529] listing device /dev/nvidiactl
I0915 04:17:37.214348 17003 nvc_info.c:529] listing device /dev/nvidia-uvm
I0915 04:17:37.214351 17003 nvc_info.c:529] listing device /dev/nvidia-uvm-tools
I0915 04:17:37.214354 17003 nvc_info.c:529] listing device /dev/nvidia-modeset
W0915 04:17:37.214372 17003 nvc_info.c:349] missing ipc path /var/run/nvidia-persistenced/socket
W0915 04:17:37.214386 17003 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket
W0915 04:17:37.214405 17003 nvc_info.c:349] missing ipc path /tmp/nvidia-mps
I0915 04:17:37.214427 17003 nvc_info.c:822] requesting device information with ''
I0915 04:17:37.214546 17003 nvc_mount.c:366] mounting tmpfs at /run/containerd/io.containerd.runtime.v2.task/k8s.io/e195c985cba3594ed3939bb303f1eb5612e1557d17ccb4252ba5be397232cb7c/rootfs/proc/driver/nvidia
I0915 04:17:37.246699 17003 nvc_mount.c:134] mounting /usr/bin/nvidia-smi at /run/containerd/io.containerd.runtime.v2.task/k8s.io/e195c985cba3594ed3939bb303f1eb5612e1557d17ccb4252ba5be397232cb7c/rootfs/usr/bin/nvidia-smi
I0915 04:17:37.246778 17003 nvc_mount.c:134] mounting /usr/bin/nvidia-debugdump at /run/containerd/io.containerd.runtime.v2.task/k8s.io/e195c985cba3594ed3939bb303f1eb5612e1557d17ccb4252ba5be397232cb7c/rootfs/usr/bin/nvidia-debugdump
I0915 04:17:37.246822 17003 nvc_mount.c:134] mounting /usr/bin/nvidia-persistenced at /run/containerd/io.containerd.runtime.v2.task/k8s.io/e195c985cba3594ed3939bb303f1eb5612e1557d17ccb4252ba5be397232cb7c/rootfs/usr/bin/nvidia-persistenced
I0915 04:17:37.257406 17003 nvc_mount.c:134] mounting /usr/lib64/libnvidia-ml.so.515.65.01 at /run/containerd/io.containerd.runtime.v2.task/k8s.io/e195c985cba3594ed3939bb303f1eb5612e1557d17ccb4252ba5be397232cb7c/rootfs/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.515.65.01
I0915 04:17:37.257508 17003 nvc_mount.c:134] mounting /usr/lib64/libnvidia-cfg.so.515.65.01 at /run/containerd/io.containerd.runtime.v2.task/k8s.io/e195c985cba3594ed3939bb303f1eb5612e1557d17ccb4252ba5be397232cb7c/rootfs/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.515.65.01
I0915 04:17:37.257729 17003 nvc_mount.c:85] mounting /usr/lib/firmware/nvidia/515.65.01/gsp.bin at /run/containerd/io.containerd.runtime.v2.task/k8s.io/e195c985cba3594ed3939bb303f1eb5612e1557d17ccb4252ba5be397232cb7c/rootfs/lib/firmware/nvidia/515.65.01/gsp.bin with flags 0x7
I0915 04:17:37.257776 17003 nvc_mount.c:230] mounting /dev/nvidiactl at /run/containerd/io.containerd.runtime.v2.task/k8s.io/e195c985cba3594ed3939bb303f1eb5612e1557d17ccb4252ba5be397232cb7c/rootfs/dev/nvidiactl
I0915 04:17:37.258024 17003 nvc_ldcache.c:372] executing /sbin/ldconfig from host at /run/containerd/io.containerd.runtime.v2.task/k8s.io/e195c985cba3594ed3939bb303f1eb5612e1557d17ccb4252ba5be397232cb7c/rootfs
I0915 04:17:37.379468 17003 nvc.c:434] shutting down library context
I0915 04:17:37.379547 17100 rpc.c:95] terminating nvcgo rpc service
I0915 04:17:37.380115 17003 rpc.c:135] nvcgo rpc service terminated successfully
I0915 04:17:37.381320 17096 rpc.c:95] terminating driver rpc service
I0915 04:17:37.381565 17003 rpc.c:135] driver rpc service terminated successfully

I ran nvidia-smi from a container and from the host, but neither produced any additional log output.

Daig-O commented 2 years ago

This might be hardware-specific. I have tried the same thing on another server with the same versions of the OS, Kubernetes, NVIDIA driver, container runtime, etc., and everything worked well. The only difference is the built-in graphics: this issue occurs only on the server without built-in graphics (which needs the NVIDIA graphics card for display output).

Is this helpful for finding other causes?

elezar commented 2 years ago

What does nvidia-smi -L show on the host? You mention that the server has built-in graphics. It could be that this is the device being selected due to the --gpus 0 flag passed to ctr. Could you change the index to 1 and see if this addresses the behaviour?
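
For example, something along these lines (the same sample image from above, with only the device index changed):

# nvidia-smi -L
# ctr run -t --rm --gpus 1 nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2 sample nvidia-smi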

Daig-O commented 2 years ago

The machine in question has only a single NVIDIA GPU; there are no built-in graphics:

# nvidia-smi -L
GPU 0: NVIDIA T1000 8GB (UUID: xxxxxx.....)