NVIDIA / nvidia-docker

Build and run Docker containers leveraging NVIDIA GPUs
Apache License 2.0

"Failed to initialize NVML: Unknown Error" after random amount of time #1671

Closed: iFede94 closed this issue 11 months ago

iFede94 commented 2 years ago

1. Issue or feature description

After a random amount of time (it could be hours or days) the GPUs become unavailable inside all the running containers and nvidia-smi returns "Failed to initialize NVML: Unknown Error". Restarting all the containers fixes the issue and the GPUs become available again. Outside the containers the GPUs still work correctly. I searched the open/closed issues but could not find any solution.

2. Steps to reproduce the issue

All the containers are run with `docker run --gpus all -it tensorflow/tensorflow:latest-gpu /bin/bash`

3. Information to attach

 - [X] NVIDIA container information from `nvidia-container-cli -k -d /dev/tty info`

I0831 10:36:45.129762 2174149 nvc.c:376] initializing library context (version=1.10.0, build=395fd41701117121f1fd04ada01e1d7e006a37ae) I0831 10:36:45.129878 2174149 nvc.c:350] using root / I0831 10:36:45.129892 2174149 nvc.c:351] using ldcache /etc/ld.so.cache I0831 10:36:45.129906 2174149 nvc.c:352] using unprivileged user 1000:1000 I0831 10:36:45.129960 2174149 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL) I0831 10:36:45.130411 2174149 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment W0831 10:36:45.132458 2174150 nvc.c:273] failed to set inheritable capabilities W0831 10:36:45.132555 2174150 nvc.c:274] skipping kernel modules load due to failure I0831 10:36:45.133242 2174151 rpc.c:71] starting driver rpc service I0831 10:36:45.141625 2174152 rpc.c:71] starting nvcgo rpc service I0831 10:36:45.144941 2174149 nvc_info.c:766] requesting driver information with '' I0831 10:36:45.146226 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.515.48.07 I0831 10:36:45.146379 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.515.48.07 I0831 10:36:45.146563 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.515.48.07 I0831 10:36:45.146792 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.515.48.07 I0831 10:36:45.146986 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.515.48.07 I0831 10:36:45.147178 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.515.48.07 I0831 10:36:45.147375 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.515.48.07 I0831 10:36:45.147400 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.515.48.07 I0831 10:36:45.147598 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.515.48.07 I0831 10:36:45.147777 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.515.48.07 I0831 10:36:45.147986 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.515.48.07 I0831 10:36:45.148258 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.515.48.07 I0831 10:36:45.148506 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.515.48.07 I0831 10:36:45.148699 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.515.48.07 I0831 10:36:45.148915 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.515.48.07 I0831 10:36:45.148942 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.515.48.07 I0831 10:36:45.149219 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.515.48.07 I0831 10:36:45.149467 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.515.48.07 I0831 10:36:45.149591 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.515.48.07 I0831 10:36:45.149814 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.515.48.07 I0831 10:36:45.149996 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.515.48.07 I0831 10:36:45.150224 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.515.48.07 I0831 10:36:45.150437 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.515.48.07 I0831 10:36:45.150772 2174149 
nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-tls.so.515.48.07 I0831 10:36:45.150978 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ptxjitcompiler.so.515.48.07 I0831 10:36:45.151147 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-opticalflow.so.515.48.07 I0831 10:36:45.151335 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-opencl.so.515.48.07 I0831 10:36:45.151592 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ml.so.515.48.07 I0831 10:36:45.151786 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glvkspirv.so.515.48.07 I0831 10:36:45.151970 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glsi.so.515.48.07 I0831 10:36:45.152225 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glcore.so.515.48.07 I0831 10:36:45.152480 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-fbc.so.515.48.07 I0831 10:36:45.152791 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-encode.so.515.48.07 I0831 10:36:45.152999 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-eglcore.so.515.48.07 I0831 10:36:45.153254 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-compiler.so.515.48.07 I0831 10:36:45.153580 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvcuvid.so.515.48.07 I0831 10:36:45.153853 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libcuda.so.515.48.07 I0831 10:36:45.154063 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLX_nvidia.so.515.48.07 I0831 10:36:45.154259 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLESv2_nvidia.so.515.48.07 I0831 10:36:45.154473 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLESv1_CM_nvidia.so.515.48.07 I0831 10:36:45.154696 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libEGL_nvidia.so.515.48.07 W0831 10:36:45.154723 2174149 nvc_info.c:399] missing library libnvidia-nscq.so W0831 10:36:45.154726 2174149 nvc_info.c:399] missing library libcudadebugger.so W0831 10:36:45.154729 2174149 nvc_info.c:399] missing library libnvidia-fatbinaryloader.so W0831 10:36:45.154731 2174149 nvc_info.c:399] missing library libnvidia-pkcs11.so W0831 10:36:45.154733 2174149 nvc_info.c:399] missing library libvdpau_nvidia.so W0831 10:36:45.154735 2174149 nvc_info.c:399] missing library libnvidia-ifr.so W0831 10:36:45.154737 2174149 nvc_info.c:399] missing library libnvidia-cbl.so W0831 10:36:45.154739 2174149 nvc_info.c:403] missing compat32 library libnvidia-cfg.so W0831 10:36:45.154741 2174149 nvc_info.c:403] missing compat32 library libnvidia-nscq.so W0831 10:36:45.154743 2174149 nvc_info.c:403] missing compat32 library libcudadebugger.so W0831 10:36:45.154746 2174149 nvc_info.c:403] missing compat32 library libnvidia-fatbinaryloader.so W0831 10:36:45.154748 2174149 nvc_info.c:403] missing compat32 library libnvidia-allocator.so W0831 10:36:45.154750 2174149 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so W0831 10:36:45.154752 2174149 nvc_info.c:403] missing compat32 library libnvidia-ngx.so W0831 10:36:45.154754 2174149 nvc_info.c:403] missing compat32 library libvdpau_nvidia.so W0831 10:36:45.154756 2174149 nvc_info.c:403] missing compat32 library libnvidia-ifr.so W0831 10:36:45.154758 2174149 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so W0831 10:36:45.154760 2174149 nvc_info.c:403] missing compat32 library libnvoptix.so W0831 10:36:45.154762 2174149 nvc_info.c:403] 
missing compat32 library libnvidia-cbl.so I0831 10:36:45.154919 2174149 nvc_info.c:299] selecting /usr/bin/nvidia-smi I0831 10:36:45.154945 2174149 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump I0831 10:36:45.154954 2174149 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced I0831 10:36:45.154970 2174149 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control I0831 10:36:45.154980 2174149 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-server W0831 10:36:45.155027 2174149 nvc_info.c:425] missing binary nv-fabricmanager I0831 10:36:45.155044 2174149 nvc_info.c:343] listing firmware path /usr/lib/firmware/nvidia/515.48.07/gsp.bin I0831 10:36:45.155058 2174149 nvc_info.c:529] listing device /dev/nvidiactl I0831 10:36:45.155061 2174149 nvc_info.c:529] listing device /dev/nvidia-uvm I0831 10:36:45.155063 2174149 nvc_info.c:529] listing device /dev/nvidia-uvm-tools I0831 10:36:45.155065 2174149 nvc_info.c:529] listing device /dev/nvidia-modeset I0831 10:36:45.155080 2174149 nvc_info.c:343] listing ipc path /run/nvidia-persistenced/socket W0831 10:36:45.155092 2174149 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket W0831 10:36:45.155100 2174149 nvc_info.c:349] missing ipc path /tmp/nvidia-mps I0831 10:36:45.155102 2174149 nvc_info.c:822] requesting device information with '' I0831 10:36:45.161039 2174149 nvc_info.c:713] listing device /dev/nvidia0 (GPU-13fd0930-06c3-5975-8720-72c72ee7a823 at 00000000:01:00.0) I0831 10:36:45.166471 2174149 nvc_info.c:713] listing device /dev/nvidia1 (GPU-a76d37d7-5ed0-58d9-6087-b18fee984570 at 00000000:02:00.0) NVRM version: 515.48.07 CUDA version: 11.7

Device Index: 0 Device Minor: 0 Model: NVIDIA GeForce RTX 2080 Ti Brand: GeForce GPU UUID: GPU-13fd0930-06c3-5975-8720-72c72ee7a823 Bus Location: 00000000:01:00.0 Architecture: 7.5

Device Index: 1 Device Minor: 1 Model: NVIDIA GeForce RTX 2080 Ti Brand: GeForce GPU UUID: GPU-a76d37d7-5ed0-58d9-6087-b18fee984570 Bus Location: 00000000:02:00.0 Architecture: 7.5 I0831 10:36:45.166493 2174149 nvc.c:434] shutting down library context I0831 10:36:45.166540 2174152 rpc.c:95] terminating nvcgo rpc service I0831 10:36:45.166751 2174149 rpc.c:135] nvcgo rpc service terminated successfully I0831 10:36:45.167790 2174151 rpc.c:95] terminating driver rpc service I0831 10:36:45.167907 2174149 rpc.c:135] driver rpc service terminated successfully


 - [X] Kernel version from `uname -a`

Linux wds-co-ml 5.15.0-43-generic #46-Ubuntu SMP Tue Jul 12 10:30:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux


 - [X] Driver information from `nvidia-smi -a`

==============NVSMI LOG==============

Timestamp : Wed Aug 31 12:42:55 2022 Driver Version : 515.48.07 CUDA Version : 11.7

Attached GPUs : 2 GPU 00000000:01:00.0 Product Name : NVIDIA GeForce RTX 2080 Ti Product Brand : GeForce Product Architecture : Turing Display Mode : Disabled Display Active : Disabled Persistence Mode : Disabled MIG Mode Current : N/A Pending : N/A Accounting Mode : Disabled Accounting Mode Buffer Size : 4000 Driver Model Current : N/A Pending : N/A Serial Number : N/A GPU UUID : GPU-13fd0930-06c3-5975-8720-72c72ee7a823 Minor Number : 0 VBIOS Version : 90.02.0B.00.C7 MultiGPU Board : No Board ID : 0x100 GPU Part Number : N/A Module ID : 0 Inforom Version Image Version : G001.0000.02.04 OEM Object : 1.1 ECC Object : N/A Power Management Object : N/A GPU Operation Mode Current : N/A Pending : N/A GSP Firmware Version : N/A GPU Virtualization Mode Virtualization Mode : None Host VGPU Mode : N/A IBMNPU Relaxed Ordering Mode : N/A PCI Bus : 0x01 Device : 0x00 Domain : 0x0000 Device Id : 0x1E0710DE Bus Id : 00000000:01:00.0 Sub System Id : 0x150319DA GPU Link Info PCIe Generation Max : 3 Current : 1 Link Width Max : 16x Current : 8x Bridge Chip Type : N/A Firmware : N/A Replays Since Reset : 0 Replay Number Rollovers : 0 Tx Throughput : 0 KB/s Rx Throughput : 0 KB/s Fan Speed : 0 % Performance State : P8 Clocks Throttle Reasons Idle : Not Active Applications Clocks Setting : Not Active SW Power Cap : Active HW Slowdown : Not Active HW Thermal Slowdown : Not Active HW Power Brake Slowdown : Not Active Sync Boost : Not Active SW Thermal Slowdown : Not Active Display Clock Setting : Not Active FB Memory Usage Total : 11264 MiB Reserved : 244 MiB Used : 1 MiB Free : 11018 MiB BAR1 Memory Usage Total : 256 MiB Used : 3 MiB Free : 253 MiB Compute Mode : Default Utilization Gpu : 0 % Memory : 0 % Encoder : 0 % Decoder : 0 % Encoder Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 FBC Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 Ecc Mode Current : N/A Pending : N/A ECC Errors Volatile SRAM Correctable : N/A SRAM Uncorrectable : N/A DRAM Correctable : N/A DRAM Uncorrectable : N/A Aggregate SRAM Correctable : N/A SRAM Uncorrectable : N/A DRAM Correctable : N/A DRAM Uncorrectable : N/A Retired Pages Single Bit ECC : N/A Double Bit ECC : N/A Pending Page Blacklist : N/A Remapped Rows : N/A Temperature GPU Current Temp : 30 C GPU Shutdown Temp : 94 C GPU Slowdown Temp : 91 C GPU Max Operating Temp : 89 C GPU Target Temperature : 84 C Memory Current Temp : N/A Memory Max Operating Temp : N/A Power Readings Power Management : Supported Power Draw : 20.87 W Power Limit : 260.00 W Default Power Limit : 260.00 W Enforced Power Limit : 260.00 W Min Power Limit : 100.00 W Max Power Limit : 300.00 W Clocks Graphics : 300 MHz SM : 300 MHz Memory : 405 MHz Video : 540 MHz Applications Clocks Graphics : N/A Memory : N/A Default Applications Clocks Graphics : N/A Memory : N/A Max Clocks Graphics : 2160 MHz SM : 2160 MHz Memory : 7000 MHz Video : 1950 MHz Max Customer Boost Clocks Graphics : N/A Clock Policy Auto Boost : N/A Auto Boost Default : N/A Voltage Graphics : N/A Processes : None

GPU 00000000:02:00.0 Product Name : NVIDIA GeForce RTX 2080 Ti Product Brand : GeForce Product Architecture : Turing Display Mode : Disabled Display Active : Disabled Persistence Mode : Disabled MIG Mode Current : N/A Pending : N/A Accounting Mode : Disabled Accounting Mode Buffer Size : 4000 Driver Model Current : N/A Pending : N/A Serial Number : N/A GPU UUID : GPU-a76d37d7-5ed0-58d9-6087-b18fee984570 Minor Number : 1 VBIOS Version : 90.02.17.00.58 MultiGPU Board : No Board ID : 0x200 GPU Part Number : N/A Module ID : 0 Inforom Version Image Version : G001.0000.02.04 OEM Object : 1.1 ECC Object : N/A Power Management Object : N/A GPU Operation Mode Current : N/A Pending : N/A GSP Firmware Version : N/A GPU Virtualization Mode Virtualization Mode : None Host VGPU Mode : N/A IBMNPU Relaxed Ordering Mode : N/A PCI Bus : 0x02 Device : 0x00 Domain : 0x0000 Device Id : 0x1E0710DE Bus Id : 00000000:02:00.0 Sub System Id : 0x150319DA GPU Link Info PCIe Generation Max : 3 Current : 1 Link Width Max : 16x Current : 8x Bridge Chip Type : N/A Firmware : N/A Replays Since Reset : 0 Replay Number Rollovers : 0 Tx Throughput : 0 KB/s Rx Throughput : 0 KB/s Fan Speed : 35 % Performance State : P8 Clocks Throttle Reasons Idle : Active Applications Clocks Setting : Not Active SW Power Cap : Not Active HW Slowdown : Not Active HW Thermal Slowdown : Not Active HW Power Brake Slowdown : Not Active Sync Boost : Not Active SW Thermal Slowdown : Not Active Display Clock Setting : Not Active FB Memory Usage Total : 11264 MiB Reserved : 244 MiB Used : 1 MiB Free : 11018 MiB BAR1 Memory Usage Total : 256 MiB Used : 27 MiB Free : 229 MiB Compute Mode : Default Utilization Gpu : 0 % Memory : 0 % Encoder : 0 % Decoder : 0 % Encoder Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 FBC Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 Ecc Mode Current : N/A Pending : N/A ECC Errors Volatile SRAM Correctable : N/A SRAM Uncorrectable : N/A DRAM Correctable : N/A DRAM Uncorrectable : N/A Aggregate SRAM Correctable : N/A SRAM Uncorrectable : N/A DRAM Correctable : N/A DRAM Uncorrectable : N/A Retired Pages Single Bit ECC : N/A Double Bit ECC : N/A Pending Page Blacklist : N/A Remapped Rows : N/A Temperature GPU Current Temp : 28 C GPU Shutdown Temp : 94 C GPU Slowdown Temp : 91 C GPU Max Operating Temp : 89 C GPU Target Temperature : 84 C Memory Current Temp : N/A Memory Max Operating Temp : N/A Power Readings Power Management : Supported Power Draw : 6.66 W Power Limit : 260.00 W Default Power Limit : 260.00 W Enforced Power Limit : 260.00 W Min Power Limit : 100.00 W Max Power Limit : 300.00 W Clocks Graphics : 300 MHz SM : 300 MHz Memory : 405 MHz Video : 540 MHz Applications Clocks Graphics : N/A Memory : N/A Default Applications Clocks Graphics : N/A Memory : N/A Max Clocks Graphics : 2160 MHz SM : 2160 MHz Memory : 7000 MHz Video : 1950 MHz Max Customer Boost Clocks Graphics : N/A Clock Policy Auto Boost : N/A Auto Boost Default : N/A Voltage Graphics : N/A Processes : None


 - [X] Docker version from `docker version`

Client: Docker Engine - Community Version: 20.10.17 API version: 1.41 Go version: go1.17.11 Git commit: 100c701 Built: Mon Jun 6 23:02:46 2022 OS/Arch: linux/amd64 Context: default Experimental: true

Server: Docker Engine - Community Engine: Version: 20.10.17 API version: 1.41 (minimum version 1.12) Go version: go1.17.11 Git commit: a89b842 Built: Mon Jun 6 23:00:51 2022 OS/Arch: linux/amd64 Experimental: false containerd: Version: 1.6.6 GitCommit: 10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1 runc: Version: 1.1.2 GitCommit: v1.1.2-0-ga916309 docker-init: Version: 0.19.0 GitCommit: de40ad0


 - [X] NVIDIA packages version from `dpkg -l '*nvidia*'` _or_ `rpm -qa '*nvidia*'`

ii libnvidia-cfg1-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA binary OpenGL/GLX configuration library ii libnvidia-common-515 515.48.07-0ubuntu0.22.04.2 all Shared files used by the NVIDIA libraries ii libnvidia-compute-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA libcompute package ii libnvidia-compute-515:i386 515.48.07-0ubuntu0.22.04.2 i386 NVIDIA libcompute package ii libnvidia-container-tools 1.10.0-1 amd64 NVIDIA container runtime library (command-line tools) ii libnvidia-container1:amd64 1.10.0-1 amd64 NVIDIA container runtime library ii libnvidia-decode-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA Video Decoding runtime libraries ii libnvidia-decode-515:i386 515.48.07-0ubuntu0.22.04.2 i386 NVIDIA Video Decoding runtime libraries ii libnvidia-egl-wayland1:amd64 1:1.1.9-1.1 amd64 Wayland EGL External Platform library -- shared library ii libnvidia-encode-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 NVENC Video Encoding runtime library ii libnvidia-encode-515:i386 515.48.07-0ubuntu0.22.04.2 i386 NVENC Video Encoding runtime library ii libnvidia-extra-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 Extra libraries for the NVIDIA driver ii libnvidia-fbc1-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library ii libnvidia-fbc1-515:i386 515.48.07-0ubuntu0.22.04.2 i386 NVIDIA OpenGL-based Framebuffer Capture runtime library ii libnvidia-gl-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD ii libnvidia-gl-515:i386 515.48.07-0ubuntu0.22.04.2 i386 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD ii linux-modules-nvidia-515-5.15.0-43-generic 5.15.0-43.46 amd64 Linux kernel nvidia modules for version 5.15.0-43 ii linux-modules-nvidia-515-generic-hwe-22.04 5.15.0-43.46 amd64 Extra drivers for nvidia-515 for the generic-hwe-22.04 flavour ii linux-objects-nvidia-515-5.15.0-43-generic 5.15.0-43.46 amd64 Linux kernel nvidia modules for version 5.15.0-43 (objects) ii linux-signatures-nvidia-5.15.0-43-generic 5.15.0-43.46 amd64 Linux kernel signatures for nvidia modules for version 5.15.0-43-generic ii nvidia-compute-utils-515 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA compute utilities ii nvidia-container-toolkit 1.10.0-1 amd64 NVIDIA container runtime hook ii nvidia-docker2 2.11.0-1 all nvidia-docker CLI wrapper ii nvidia-driver-515 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA driver metapackage ii nvidia-kernel-common-515 515.48.07-0ubuntu0.22.04.2 amd64 Shared files used with the kernel module ii nvidia-kernel-source-515 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA kernel source package ii nvidia-prime 0.8.17.1 all Tools to enable NVIDIA's Prime ii nvidia-settings 510.47.03-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver ii nvidia-utils-515 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA driver support binaries ii xserver-xorg-video-nvidia-515 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA binary Xorg driver


 - [X] NVIDIA container library version from `nvidia-container-cli -V`

cli-version: 1.10.0 lib-version: 1.10.0 build date: 2022-06-13T10:39+00:00 build revision: 395fd41701117121f1fd04ada01e1d7e006a37ae build compiler: x86_64-linux-gnu-gcc-7 7.5.0 build platform: x86_64 build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

 - [X] Docker command, image and tag used
```bash
docker run --gpus all -it tensorflow/tensorflow:latest-gpu /bin/bash
```

elezar commented 1 year ago

@gaopeiliang as per https://github.com/NVIDIA/nvidia-docker/issues/1671#issuecomment-1420855027, when using systemd cgroup management (and newer systemd versions) it is required to pass the device nodes when launching a container. This is a separate issue from the runc bug that was fixed, for which the /dev/char symlinks were a workaround.
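
(For reference, a minimal sketch of the /dev/char symlink workaround mentioned above. The nvidia-ctk subcommand is only available in recent NVIDIA Container Toolkit releases, so check nvidia-ctk --version before relying on it.)

```bash
# Create /dev/char/<major>:<minor> symlinks for the NVIDIA device nodes so that
# systemd can resolve them when it re-evaluates device cgroup rules on a daemon-reload.
sudo nvidia-ctk system create-dev-char-symlinks --create-all
ls -l /dev/char | grep nvidia   # sanity check: symlinks should now point at the nvidia nodes
```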

arnaldo2792 commented 1 year ago

Hey @klueska, is pass-device-specs still required even after using the udev rule with nvidia-ctk? Or can I just use nvidia-ctk without setting pass-device-specs in the k8s device plugin?

klueska commented 1 year ago

Yes. It is still needed. The fix ensures that device access is not lost even when you use pass-device-specs.

pomodorox commented 1 year ago

With systemd cgroup management you must always pass the nvidia device nodes on the docker command line (which you are not doing).

Meaning you would need to run:

```bash
docker run -d \
  --restart unless-stopped \
  --name nvidia-smi-rest \
  --gpus 'all,"capabilities=utility"' \
  --device /dev/nvidiactl \
  --device /dev/nvidia0 \
  ...
  --cpus 1 \
  --memory 1g \
  --memory-swap 1.5g \
  mbentley/nvidia-smi-rest
```

This is due to the way GPU injection currently happens from within a runc hook when the --gpus flag is used. The hook manually sets up the cgroups for the NVIDIA devices behind the back of docker/containerd/runc -- so when a systemd daemon-reload happens, the cgroup access for these devices gets undone (because these runtimes had no way of telling systemd that these devices had been injected by the hook, and the reload triggers it to reevaluate all cgroup rules).

This issue only started to be noticed by most people recently because the latest release of docker flipped to using systemd cgroup management by default (as opposed to cgroupfs).

The good news is, once CDI support is added to docker, this won't be necessary anymore. docker/cli#3864

Hi @klueska @elezar, what is the suggested equivalent of the docker --device flags for Kubernetes GPU pods using containerd?

I added pass-device-specs and created the symlinks, but it didn't work for me. I am not sure how we can pass --device in a Pod spec. Does this mean that this is an acknowledged issue for Kubernetes GPU workloads using the systemd cgroup driver?

Update: tried runc 1.1.7 with systemd 245, but it didn't solve the issue.
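
(For context, the daemon-reload failure mode quoted above can usually be reproduced on a plain Docker host with a sketch like the following; it assumes the systemd cgroup driver, no explicit --device flags, and uses an illustrative container name.)

```bash
# Start a GPU container and keep it alive.
docker run -d --name gpu-test --gpus all tensorflow/tensorflow:latest-gpu sleep infinity
docker exec gpu-test nvidia-smi    # initially reports the GPUs
sudo systemctl daemon-reload       # systemd re-evaluates the device cgroup rules
docker exec gpu-test nvidia-smi    # typically now fails with "Failed to initialize NVML: Unknown Error"
```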

gaopeiliang commented 1 year ago

@gaopeiliang as per #1671 (comment), when using systemd cgroup management (and newer systemd versions) it is required to pass the device nodes when launching a container. This is a separate issue from the runc bug that was fixed, for which the /dev/char symlinks were a workaround.

Our k8s cluster uses an old gpu-device-plugin that does not support pass-device-specs, so we will need to test that.

Another question: what does the no-cgroups = <bool> option in the config file /etc/nvidia-container-runtime/config.toml mean? Is there any spec or link about it?

Can we use pass-device-specs + no-cgroups = true + systemd to avoid the device manager problem? @klueska @elezar

elezar commented 1 year ago

Our k8s cluster uses an old gpu-device-plugin that does not support pass-device-specs, so we will need to test that.

Which version are you using?

The no-cgroups option is used to control whether the NVIDIA Container Library should update the cgroups for a container to allow access to a device. For the rootless case, where a user does not have permissions to manage cgroups, this must be disabled. I don't have enough experience to know whether your proposed combination would work as expected.
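
(A minimal sketch of where this option lives, assuming the default toolkit config layout; the sed pattern is illustrative, so check the file before editing it.)

```bash
# The option sits in the [nvidia-container-cli] section of the toolkit config.
grep -n "no-cgroups" /etc/nvidia-container-runtime/config.toml
# For rootless setups, flip it to true (the default line may be commented out with '#').
sudo sed -i 's/^#\?no-cgroups = .*/no-cgroups = true/' /etc/nvidia-container-runtime/config.toml
```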

elezar commented 1 year ago

@didovesei, with regard to:

I added pass-device-specs and created the symlinks, but it didn't work for me. I am not sure how we can pass --device in a Pod spec. Does this mean that this is an acknowledged issue for Kubernetes GPU workloads using the systemd cgroup driver?

How did you add the pass-device-specs option? This is an option typically set as an environment variable for the GPU device plugin. Which version of the plugin are you using?
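
(For illustration only: on a manifest-deployed plugin this is typically an environment variable on the DaemonSet. The namespace and DaemonSet name below are assumptions based on the standard static deployment, so adjust them to your cluster.)

```bash
# Set PASS_DEVICE_SPECS on the device plugin DaemonSet and wait for the rollout.
kubectl -n kube-system set env daemonset/nvidia-device-plugin-daemonset PASS_DEVICE_SPECS=true
kubectl -n kube-system rollout status daemonset/nvidia-device-plugin-daemonset
```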

pomodorox commented 1 year ago

@didovesei, with regard to:

I added pass-device-specs and created the symlinks, but it didn't work for me. I am not sure how we can pass --device in a Pod spec. Does this mean that this is an acknowledged issue for Kubernetes GPU workloads using the systemd cgroup driver?

How did you add the pass-device-specs option? This is an option typically set as an environment variable for the GPU device plugin. Which version of the plugin are you using?

Hi @elezar, I was using device plugin v0.10.0 + containerd 1.6.0 + systemd 245 + runc 1.1.7. I passed pass-device-specs in the device plugin args:

```yaml
  containers:
  - args:
    - --fail-on-init-error=false
    - --mig-strategy=mixed
    - --pass-device-specs=true
```

I think the flag was taking effect (although not fixing the problem), since when I now run nvidia-smi in the GPU Pod after a daemon-reload it shows the message below instead of the NVML error.

```
root@gpu:/# nvidia-smi
No devices were found
```

I might have been a bit unclear in my last comment, but my real point is that @klueska's comment mentioned that:

Note: this does not address the issue where you still need to explicitly pass the device nodes for /dev/nvidia0, /dev/nvidia1, /dev/nvidiactl on the command line (that won’t be fixed until CDI support is added to docker).

This fixes the issue where — even if you do explicitly pass the device nodes — you STILL lose access to the GPUs on a systemctl daemon reload.

AFAIU, however, in the K8s context the devices should be passed into the Pod through the device plugin, so we shouldn't expect the user to explicitly pass /dev entries into the Pod. Besides, I am not sure there is an equivalent of the docker --device flag in a K8s Pod spec. So, given all of the above, does this mean that this is an acknowledged limitation of the NVIDIA K8s solution for a certain combination of configurations (like containerd + systemd + cgroup v1)?

elezar commented 1 year ago

@didovesei was the plugin running as a privileged container? This is required to pass the device nodes.

pomodorox commented 1 year ago

@didovesei was the plugin running as a privileged container? This is required to pass the device nodes.

@elezar It's not in privileged mode. I have been using a config similar to this one for the DP.

Is privileged mode a requirement specific to this issue, or Nvidia suggests using it for the DP in general?

elezar commented 1 year ago

See https://github.com/NVIDIA/k8s-device-plugin#setting-other-helm-chart-values (which needs an update to discuss these options and setting up privileged mode). Privileged mode is required when passing the device specs so that the device plugin can see all the required device nodes. Otherwise it would not have the required access (even though this is also provided by the NVIDIA Container Toolkit).
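
(A quick way to check both conditions on a running cluster; the namespace and label selector are assumptions based on the standard static deployment.)

```bash
# Is the plugin container privileged?
kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds \
  -o jsonpath='{.items[0].spec.containers[0].securityContext}'
# Can it see the device nodes it is expected to pass along?
kubectl -n kube-system exec daemonset/nvidia-device-plugin-ds -- ls -l /dev/nvidiactl /dev/nvidia0
```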

pomodorox commented 1 year ago

See https://github.com/NVIDIA/k8s-device-plugin#setting-other-helm-chart-values (which needs an update to discuss these options and setting up privileged mode). Privileged mode is required when passing the device specs so that the device plugin can see all the required device nodes. Otherwise it would not have the required access (even though this is also provided by the NVIDIA Container Toolkit).

Using privileged mode for the DP didn't work, but using privileged mode for the user workload Pod did. Also, it seems that as long as the user workload Pod is privileged there aren't any problems -- the DP doesn't need to be privileged, and no symlinks for the char devices need to be created.

klueska commented 1 year ago

That is true, but most users don't want to run their user pods as privileged (and they shouldn't have to if everything else is set up properly).

gaopeiliang commented 1 year ago

Our k8s cluster uses an old gpu-device-plugin that does not support pass-device-specs, so we will need to test that.

Which version are you using?

The no-cgroups option is used to control whether the NVIDIA Container Library should update the cgroups for a container to allow access to a device. For the rootless case, where a user does not have permissions to manage cgroups, this must be disabled. I don't have enough experience to know whether your proposed combination would work as expected.

device-plugin version 1.0.0-beta

runc also writes the device cgroup fs when it is given a device list, so in my testing pass-device-specs + no-cgroups = true always worked.

breakingflower commented 1 year ago

Some relevant comments & solutions from @cdesiniotis at nvidia on the matter: https://github.com/NVIDIA/nvidia-docker/issues/1730

punkerpunker commented 1 year ago

Some relevant comments & solutions from @cdesiniotis at nvidia on the matter: NVIDIA/nvidia-docker#1730

Thanks @breakingflower, that's very useful.

FYI: From the Notice:

Deploying GPU Operator 22.9.2 will automatically fix the issue on all K8s nodes of the cluster (the fix is integrated inside the validator pod which will run when a new node is deployed or at every reboot of the node).

It does sound very promising, but unfortunately it didn't solve the issue on its own.

I can confirm that using the new version of GPU Operator resolves the issue when CDI is enabled in gpu-operator config:

```yaml
  cdi:
    enabled: true
    default: true
```
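
(For reference, the same setting expressed as Helm values when installing or upgrading the operator; the chart name, repo URL, and namespace below are assumptions based on NVIDIA's standard Helm repo, so verify them against the GPU Operator docs.)

```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm upgrade -i gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set cdi.enabled=true \
  --set cdi.default=true
```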

However, I am facing an issue where the nvidia-container-toolkit-daemonset couldn't start properly after a reboot of the machine:

```
  Warning  Failed          4m34s (x4 over 6m10s)  kubelet          Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: could not apply required modification to OCI specification: error modifying OCI spec: failed to inject CDI devices: unresolvable CDI devices management.nvidia.com/gpu=all: unknown
```

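(On hosts managed outside the GPU Operator, an "unresolvable CDI devices" error usually means no CDI spec is present on the node; recent NVIDIA Container Toolkit versions can generate one with nvidia-ctk. This is a sketch to adapt, not the operator's own fix, and it assumes the toolkit's cdi subcommands are available in your installed version.)

```bash
# Generate a CDI specification describing the GPUs on this host.
sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
# List the device names the runtime can now resolve.
nvidia-ctk cdi list
```
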
zhanwenchen commented 1 year ago

Any update on this?

klueska commented 1 year ago

Please see this notice from February: https://github.com/NVIDIA/nvidia-docker/issues/1730

leemingeer commented 1 year ago

@klueska

Please see this notice from February: NVIDIA/nvidia-docker#1730

I have read it in detail. Could you explain how to get the correct {{NVIDIA_DRIVER_ROOT}} in cases where the driver container is also in use? I am not clear on this; the default value in nvidia-ctk is /.

pcanas commented 1 year ago

Is there any timeline for a solution besides the workarounds described in NVIDIA/nvidia-docker#1730?

rogelioamancisidor commented 1 year ago

I tried the suggested approach in #6380, but it didn't solve the problem. It is quite frustrating as I cannot rely on AKS at the moment. I hope this issue is solved soon.

klueska commented 1 year ago

@rogelioamancisidor we've heard that AKS ships with a really old version of the k8s-device-plugin (from 2019!) which doesn't support the PASS_DEVICE_SPECS flag. You will need to update the plugin to a newer one and pass this flag for things to work on AKS.

rogelioamancisidor commented 1 year ago

@klueska Here is the plugin that was suggested to me in the other discussion. Do you have a link to a newer k8s-device-plugin? I'd really appreciate it, as I have tried different things without any luck.

elezar commented 1 year ago

@klueska Here is the plugin that was suggested to me in the other discussion, and I just noticed, as you mentioned, that the plugin dates from 2019. Do you have a link to a newer k8s-device-plugin? I'd really appreciate it, as I have tried different things without any luck.

The plugin is available here: https://github.com/NVIDIA/k8s-device-plugin. The README covers a variety of deployment options; Helm is the recommended one.

The latest version of the plugin is v0.14.1.
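
(A minimal Helm sketch for that version. The repo URL matches the plugin README; the compatWithCPUManager value, which enables passing device specs and runs the plugin privileged, is an assumption to verify against the chart documentation for your version.)

```bash
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin && helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin --create-namespace \
  --version 0.14.1 \
  --set compatWithCPUManager=true
```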

rogelioamancisidor commented 1 year ago

I deployed a DaemonSet for the NVIDIA device plugin using the yaml manifest in the link that I posted. The manifest in the link includes this line: image: nvcr.io/nvidia/k8s-device-plugin:v0.14.1. Isn't that manifest deploying the latest version then? PASS_DEVICE_SPECS is also set to true, as suggested by AKS.

homjay commented 1 year ago

Here is the official solution:

https://github.com/NVIDIA/nvidia-docker/issues/1730#issue-1573551271

modify /etc/docker/docker.json

```json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    },
    "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
```

It is working.
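
(For completeness: the daemon has to be restarted for the cgroup-driver change to take effect; the grep is just a sanity check.)

```bash
sudo systemctl restart docker
docker info | grep -i "cgroup driver"   # should now report cgroupfs
```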

YochayTzur commented 1 year ago

modify /etc/docker/docker.json

Isn't it /etc/docker/daemon.json?

rogelioamancisidor commented 12 months ago

@homjay I don't think that solution works on K8s.

elezar commented 11 months ago

This is the issue described in https://github.com/NVIDIA/nvidia-container-toolkit/issues/48.

Since this issue has a number of different failure modes discussed, I'm going to close this issue and ask that those still having a problem open new issues in the respective repositories.

We are looking to migrate all issues in this repo to https://github.com/NVIDIA/nvidia-container-toolkit in the near term.