NVIDIA / nvidia-docker

Build and run Docker containers leveraging NVIDIA GPUs
Apache License 2.0

"Failed to initialize NVML: Unknown Error" after random amount of time #1671

Closed · iFede94 closed this issue 10 months ago

iFede94 commented 2 years ago

1. Issue or feature description

After a random amount of time (it can be hours or days), the GPUs become unavailable inside all running containers and `nvidia-smi` returns "Failed to initialize NVML: Unknown Error". Restarting all the containers fixes the issue and the GPUs become available again. Outside the containers the GPUs continue to work correctly. I searched the open/closed issues but could not find a solution.
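The restart workaround described above can be scripted as a minimal watchdog sketch. The container name `gpu-job` and the report-only behaviour are my assumptions, not part of the original report; the actual fix reported here is simply `docker restart <container>`.

```shell
#!/usr/bin/env bash
# Watchdog sketch: probe NVML inside a running container and report when it
# has broken. "gpu-job" is a hypothetical placeholder container name.
CONTAINER="${1:-gpu-job}"

if ! command -v docker >/dev/null 2>&1; then
    msg="docker CLI not found; skipping check"
elif docker exec "$CONTAINER" nvidia-smi -L >/dev/null 2>&1; then
    msg="NVML OK in $CONTAINER"
else
    # This is the failure mode described in this issue; a container
    # restart (docker restart "$CONTAINER") restores GPU access.
    msg="NVML broken in $CONTAINER"
fi
echo "$msg"
```

Run from cron or a systemd timer if you want periodic checking.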

2. Steps to reproduce the issue

All the containers are started with `docker run --gpus all -it tensorflow/tensorflow:latest-gpu /bin/bash`.

3. Information to attach

I0831 10:36:45.129762 2174149 nvc.c:376] initializing library context (version=1.10.0, build=395fd41701117121f1fd04ada01e1d7e006a37ae) I0831 10:36:45.129878 2174149 nvc.c:350] using root / I0831 10:36:45.129892 2174149 nvc.c:351] using ldcache /etc/ld.so.cache I0831 10:36:45.129906 2174149 nvc.c:352] using unprivileged user 1000:1000 I0831 10:36:45.129960 2174149 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL) I0831 10:36:45.130411 2174149 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment W0831 10:36:45.132458 2174150 nvc.c:273] failed to set inheritable capabilities W0831 10:36:45.132555 2174150 nvc.c:274] skipping kernel modules load due to failure I0831 10:36:45.133242 2174151 rpc.c:71] starting driver rpc service I0831 10:36:45.141625 2174152 rpc.c:71] starting nvcgo rpc service I0831 10:36:45.144941 2174149 nvc_info.c:766] requesting driver information with '' I0831 10:36:45.146226 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.515.48.07 I0831 10:36:45.146379 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.515.48.07 I0831 10:36:45.146563 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.515.48.07 I0831 10:36:45.146792 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.515.48.07 I0831 10:36:45.146986 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.515.48.07 I0831 10:36:45.147178 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.515.48.07 I0831 10:36:45.147375 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.515.48.07 I0831 10:36:45.147400 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.515.48.07 I0831 10:36:45.147598 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.515.48.07 I0831 10:36:45.147777 
2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.515.48.07 I0831 10:36:45.147986 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.515.48.07 I0831 10:36:45.148258 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.515.48.07 I0831 10:36:45.148506 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.515.48.07 I0831 10:36:45.148699 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.515.48.07 I0831 10:36:45.148915 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.515.48.07 I0831 10:36:45.148942 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.515.48.07 I0831 10:36:45.149219 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.515.48.07 I0831 10:36:45.149467 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.515.48.07 I0831 10:36:45.149591 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.515.48.07 I0831 10:36:45.149814 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.515.48.07 I0831 10:36:45.149996 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.515.48.07 I0831 10:36:45.150224 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.515.48.07 I0831 10:36:45.150437 2174149 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.515.48.07 I0831 10:36:45.150772 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-tls.so.515.48.07 I0831 10:36:45.150978 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ptxjitcompiler.so.515.48.07 I0831 10:36:45.151147 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-opticalflow.so.515.48.07 I0831 10:36:45.151335 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-opencl.so.515.48.07 I0831 10:36:45.151592 
2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ml.so.515.48.07 I0831 10:36:45.151786 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glvkspirv.so.515.48.07 I0831 10:36:45.151970 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glsi.so.515.48.07 I0831 10:36:45.152225 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glcore.so.515.48.07 I0831 10:36:45.152480 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-fbc.so.515.48.07 I0831 10:36:45.152791 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-encode.so.515.48.07 I0831 10:36:45.152999 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-eglcore.so.515.48.07 I0831 10:36:45.153254 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-compiler.so.515.48.07 I0831 10:36:45.153580 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvcuvid.so.515.48.07 I0831 10:36:45.153853 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libcuda.so.515.48.07 I0831 10:36:45.154063 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLX_nvidia.so.515.48.07 I0831 10:36:45.154259 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLESv2_nvidia.so.515.48.07 I0831 10:36:45.154473 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLESv1_CM_nvidia.so.515.48.07 I0831 10:36:45.154696 2174149 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libEGL_nvidia.so.515.48.07 W0831 10:36:45.154723 2174149 nvc_info.c:399] missing library libnvidia-nscq.so W0831 10:36:45.154726 2174149 nvc_info.c:399] missing library libcudadebugger.so W0831 10:36:45.154729 2174149 nvc_info.c:399] missing library libnvidia-fatbinaryloader.so W0831 10:36:45.154731 2174149 nvc_info.c:399] missing library libnvidia-pkcs11.so W0831 10:36:45.154733 2174149 nvc_info.c:399] missing library libvdpau_nvidia.so W0831 10:36:45.154735 2174149 nvc_info.c:399] missing library libnvidia-ifr.so W0831 
10:36:45.154737 2174149 nvc_info.c:399] missing library libnvidia-cbl.so W0831 10:36:45.154739 2174149 nvc_info.c:403] missing compat32 library libnvidia-cfg.so W0831 10:36:45.154741 2174149 nvc_info.c:403] missing compat32 library libnvidia-nscq.so W0831 10:36:45.154743 2174149 nvc_info.c:403] missing compat32 library libcudadebugger.so W0831 10:36:45.154746 2174149 nvc_info.c:403] missing compat32 library libnvidia-fatbinaryloader.so W0831 10:36:45.154748 2174149 nvc_info.c:403] missing compat32 library libnvidia-allocator.so W0831 10:36:45.154750 2174149 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so W0831 10:36:45.154752 2174149 nvc_info.c:403] missing compat32 library libnvidia-ngx.so W0831 10:36:45.154754 2174149 nvc_info.c:403] missing compat32 library libvdpau_nvidia.so W0831 10:36:45.154756 2174149 nvc_info.c:403] missing compat32 library libnvidia-ifr.so W0831 10:36:45.154758 2174149 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so W0831 10:36:45.154760 2174149 nvc_info.c:403] missing compat32 library libnvoptix.so W0831 10:36:45.154762 2174149 nvc_info.c:403] missing compat32 library libnvidia-cbl.so I0831 10:36:45.154919 2174149 nvc_info.c:299] selecting /usr/bin/nvidia-smi I0831 10:36:45.154945 2174149 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump I0831 10:36:45.154954 2174149 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced I0831 10:36:45.154970 2174149 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control I0831 10:36:45.154980 2174149 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-server W0831 10:36:45.155027 2174149 nvc_info.c:425] missing binary nv-fabricmanager I0831 10:36:45.155044 2174149 nvc_info.c:343] listing firmware path /usr/lib/firmware/nvidia/515.48.07/gsp.bin I0831 10:36:45.155058 2174149 nvc_info.c:529] listing device /dev/nvidiactl I0831 10:36:45.155061 2174149 nvc_info.c:529] listing device /dev/nvidia-uvm I0831 10:36:45.155063 2174149 nvc_info.c:529] listing device /dev/nvidia-uvm-tools 
I0831 10:36:45.155065 2174149 nvc_info.c:529] listing device /dev/nvidia-modeset I0831 10:36:45.155080 2174149 nvc_info.c:343] listing ipc path /run/nvidia-persistenced/socket W0831 10:36:45.155092 2174149 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket W0831 10:36:45.155100 2174149 nvc_info.c:349] missing ipc path /tmp/nvidia-mps I0831 10:36:45.155102 2174149 nvc_info.c:822] requesting device information with '' I0831 10:36:45.161039 2174149 nvc_info.c:713] listing device /dev/nvidia0 (GPU-13fd0930-06c3-5975-8720-72c72ee7a823 at 00000000:01:00.0) I0831 10:36:45.166471 2174149 nvc_info.c:713] listing device /dev/nvidia1 (GPU-a76d37d7-5ed0-58d9-6087-b18fee984570 at 00000000:02:00.0) NVRM version: 515.48.07 CUDA version: 11.7

Device Index: 0 Device Minor: 0 Model: NVIDIA GeForce RTX 2080 Ti Brand: GeForce GPU UUID: GPU-13fd0930-06c3-5975-8720-72c72ee7a823 Bus Location: 00000000:01:00.0 Architecture: 7.5

Device Index: 1 Device Minor: 1 Model: NVIDIA GeForce RTX 2080 Ti Brand: GeForce GPU UUID: GPU-a76d37d7-5ed0-58d9-6087-b18fee984570 Bus Location: 00000000:02:00.0 Architecture: 7.5 I0831 10:36:45.166493 2174149 nvc.c:434] shutting down library context I0831 10:36:45.166540 2174152 rpc.c:95] terminating nvcgo rpc service I0831 10:36:45.166751 2174149 rpc.c:135] nvcgo rpc service terminated successfully I0831 10:36:45.167790 2174151 rpc.c:95] terminating driver rpc service I0831 10:36:45.167907 2174149 rpc.c:135] driver rpc service terminated successfully


 - [X] Kernel version from `uname -a`

`Linux wds-co-ml 5.15.0-43-generic #46-Ubuntu SMP Tue Jul 12 10:30:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux`


 - [X] Driver information from `nvidia-smi -a`

==============NVSMI LOG==============

Timestamp : Wed Aug 31 12:42:55 2022
Driver Version : 515.48.07
CUDA Version : 11.7

Attached GPUs : 2 GPU 00000000:01:00.0 Product Name : NVIDIA GeForce RTX 2080 Ti Product Brand : GeForce Product Architecture : Turing Display Mode : Disabled Display Active : Disabled Persistence Mode : Disabled MIG Mode Current : N/A Pending : N/A Accounting Mode : Disabled Accounting Mode Buffer Size : 4000 Driver Model Current : N/A Pending : N/A Serial Number : N/A GPU UUID : GPU-13fd0930-06c3-5975-8720-72c72ee7a823 Minor Number : 0 VBIOS Version : 90.02.0B.00.C7 MultiGPU Board : No Board ID : 0x100 GPU Part Number : N/A Module ID : 0 Inforom Version Image Version : G001.0000.02.04 OEM Object : 1.1 ECC Object : N/A Power Management Object : N/A GPU Operation Mode Current : N/A Pending : N/A GSP Firmware Version : N/A GPU Virtualization Mode Virtualization Mode : None Host VGPU Mode : N/A IBMNPU Relaxed Ordering Mode : N/A PCI Bus : 0x01 Device : 0x00 Domain : 0x0000 Device Id : 0x1E0710DE Bus Id : 00000000:01:00.0 Sub System Id : 0x150319DA GPU Link Info PCIe Generation Max : 3 Current : 1 Link Width Max : 16x Current : 8x Bridge Chip Type : N/A Firmware : N/A Replays Since Reset : 0 Replay Number Rollovers : 0 Tx Throughput : 0 KB/s Rx Throughput : 0 KB/s Fan Speed : 0 % Performance State : P8 Clocks Throttle Reasons Idle : Not Active Applications Clocks Setting : Not Active SW Power Cap : Active HW Slowdown : Not Active HW Thermal Slowdown : Not Active HW Power Brake Slowdown : Not Active Sync Boost : Not Active SW Thermal Slowdown : Not Active Display Clock Setting : Not Active FB Memory Usage Total : 11264 MiB Reserved : 244 MiB Used : 1 MiB Free : 11018 MiB BAR1 Memory Usage Total : 256 MiB Used : 3 MiB Free : 253 MiB Compute Mode : Default Utilization Gpu : 0 % Memory : 0 % Encoder : 0 % Decoder : 0 % Encoder Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 FBC Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 Ecc Mode Current : N/A Pending : N/A ECC Errors Volatile SRAM Correctable : N/A SRAM Uncorrectable : N/A DRAM Correctable 
: N/A DRAM Uncorrectable : N/A Aggregate SRAM Correctable : N/A SRAM Uncorrectable : N/A DRAM Correctable : N/A DRAM Uncorrectable : N/A Retired Pages Single Bit ECC : N/A Double Bit ECC : N/A Pending Page Blacklist : N/A Remapped Rows : N/A Temperature GPU Current Temp : 30 C GPU Shutdown Temp : 94 C GPU Slowdown Temp : 91 C GPU Max Operating Temp : 89 C GPU Target Temperature : 84 C Memory Current Temp : N/A Memory Max Operating Temp : N/A Power Readings Power Management : Supported Power Draw : 20.87 W Power Limit : 260.00 W Default Power Limit : 260.00 W Enforced Power Limit : 260.00 W Min Power Limit : 100.00 W Max Power Limit : 300.00 W Clocks Graphics : 300 MHz SM : 300 MHz Memory : 405 MHz Video : 540 MHz Applications Clocks Graphics : N/A Memory : N/A Default Applications Clocks Graphics : N/A Memory : N/A Max Clocks Graphics : 2160 MHz SM : 2160 MHz Memory : 7000 MHz Video : 1950 MHz Max Customer Boost Clocks Graphics : N/A Clock Policy Auto Boost : N/A Auto Boost Default : N/A Voltage Graphics : N/A Processes : None

GPU 00000000:02:00.0 Product Name : NVIDIA GeForce RTX 2080 Ti Product Brand : GeForce Product Architecture : Turing Display Mode : Disabled Display Active : Disabled Persistence Mode : Disabled MIG Mode Current : N/A Pending : N/A Accounting Mode : Disabled Accounting Mode Buffer Size : 4000 Driver Model Current : N/A Pending : N/A Serial Number : N/A GPU UUID : GPU-a76d37d7-5ed0-58d9-6087-b18fee984570 Minor Number : 1 VBIOS Version : 90.02.17.00.58 MultiGPU Board : No Board ID : 0x200 GPU Part Number : N/A Module ID : 0 Inforom Version Image Version : G001.0000.02.04 OEM Object : 1.1 ECC Object : N/A Power Management Object : N/A GPU Operation Mode Current : N/A Pending : N/A GSP Firmware Version : N/A GPU Virtualization Mode Virtualization Mode : None Host VGPU Mode : N/A IBMNPU Relaxed Ordering Mode : N/A PCI Bus : 0x02 Device : 0x00 Domain : 0x0000 Device Id : 0x1E0710DE Bus Id : 00000000:02:00.0 Sub System Id : 0x150319DA GPU Link Info PCIe Generation Max : 3 Current : 1 Link Width Max : 16x Current : 8x Bridge Chip Type : N/A Firmware : N/A Replays Since Reset : 0 Replay Number Rollovers : 0 Tx Throughput : 0 KB/s Rx Throughput : 0 KB/s Fan Speed : 35 % Performance State : P8 Clocks Throttle Reasons Idle : Active Applications Clocks Setting : Not Active SW Power Cap : Not Active HW Slowdown : Not Active HW Thermal Slowdown : Not Active HW Power Brake Slowdown : Not Active Sync Boost : Not Active SW Thermal Slowdown : Not Active Display Clock Setting : Not Active FB Memory Usage Total : 11264 MiB Reserved : 244 MiB Used : 1 MiB Free : 11018 MiB BAR1 Memory Usage Total : 256 MiB Used : 27 MiB Free : 229 MiB Compute Mode : Default Utilization Gpu : 0 % Memory : 0 % Encoder : 0 % Decoder : 0 % Encoder Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 FBC Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 Ecc Mode Current : N/A Pending : N/A ECC Errors Volatile SRAM Correctable : N/A SRAM Uncorrectable : N/A DRAM Correctable : N/A DRAM 
Uncorrectable : N/A Aggregate SRAM Correctable : N/A SRAM Uncorrectable : N/A DRAM Correctable : N/A DRAM Uncorrectable : N/A Retired Pages Single Bit ECC : N/A Double Bit ECC : N/A Pending Page Blacklist : N/A Remapped Rows : N/A Temperature GPU Current Temp : 28 C GPU Shutdown Temp : 94 C GPU Slowdown Temp : 91 C GPU Max Operating Temp : 89 C GPU Target Temperature : 84 C Memory Current Temp : N/A Memory Max Operating Temp : N/A Power Readings Power Management : Supported Power Draw : 6.66 W Power Limit : 260.00 W Default Power Limit : 260.00 W Enforced Power Limit : 260.00 W Min Power Limit : 100.00 W Max Power Limit : 300.00 W Clocks Graphics : 300 MHz SM : 300 MHz Memory : 405 MHz Video : 540 MHz Applications Clocks Graphics : N/A Memory : N/A Default Applications Clocks Graphics : N/A Memory : N/A Max Clocks Graphics : 2160 MHz SM : 2160 MHz Memory : 7000 MHz Video : 1950 MHz Max Customer Boost Clocks Graphics : N/A Clock Policy Auto Boost : N/A Auto Boost Default : N/A Voltage Graphics : N/A Processes : None


 - [X] Docker version from `docker version`

Client: Docker Engine - Community Version: 20.10.17 API version: 1.41 Go version: go1.17.11 Git commit: 100c701 Built: Mon Jun 6 23:02:46 2022 OS/Arch: linux/amd64 Context: default Experimental: true

Server: Docker Engine - Community Engine: Version: 20.10.17 API version: 1.41 (minimum version 1.12) Go version: go1.17.11 Git commit: a89b842 Built: Mon Jun 6 23:00:51 2022 OS/Arch: linux/amd64 Experimental: false containerd: Version: 1.6.6 GitCommit: 10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1 runc: Version: 1.1.2 GitCommit: v1.1.2-0-ga916309 docker-init: Version: 0.19.0 GitCommit: de40ad0


 - [X] NVIDIA packages version from `dpkg -l '*nvidia*'` _or_ `rpm -qa '*nvidia*'`

ii libnvidia-cfg1-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA binary OpenGL/GLX configuration library ii libnvidia-common-515 515.48.07-0ubuntu0.22.04.2 all Shared files used by the NVIDIA libraries ii libnvidia-compute-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA libcompute package ii libnvidia-compute-515:i386 515.48.07-0ubuntu0.22.04.2 i386 NVIDIA libcompute package ii libnvidia-container-tools 1.10.0-1 amd64 NVIDIA container runtime library (command-line tools) ii libnvidia-container1:amd64 1.10.0-1 amd64 NVIDIA container runtime library ii libnvidia-decode-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA Video Decoding runtime libraries ii libnvidia-decode-515:i386 515.48.07-0ubuntu0.22.04.2 i386 NVIDIA Video Decoding runtime libraries ii libnvidia-egl-wayland1:amd64 1:1.1.9-1.1 amd64 Wayland EGL External Platform library -- shared library ii libnvidia-encode-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 NVENC Video Encoding runtime library ii libnvidia-encode-515:i386 515.48.07-0ubuntu0.22.04.2 i386 NVENC Video Encoding runtime library ii libnvidia-extra-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 Extra libraries for the NVIDIA driver ii libnvidia-fbc1-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library ii libnvidia-fbc1-515:i386 515.48.07-0ubuntu0.22.04.2 i386 NVIDIA OpenGL-based Framebuffer Capture runtime library ii libnvidia-gl-515:amd64 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD ii libnvidia-gl-515:i386 515.48.07-0ubuntu0.22.04.2 i386 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD ii linux-modules-nvidia-515-5.15.0-43-generic 5.15.0-43.46 amd64 Linux kernel nvidia modules for version 5.15.0-43 ii linux-modules-nvidia-515-generic-hwe-22.04 5.15.0-43.46 amd64 Extra drivers for nvidia-515 for the generic-hwe-22.04 flavour ii linux-objects-nvidia-515-5.15.0-43-generic 5.15.0-43.46 amd64 Linux kernel nvidia modules for version 5.15.0-43 (objects) 
ii linux-signatures-nvidia-5.15.0-43-generic 5.15.0-43.46 amd64 Linux kernel signatures for nvidia modules for version 5.15.0-43-generic ii nvidia-compute-utils-515 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA compute utilities ii nvidia-container-toolkit 1.10.0-1 amd64 NVIDIA container runtime hook ii nvidia-docker2 2.11.0-1 all nvidia-docker CLI wrapper ii nvidia-driver-515 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA driver metapackage ii nvidia-kernel-common-515 515.48.07-0ubuntu0.22.04.2 amd64 Shared files used with the kernel module ii nvidia-kernel-source-515 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA kernel source package ii nvidia-prime 0.8.17.1 all Tools to enable NVIDIA's Prime ii nvidia-settings 510.47.03-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver ii nvidia-utils-515 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA driver support binaries ii xserver-xorg-video-nvidia-515 515.48.07-0ubuntu0.22.04.2 amd64 NVIDIA binary Xorg driver


 - [X] NVIDIA container library version from `nvidia-container-cli -V`

cli-version: 1.10.0 lib-version: 1.10.0 build date: 2022-06-13T10:39+00:00 build revision: 395fd41701117121f1fd04ada01e1d7e006a37ae build compiler: x86_64-linux-gnu-gcc-7 7.5.0 build platform: x86_64 build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

 - [X] Docker command, image and tag used
```bash
docker run --gpus all -it tensorflow/tensorflow:latest-gpu /bin/bash
```

elezar commented 2 years ago

The `nvidia-smi` output shows persistence mode as disabled. Does the behaviour still occur when it is enabled?
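For reference, a guarded sketch of enabling and verifying persistence mode (changing the mode requires root, and `nvidia-persistenced` is the daemon-based way to keep it enabled across reboots):

```shell
# Sketch: enable persistence mode on all GPUs and verify it. Guarded so it
# is a no-op on machines without the NVIDIA driver tools installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -pm 1 || echo "changing persistence mode requires root"
    nvidia-smi --query-gpu=persistence_mode --format=csv
else
    echo "nvidia-smi not installed; nothing to check"
fi
```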

kevin-bockman commented 2 years ago

Hey, I have the same problem.

2. Steps to reproduce the issue

```
docker run --gpus all --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 bash
root@098b49afe624:/# nvidia-smi
Fri Sep  2 21:54:31 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.68.02    Driver Version: 510.68.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
```

This works until `systemctl daemon-reload` is run, either manually or automatically by the OS (which I assume happens, since it eventually fails on its own).

(on host): `systemctl daemon-reload`

(inside same running container):

```
root@098b49afe624:/# nvidia-smi
Failed to initialize NVML: Unknown Error
```

Running the container again works fine until the next `systemctl daemon-reload`.
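The reproduction steps above, collected into one script as a sketch. The container name `nvml-repro` is my own placeholder, and the guard means it only attempts the repro on a host that actually has docker and the NVIDIA driver:

```shell
#!/usr/bin/env bash
# Repro sketch for the daemon-reload trigger described above.
if command -v docker >/dev/null 2>&1 && command -v nvidia-smi >/dev/null 2>&1; then
    docker run -d --name nvml-repro --gpus all \
        nvidia/cuda:11.4.2-base-ubuntu18.04 sleep infinity
    docker exec nvml-repro nvidia-smi -L          # lists the GPUs
    sudo -n systemctl daemon-reload \
        || echo "need passwordless root for daemon-reload"
    # Per this report, the next probe fails with
    # "Failed to initialize NVML: Unknown Error"
    docker exec nvml-repro nvidia-smi -L
    docker rm -f nvml-repro
else
    echo "docker and/or the NVIDIA driver are not available here; skipping"
fi
```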

3. Information to attach (optional if deemed irrelevant)

Device Index: 0
Device Minor: 0
Model: NVIDIA TITAN X (Pascal)
Brand: TITAN
GPU UUID: GPU-9c416c82-d801-d28f-0867-dd438d4be914
Bus Location: 00000000:04:00.0
Architecture: 6.1

Device Index: 1
Device Minor: 1
Model: NVIDIA TITAN X (Pascal)
Brand: TITAN
GPU UUID: GPU-32a56b8c-943e-03e7-d539-3e97e5ef5f7a
Bus Location: 00000000:05:00.0
Architecture: 6.1

Device Index: 2
Device Minor: 2
Model: NVIDIA TITAN X (Pascal)
Brand: TITAN
GPU UUID: GPU-a0e33485-87cd-ceb1-2702-2c58a64a9dbe
Bus Location: 00000000:08:00.0
Architecture: 6.1

Device Index: 3
Device Minor: 3
Model: NVIDIA TITAN X (Pascal)
Brand: TITAN
GPU UUID: GPU-1ab2485c-121c-77db-6719-0b616d1673f4
Bus Location: 00000000:09:00.0
Architecture: 6.1

Device Index: 4
Device Minor: 4
Model: NVIDIA TITAN X (Pascal)
Brand: TITAN
GPU UUID: GPU-e7e3d7b6-ddce-355a-7988-80c4ba18319c
Bus Location: 00000000:0b:00.0
Architecture: 6.1

Device Index: 5
Device Minor: 5
Model: NVIDIA TITAN X (Pascal)
Brand: TITAN
GPU UUID: GPU-c16444fb-bedb-106d-c188-1f330773cf39
Bus Location: 00000000:84:00.0
Architecture: 6.1

Device Index: 6
Device Minor: 6
Model: NVIDIA TITAN X (Pascal)
Brand: TITAN
GPU UUID: GPU-2545ac9e-3ff1-8b38-8ad6-b8c82fea6cd0
Bus Location: 00000000:85:00.0
Architecture: 6.1

Device Index: 7
Device Minor: 7
Model: NVIDIA TITAN X (Pascal)
Brand: TITAN
GPU UUID: GPU-fcc35ab7-1afd-e678-b5f0-d1e1f8842d28
Bus Location: 00000000:89:00.0
Architecture: 6.1
I0902 21:40:53.687293 2836338 nvc.c:434] shutting down library context
I0902 21:40:53.687347 2836341 rpc.c:95] terminating nvcgo rpc service
I0902 21:40:53.687881 2836338 rpc.c:135] nvcgo rpc service terminated successfully
I0902 21:40:53.692819 2836340 rpc.c:95] terminating driver rpc service
I0902 21:40:53.693046 2836338 rpc.c:135] driver rpc service terminated successfully


 - [x] Kernel version from `uname -a`
 `Linux node5-4 5.15.0-46-generic #49-Ubuntu SMP Thu Aug 4 18:03:25 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux`

 - [x] Any relevant kernel output lines from `dmesg`
Nothing relevant in `dmesg`. The only relevant line in `journalctl` is
`Sep 02 21:17:56 node5-4 systemd[1]: Reloading.` after I run `systemctl daemon-reload`.

 - [x] Driver information from `nvidia-smi -a`

Fri Sep 2 21:22:32 2022
+-----------------------------------------------------------------------------+ | NVIDIA-SMI 510.68.02 Driver Version: 510.68.02 CUDA Version: 11.6 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | | | | MIG M. | |===============================+======================+======================| | 0 NVIDIA TITAN X ... On | 00000000:04:00.0 Off | N/A | | 23% 23C P8 8W / 250W | 0MiB / 12288MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 1 NVIDIA TITAN X ... On | 00000000:05:00.0 Off | N/A | | 23% 26C P8 9W / 250W | 0MiB / 12288MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 2 NVIDIA TITAN X ... On | 00000000:08:00.0 Off | N/A | | 23% 22C P8 7W / 250W | 0MiB / 12288MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 3 NVIDIA TITAN X ... On | 00000000:09:00.0 Off | N/A | | 23% 24C P8 8W / 250W | 0MiB / 12288MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 4 NVIDIA TITAN X ... On | 00000000:0B:00.0 Off | N/A | | 23% 26C P8 9W / 250W | 0MiB / 12288MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 5 NVIDIA TITAN X ... On | 00000000:84:00.0 Off | N/A | | 23% 25C P8 8W / 250W | 0MiB / 12288MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 6 NVIDIA TITAN X ... On | 00000000:85:00.0 Off | N/A | | 23% 22C P8 8W / 250W | 0MiB / 12288MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+ | 7 NVIDIA TITAN X ... 
On | 00000000:89:00.0 Off | N/A | | 23% 23C P8 7W / 250W | 0MiB / 12288MiB | 0% Default | | | | N/A | +-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+ | Processes: | | GPU GI CI PID Type Process name GPU Memory | | ID ID Usage | |=============================================================================| | No running processes found | +-----------------------------------------------------------------------------+


 - [x] Docker version from `docker version`

Client: Docker Engine - Community Version: 20.10.17 API version: 1.41 Go version: go1.17.11 Git commit: 100c701 Built: Mon Jun 6 23:02:46 2022 OS/Arch: linux/amd64 Context: default Experimental: true

Server: Docker Engine - Community Engine: Version: 20.10.17 API version: 1.41 (minimum version 1.12) Go version: go1.17.11 Git commit: a89b842 Built: Mon Jun 6 23:00:51 2022 OS/Arch: linux/amd64 Experimental: false containerd: Version: 1.6.4 GitCommit: 212e8b6fa2f44b9c21b2798135fc6fb7c53efc16 runc: Version: 1.1.1 GitCommit: v1.1.1-0-g52de29d docker-init: Version: 0.19.0 GitCommit: de40ad0


 - [x] NVIDIA packages version from `dpkg -l '*nvidia*'` _or_ `rpm -qa '*nvidia*'`

Desired=Unknown/Install/Remove/Purge/Hold | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad) ||/ Name Version Architecture Description +++-=============================-============-============-===================================================== ii libnvidia-container-tools 1.10.0-1 amd64 NVIDIA container runtime library (command-line tools) ii libnvidia-container1:amd64 1.10.0-1 amd64 NVIDIA container runtime library ii nvidia-container-runtime 3.10.0-1 all NVIDIA container runtime un nvidia-container-runtime-hook (no description available) ii nvidia-container-toolkit 1.10.0-1 amd64 NVIDIA container runtime hook


 - [x] NVIDIA container library version from `nvidia-container-cli -V`

cli-version: 1.10.0 lib-version: 1.10.0 build date: 2022-06-13T10:39+00:00 build revision: 395fd41701117121f1fd04ada01e1d7e006a37ae build compiler: x86_64-linux-gnu-gcc-7 7.5.0 build platform: x86_64 build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections


 - [x] NVIDIA container library logs (see [troubleshooting](https://github.com/NVIDIA/nvidia-docker/wiki/Troubleshooting))

I0902 22:11:39.880399 2840718 nvc.c:376] initializing library context (version=1.10.0, build=395fd41701117121f1fd04ada01e1d7e006a37ae) I0902 22:11:39.880483 2840718 nvc.c:350] using root / I0902 22:11:39.880501 2840718 nvc.c:351] using ldcache /etc/ld.so.cache
I0902 22:11:39.880514 2840718 nvc.c:352] using unprivileged user 65534:65534
I0902 22:11:39.880559 2840718 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL) I0902 22:11:39.880751 2840718 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment I0902 22:11:39.884769 2840724 nvc.c:278] loading kernel module nvidia
I0902 22:11:39.884931 2840724 nvc.c:282] running mknod for /dev/nvidiactl
I0902 22:11:39.884991 2840724 nvc.c:286] running mknod for /dev/nvidia0
I0902 22:11:39.885033 2840724 nvc.c:286] running mknod for /dev/nvidia1
I0902 22:11:39.885071 2840724 nvc.c:286] running mknod for /dev/nvidia2
I0902 22:11:39.885109 2840724 nvc.c:286] running mknod for /dev/nvidia3
I0902 22:11:39.885147 2840724 nvc.c:286] running mknod for /dev/nvidia4
I0902 22:11:39.885185 2840724 nvc.c:286] running mknod for /dev/nvidia5
I0902 22:11:39.885222 2840724 nvc.c:286] running mknod for /dev/nvidia6
I0902 22:11:39.885260 2840724 nvc.c:286] running mknod for /dev/nvidia7
I0902 22:11:39.885298 2840724 nvc.c:290] running mknod for all nvcaps in /dev/nvidia-caps I0902 22:11:39.892775 2840724 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config I0902 22:11:39.892935 2840724 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor I0902 22:11:39.899624 2840724 nvc.c:296] loading kernel module nvidia_uvm I0902 22:11:39.899673 2840724 nvc.c:300] running mknod for /dev/nvidia-uvm I0902 22:11:39.899778 2840724 nvc.c:305] loading kernel module nvidia_modeset
I0902 22:11:39.899820 2840724 nvc.c:309] running mknod for /dev/nvidia-modeset
I0902 22:11:39.900186 2840725 rpc.c:71] starting driver rpc service I0902 22:11:39.911718 2840726 rpc.c:71] starting nvcgo rpc service I0902 22:11:39.912892 2840718 nvc_container.c:240] configuring container with 'compute utility supervised' I0902 22:11:39.913283 2840718 nvc_container.c:88] selecting /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/local/cuda-11.4/compat/libcuda.so.470.129.06 I0902 22:11:39.913368 2840718 nvc_container.c:88] selecting /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/local/cuda-11.4/compat/libnvidia-ptxjitcompiler.so.470.129.06 I0902 22:11:39.915116 2840718 nvc_container.c:262] setting pid to 2840712 I0902 22:11:39.915147 2840718 nvc_container.c:263] setting rootfs to /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged I0902 22:11:39.915160 2840718 nvc_container.c:264] setting owner to 0:0 I0902 22:11:39.915171 2840718 nvc_container.c:265] setting bins directory to /usr/bin I0902 22:11:39.915182 2840718 nvc_container.c:266] setting libs directory to /usr/lib/x86_64-linux-gnu I0902 22:11:39.915193 2840718 nvc_container.c:267] setting libs32 directory to /usr/lib/i386-linux-gnu I0902 22:11:39.915204 2840718 nvc_container.c:268] setting cudart directory to /usr/local/cuda I0902 22:11:39.915215 2840718 nvc_container.c:269] setting ldconfig to @/sbin/ldconfig.real (host relative) I0902 22:11:39.915228 2840718 nvc_container.c:270] setting mount namespace to /proc/2840712/ns/mnt I0902 22:11:39.915240 2840718 nvc_container.c:272] detected cgroupv2 I0902 22:11:39.915271 2840718 nvc_container.c:273] setting devices cgroup to /sys/fs/cgroup/system.slice/docker-5fff6f80850791d3858cb511015581375d55ae42df5eb98262ceae31ed47a7d5.scope I0902 22:11:39.915292 2840718 nvc_info.c:766] requesting driver information with '' I0902 22:11:39.916901 2840718 nvc_info.c:173] selecting 
/usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.510.68.02 I0902 22:11:39.917076 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.510.68.02 I0902 22:11:39.917165 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.510.68.02
I0902 22:11:39.917236 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.510.68.02 I0902 22:11:39.917318 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.510.68.02
I0902 22:11:39.917411 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.510.68.02 I0902 22:11:39.917503 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.510.68.02 I0902 22:11:39.917574 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.510.68.02 I0902 22:11:39.917639 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.510.68.02 I0902 22:11:39.917730 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.510.68.02 I0902 22:11:39.917794 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.510.68.02
I0902 22:11:39.917859 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.510.68.02
I0902 22:11:39.917926 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.510.68.02 I0902 22:11:39.918018 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.510.68.02 I0902 22:11:39.918109 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.510.68.02
I0902 22:11:39.918176 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.510.68.02
I0902 22:11:39.918243 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.510.68.02
I0902 22:11:39.918335 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.510.68.02
I0902 22:11:39.918429 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.510.68.02
I0902 22:11:39.918628 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.510.68.02
I0902 22:11:39.918758 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.510.68.02
I0902 22:11:39.918827 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.510.68.02
I0902 22:11:39.918896 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.510.68.02
I0902 22:11:39.918968 2840718 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.510.68.02 W0902 22:11:39.919005 2840718 nvc_info.c:399] missing library libnvidia-nscq.so W0902 22:11:39.919022 2840718 nvc_info.c:399] missing library libcudadebugger.so W0902 22:11:39.919035 2840718 nvc_info.c:399] missing library libnvidia-fatbinaryloader.so W0902 22:11:39.919049 2840718 nvc_info.c:399] missing library libnvidia-pkcs11.so
W0902 22:11:39.919061 2840718 nvc_info.c:399] missing library libnvidia-ifr.so
W0902 22:11:39.919074 2840718 nvc_info.c:399] missing library libnvidia-cbl.so W0902 22:11:39.919088 2840718 nvc_info.c:403] missing compat32 library libnvidia-ml.so W0902 22:11:39.919107 2840718 nvc_info.c:403] missing compat32 library libnvidia-cfg.so W0902 22:11:39.919119 2840718 nvc_info.c:403] missing compat32 library libnvidia-nscq.so
W0902 22:11:39.919131 2840718 nvc_info.c:403] missing compat32 library libcuda.so
W0902 22:11:39.919144 2840718 nvc_info.c:403] missing compat32 library libcudadebugger.so
W0902 22:11:39.919156 2840718 nvc_info.c:403] missing compat32 library libnvidia-opencl.so W0902 22:11:39.919168 2840718 nvc_info.c:403] missing compat32 library libnvidia-ptxjitcompiler.so
W0902 22:11:39.919192 2840718 nvc_info.c:403] missing compat32 library libnvidia-fatbinaryloader.so W0902 22:11:39.919206 2840718 nvc_info.c:403] missing compat32 library libnvidia-allocator.so
W0902 22:11:39.919218 2840718 nvc_info.c:403] missing compat32 library libnvidia-compiler.so W0902 22:11:39.919230 2840718 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so
W0902 22:11:39.919242 2840718 nvc_info.c:403] missing compat32 library libnvidia-ngx.so W0902 22:11:39.919254 2840718 nvc_info.c:403] missing compat32 library libvdpau_nvidia.so W0902 22:11:39.919266 2840718 nvc_info.c:403] missing compat32 library libnvidia-encode.so W0902 22:11:39.919279 2840718 nvc_info.c:403] missing compat32 library libnvidia-opticalflow.so W0902 22:11:39.919291 2840718 nvc_info.c:403] missing compat32 library libnvcuvid.so W0902 22:11:39.919304 2840718 nvc_info.c:403] missing compat32 library libnvidia-eglcore.so W0902 22:11:39.919317 2840718 nvc_info.c:403] missing compat32 library libnvidia-glcore.so
W0902 22:11:39.919329 2840718 nvc_info.c:403] missing compat32 library libnvidia-tls.so W0902 22:11:39.919341 2840718 nvc_info.c:403] missing compat32 library libnvidia-glsi.so W0902 22:11:39.919353 2840718 nvc_info.c:403] missing compat32 library libnvidia-fbc.so W0902 22:11:39.919365 2840718 nvc_info.c:403] missing compat32 library libnvidia-ifr.so W0902 22:11:39.919377 2840718 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so W0902 22:11:39.919388 2840718 nvc_info.c:403] missing compat32 library libnvoptix.so W0902 22:11:39.919401 2840718 nvc_info.c:403] missing compat32 library libGLX_nvidia.so W0902 22:11:39.919413 2840718 nvc_info.c:403] missing compat32 library libEGL_nvidia.so W0902 22:11:39.919426 2840718 nvc_info.c:403] missing compat32 library libGLESv2_nvidia.so W0902 22:11:39.919438 2840718 nvc_info.c:403] missing compat32 library libGLESv1_CM_nvidia.so W0902 22:11:39.919451 2840718 nvc_info.c:403] missing compat32 library libnvidia-glvkspirv.so W0902 22:11:39.919463 2840718 nvc_info.c:403] missing compat32 library libnvidia-cbl.so I0902 22:11:39.919856 2840718 nvc_info.c:299] selecting /usr/bin/nvidia-smi I0902 22:11:39.919895 2840718 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump I0902 22:11:39.919931 2840718 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced I0902 22:11:39.919985 2840718 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control I0902 22:11:39.920022 2840718 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-server W0902 22:11:39.920096 2840718 nvc_info.c:425] missing binary nv-fabricmanager I0902 22:11:39.920152 2840718 nvc_info.c:343] listing firmware path /usr/lib/firmware/nvidia/510.68.02/gsp.bin I0902 22:11:39.920200 2840718 nvc_info.c:529] listing device /dev/nvidiactl
I0902 22:11:39.920215 2840718 nvc_info.c:529] listing device /dev/nvidia-uvm
I0902 22:11:39.920228 2840718 nvc_info.c:529] listing device /dev/nvidia-uvm-tools
I0902 22:11:39.920240 2840718 nvc_info.c:529] listing device /dev/nvidia-modeset
W0902 22:11:39.920281 2840718 nvc_info.c:349] missing ipc path /var/run/nvidia-persistenced/socket
W0902 22:11:39.920324 2840718 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket
W0902 22:11:39.920355 2840718 nvc_info.c:349] missing ipc path /tmp/nvidia-mps
I0902 22:11:39.920371 2840718 nvc_info.c:822] requesting device information with ''
I0902 22:11:39.927586 2840718 nvc_info.c:713] listing device /dev/nvidia0 (GPU-9c416c82-d801-d28f-0867-dd438d4be914 at 00000000:04:00.0)
I0902 22:11:39.934626 2840718 nvc_info.c:713] listing device /dev/nvidia1 (GPU-32a56b8c-943e-03e7-d539-3e97e5ef5f7a at 00000000:05:00.0)
I0902 22:11:39.941796 2840718 nvc_info.c:713] listing device /dev/nvidia2 (GPU-a0e33485-87cd-ceb1-2702-2c58a64a9dbe at 00000000:08:00.0) I0902 22:11:39.949011 2840718 nvc_info.c:713] listing device /dev/nvidia3 (GPU-1ab2485c-121c-77db-6719-0b616d1673f4 at 00000000:09:00.0) I0902 22:11:39.956304 2840718 nvc_info.c:713] listing device /dev/nvidia4 (GPU-e7e3d7b6-ddce-355a-7988-80c4ba18319c at 00000000:0b:00.0)
I0902 22:11:39.963862 2840718 nvc_info.c:713] listing device /dev/nvidia5 (GPU-c16444fb-bedb-106d-c188-1f330773cf39 at 00000000:84:00.0)
I0902 22:11:39.971543 2840718 nvc_info.c:713] listing device /dev/nvidia6 (GPU-2545ac9e-3ff1-8b38-8ad6-b8c82fea6cd0 at 00000000:85:00.0)
I0902 22:11:39.979406 2840718 nvc_info.c:713] listing device /dev/nvidia7 (GPU-fcc35ab7-1afd-e678-b5f0-d1e1f8842d28 at 00000000:89:00.0)
I0902 22:11:39.979522 2840718 nvc_mount.c:366] mounting tmpfs at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia
I0902 22:11:39.980084 2840718 nvc_mount.c:134] mounting /usr/bin/nvidia-smi at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/bin/nvidia-smi
I0902 22:11:39.980181 2840718 nvc_mount.c:134] mounting /usr/bin/nvidia-debugdump at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/bin/nvidia-debugdump
I0902 22:11:39.980273 2840718 nvc_mount.c:134] mounting /usr/bin/nvidia-persistenced at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/bin/nvidia-persistenced
I0902 22:11:39.980360 2840718 nvc_mount.c:134] mounting /usr/bin/nvidia-cuda-mps-control at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/bin/nvidia-cuda-mps-control
I0902 22:11:39.980443 2840718 nvc_mount.c:134] mounting /usr/bin/nvidia-cuda-mps-server at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/bin/nvidia-cuda-mps-server
I0902 22:11:39.980696 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.510.68.02
I0902 22:11:39.980795 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.510.68.02
I0902 22:11:39.980919 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libcuda.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libcuda.so.510.68.02 I0902 22:11:39.981004 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.510.68.02 I0902 22:11:39.981090 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.510.68.02 I0902 22:11:39.981182 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.510.68.02 I0902 22:11:39.981272 2840718 nvc_mount.c:134] mounting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.510.68.02 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.510.68.02 I0902 22:11:39.981314 2840718 nvc_mount.c:527] creating symlink /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1 I0902 22:11:39.981482 2840718 nvc_mount.c:134] mounting /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/local/cuda-11.4/compat/libcuda.so.470.129.06 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libcuda.so.470.129.06 I0902 22:11:39.981569 2840718 nvc_mount.c:134] mounting 
/var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/local/cuda-11.4/compat/libnvidia-ptxjitcompiler.so.470.129.06 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.470.129.06 I0902 22:11:39.981887 2840718 nvc_mount.c:85] mounting /usr/lib/firmware/nvidia/510.68.02/gsp.bin at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/lib/firmware/nvidia/510.68.02/gsp.bin with flags 0x7 I0902 22:11:39.981971 2840718 nvc_mount.c:230] mounting /dev/nvidiactl at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidiactl I0902 22:11:39.982876 2840718 nvc_mount.c:230] mounting /dev/nvidia-uvm at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia-uvm I0902 22:11:39.983470 2840718 nvc_mount.c:230] mounting /dev/nvidia-uvm-tools at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia-uvm-tools I0902 22:11:39.983976 2840718 nvc_mount.c:230] mounting /dev/nvidia0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia0 I0902 22:11:39.984099 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:04:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:04:00.0 I0902 22:11:39.984695 2840718 nvc_mount.c:230] mounting /dev/nvidia1 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia1 I0902 22:11:39.984812 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:05:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:05:00.0 I0902 22:11:39.985425 
2840718 nvc_mount.c:230] mounting /dev/nvidia2 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia2 I0902 22:11:39.985541 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:08:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:08:00.0 I0902 22:11:39.986207 2840718 nvc_mount.c:230] mounting /dev/nvidia3 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia3 I0902 22:11:39.986322 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:09:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:09:00.0 I0902 22:11:39.986963 2840718 nvc_mount.c:230] mounting /dev/nvidia4 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia4 I0902 22:11:39.987076 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:0b:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:0b:00.0 I0902 22:11:39.987794 2840718 nvc_mount.c:230] mounting /dev/nvidia5 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia5 I0902 22:11:39.987907 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:84:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:84:00.0 I0902 22:11:39.988593 2840718 nvc_mount.c:230] mounting /dev/nvidia6 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia6 I0902 22:11:39.988707 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:85:00.0 at 
/var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:85:00.0 I0902 22:11:39.989388 2840718 nvc_mount.c:230] mounting /dev/nvidia7 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/dev/nvidia7 I0902 22:11:39.989515 2840718 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:89:00.0 at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged/proc/driver/nvidia/gpus/0000:89:00.0 I0902 22:11:39.990197 2840718 nvc_ldcache.c:372] executing /sbin/ldconfig.real from host at /var/lib/docker/overlay2/3ae89034ae9bd6d26c73c7c2587c80de8dc36ef8485b569f323cb6933c838e45/merged I0902 22:11:40.012422 2840718 nvc.c:434] shutting down library context I0902 22:11:40.012510 2840726 rpc.c:95] terminating nvcgo rpc service I0902 22:11:40.013110 2840718 rpc.c:135] nvcgo rpc service terminated successfully I0902 22:11:40.018693 2840725 rpc.c:95] terminating driver rpc service I0902 22:11:40.018995 2840718 rpc.c:135] driver rpc service terminated successfully


 - [x] Docker command, image and tag used

docker run --gpus all --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 bash
nvidia-smi



### Other open issues
NVIDIA/nvidia-container-toolkit#251, but that one is using cgroup v1
NVIDIA/nvidia-docker#1661, but there isn't any information posted and it's on Ubuntu 20.04 instead of 22.04

### Important notes / workaround
With containerd.io v1.6.7 or v1.6.8, even with `no-cgroups = true` in `/etc/nvidia-container-runtime/config.toml` and the devices passed explicitly to `docker run`, containers still get `Failed to initialize NVML: Unknown Error` after a `systemctl daemon-reload`.

Downgrading containerd.io to 1.6.6 works as long as you set `no-cgroups = true` in `/etc/nvidia-container-runtime/config.toml` and pass the devices to `docker run`, like `docker run --gpus all --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia-modeset:/dev/nvidia-modeset --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools --device /dev/nvidiactl:/dev/nvidiactl --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 bash`
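For reference, a sketch of the config fragment this workaround relies on. The `[nvidia-container-cli]` section name is the one used by the toolkit's standard config file; verify against your installed version, and leave the other settings at their defaults:

```toml
[nvidia-container-cli]
# Let the --device flags passed to `docker run` manage device access,
# instead of libnvidia-container editing the container's cgroup itself.
no-cgroups = true
```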
kevin-bockman commented 2 years ago

@elezar Previously persistence mode was off, so this happens either way.

Also, on k8s-device-plugin/issues/289 @klueska said: The only thing we've seen that fully resolves the issue is to upgrade to an "experimental" version of our NVIDIA container runtime that bypasses the need for libnvidia-container to change cgroup permissions out from underneath runC. Was that merged, or is it something I should try?

elezar commented 2 years ago

@kevin-bockman the experimental mode is still a work in progress and we don't have a concrete timeline on when this will be available for testing. I will update the issue here as soon as I have more information.

klueska commented 2 years ago

The other option is to move to cgroupv2. Since devices are not an actual subsystem in cgroupv2, there is no chance for containerd to undo what libnvidia-container has done under the hood after a restart.

kevin-bockman commented 2 years ago

@klueska Sorry, with all of the information, it wasn't really clear. The problem is that it's already on cgroupv2, AFAIK. I started from a fresh install of Ubuntu 22.04.1, and `docker info` says so at least.

The only way I could get this to work after a `systemctl daemon-reload` is downgrading containerd.io to 1.6.6 and specifying `no-cgroups`. The other interesting thing is that with containerd v1.6.7 or v1.6.8, even specifying `no-cgroups` still had the issue, so I'm wondering if there's more than one issue here. I know cgroup v2 has 'fixed' the issue for some people, or so they think (this can look intermittent if you don't know that the reload triggers it), but it hasn't fixed it for everyone unless I'm missing something: it doesn't work on a fresh install after doing a daemon reload, or after just waiting for something to be triggered by the OS.

$ docker info
Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.8.2-docker)

Server:
 Containers: 4
  Running: 4
  Paused: 0
  Stopped: 0
 Images: 4
 Server Version: 20.10.17
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
 runc version: v1.1.4-0-g5fd4c4d
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
  cgroupns
 Kernel Version: 5.15.0-46-generic
 Operating System: Ubuntu 22.04.1 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 32
 Total Memory: 94.36GiB
 Name: node5-4
 ID: PPB6:APYD:PKMA:BIOZ:2Y3H:LZUV:TPHD:SBZE:XRSL:NJCB:PWMX:ZVBY
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
mf-giwoong-lee commented 2 years ago

@kevin-bockman I had a similar experience.

In my case,

docker run -it --device /dev/nvidiactl:/dev/nvidiactl \
--device /dev/nvidia-uvm:/dev/nvidia-uvm  \
--device /dev/nvidia0:/dev/nvidia0 \
--device /dev/nvidia1:/dev/nvidia1 \
--device /dev/nvidia2:/dev/nvidia2 \
--device /dev/nvidia3:/dev/nvidia3 \
--name <container_name> <image_name>
(Replace/repeat nvidia0 with other/more devices as needed.)

This setting works on some machines and not on others. Finally, I found that the working machines have containerd.io version 1.4.6-1 (Ubuntu 18.04)!!! On the Ubuntu 20.04 machine, containerd.io version 1.5.2-1 makes it work.

I tried to downgrade and upgrade the version of containerd.io to check this strategy works or not. It works for me.

mf-giwoong-lee commented 2 years ago

The above is not the answer...

It prevents the NVML error caused by a docker resource update, but the NVML error still occurs after a random amount of time.

theluke commented 1 year ago

Same issue. Ubuntu 22, docker-ce. I will just end up writing a cron job script to check for the error and restart the container.
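A minimal sketch of such a watchdog (hypothetical script, not from this thread; it assumes every running container is expected to have GPU access, so adjust the `docker ps` filter to your setup before putting it in cron):

```shell
#!/usr/bin/env bash
# Watchdog sketch: restart any container whose nvidia-smi reports the NVML error.

# True when stdin contains the NVML failure message.
is_nvml_error() {
  grep -q 'Failed to initialize NVML'
}

# Only attempt the check where docker is actually available.
if command -v docker >/dev/null 2>&1; then
  for c in $(docker ps -q 2>/dev/null); do
    if docker exec "$c" nvidia-smi 2>&1 | is_nvml_error; then
      docker restart "$c"
    fi
  done
fi
```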

iFede94 commented 1 year ago

The solution proposed by @kevin-bockman has been working without any problem for more than a month now.

Downgrading containerd.io to 1.6.6 works as long as you specify no-cgroups = true in /etc/nvidia-container-runtime/config.toml and specify the devices to docker run like docker run --gpus all --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia-modeset:/dev/nvidia-modeset --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools --device /dev/nvidiactl:/dev/nvidiactl --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 bash

theluke commented 1 year ago

I am using docker-ce on Ubuntu 22, so I opted for this approach, working fine so far.

myron commented 1 year ago

same issue on Nvidia 3090 Ubuntu 22.04.1 LTS, Driver Version: 510.85.02 CUDA Version: 11.6

fradsj commented 1 year ago

Hello there.

I'm hitting the same issue here, but with containerd rather than docker.

Here's my configuration:

Note that the Nvidia's container toolkit has been installed with the Nvidia's GPU operator on Kubernetes (v1.25.3).

I attached the containerd configuration file and the nvidia-container-runtime configuration file to my comment. containerd.txt nvidia-container-runtime.txt

How I reproduce this bug:

Running on my host the following command:

# nerdctl run -n k8s.io --runtime=/usr/local/nvidia/toolkit/nvidia-container-runtime --network=host --rm -ti --name ubuntu --gpus all -v /run/nvidia/driver/usr/bin:/tmp/nvidia-bin docker.io/library/ubuntu:latest bash

After some time, the `nvidia-smi` command exits with the error Failed to initialize NVML: Unknown Error.

Traces, logs, etc...

Thank you very much for your help. 🙏

gengwg commented 1 year ago

Here I wrote the detailed steps how I fixed this issue in our env with cgroup v2. Let me know if it works in your env.

https://gist.github.com/gengwg/55b3eb2bc22bcbd484fccbc0978484fc

GuillaumeSmaha commented 1 year ago

Here I wrote the detailed steps how I fixed this issue in our env with cgroup v2. Let me know if it works in your env.

https://gist.github.com/gengwg/55b3eb2bc22bcbd484fccbc0978484fc

@gengwg Can you check whether your solution survives a sudo systemctl daemon-reload on the host? In my case (cgroupv1), it directly breaks the pod; from inside the pod, nvidia-smi returns Failed to initialize NVML: Unknown Error.

gengwg commented 1 year ago

Yes, that's actually the first thing I tested when I upgraded v1 --> v2. It's easy to test, because it doesn't require waiting a few hours/days.

To double-check, I just tested it again right now.

Before:

$ k exec -it gengwg-test-gpu-9 -- nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-212a30ad-0ea4-8201-1be0-cdc575e55034)

Do the reload on that node itself:

# systemctl daemon-reload

After:

$ k exec -it gengwg-test-gpu-9 -- nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-212a30ad-0ea4-8201-1be0-cdc575e55034)

I will update the note to reflect this test too.

gengwg commented 1 year ago

And I can also confirm that's what I saw on our cgroupv1 nodes too, i.e. sudo systemctl daemon-reload immediately breaks nvidia-smi.

panli889 commented 1 year ago

Here I wrote the detailed steps how I fixed this issue in our env with cgroup v2. Let me know if it works in your env.

https://gist.github.com/gengwg/55b3eb2bc22bcbd484fccbc0978484fc

Hi, what's your cgroup driver for kubelet and containerd? We met the same problem on cgroup v2. Our cgroup driver is systemd, but if we switch the cgroup driver to cgroupfs, the problem disappears. I think it's the systemd cgroup driver that causes the problem.

Also, if we switch the cgroup driver of docker to cgroupfs, it will also solve the problem.

panli889 commented 1 year ago

Important notes / workaround

containerd.io v1.6.7 or v1.6.8 even with no-cgroups = true in /etc/nvidia-container-runtime/config.toml and specifying the devices to docker run gives Failed to initialize NVML: Unknown Error after a systemctl daemon-reload.

Downgrading containerd.io to 1.6.6 works as long as you specify no-cgroups = true in /etc/nvidia-container-runtime/config.toml and specify the devices to docker run like docker run --gpus all --device /dev/nvidia0:/dev/nvidia0 --device /dev/nvidia-modeset:/dev/nvidia-modeset --device /dev/nvidia-uvm:/dev/nvidia-uvm --device /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools --device /dev/nvidiactl:/dev/nvidiactl --rm -it nvidia/cuda:11.4.2-base-ubuntu18.04 bash

I've also tried this way. The reason containerd 1.6.7 can't work is that runc was updated to 1.1.3; in this version runc will ignore char devices that can't be os.Stat'ed, as introduced in this PR. Unfortunately, the GPU-related devices are exactly that kind of device, so it goes wrong.

fradsj commented 1 year ago

@gengwg Thanks for sharing your document. As I run my kubernetes cluster on ubuntu 22.04, cgroupv2 is the default cgroup subsystem used.

I deployed two environments to help me make some comparisons:

Interestingly, I never face this issue in the second environment; everything runs perfectly well.

The first environment, though, runs into this issue after some time.

That would probably mean that Nvidia's container runtime isn't the faulty component here, but it needs more investigation on my side to be sure that I'm not missing anything.

I'll have a look at the cgroup driver as @panli889 mentioned.

Thanks again for your help

gengwg commented 1 year ago

The cgroup driver for kubelet, docker and containerd is systemd for all of them. In fact, on cgroupv1 we used to use cgroupfs, but kubelet wouldn't start, complaining about a mismatch between the kubelet and docker cgroup drivers. After I changed the docker (and containerd) cgroup driver to systemd, kubelet was able to start.

# cat /etc/systemd/system/kubelet.service | grep -i cgroup
  --runtime-cgroups=/systemd/system.slice \
  --kubelet-cgroups=/systemd/system.slice \
  --cgroup-driver=systemd \

We are in the middle of migrating from docker to containerd, so we have both docker and containerd nodes. This seems to have fixed it for BOTH.

Docker nodes:

# docker info | grep -i cgroup
WARNING: No swap limit support
 Cgroup Driver: systemd
 Cgroup Version: 2
  cgroupns

Containerd nodes:

$ sudo crictl info | grep -i cgroup
            "SystemdCgroup": true
            "SystemdCgroup": true
    "systemdCgroup": false,
    "disableCgroup": false,

Here is our k8s version:

$ k version --short
Client Version: v1.21.3
Server Version: v1.22.9
gengwg commented 1 year ago

@gengwg Thanks for sharing your document. As I run my kubernetes cluster on ubuntu 22.04, cgroupv2 is the default cgroup subsystem used.

I deployed two environments to help me making some comparisons:

  • One environment is running kubernetes v1.25.3, with Nvidia's GPU operator
  • One environment with only containerd & nvidia-container-toolkit

Interestingly, I never face this issue in the second environment; everything runs perfectly well.

The first environment, though, runs into this issue after some time.

That would probably mean that Nvidia's container runtime isn't the faulty component here, but it needs more investigation on my side to be sure that I'm not missing anything.

I'll have a look at the cgroup driver as @panli889 mentioned.

Thanks again for your help

I think ours is similar to your 2nd env, i.e. containerd & nvidia-container-toolkit. We are on k8s v1.22.9.

# containerd --version
containerd containerd.io 1.6.6 10c12954828e7c7c9b6e0ea9b0c02b01407d3ae1

# dnf info nvidia-container-toolkit | grep Version
Version      : 1.11.0

i posted cgroup driver info above.

panli889 commented 1 year ago

@gengwg thx for your reply!

cgroup driver for kubelet, docker and containerd are all systemd.

Hmm, that's interesting, it's quite different from my situation. Would you please share your systemd version?

I can share the problems we met. If we create a pod with a GPU, a related systemd scope such as cri-containerd-xxxxxx.scope is created at the same time, and it records the cgroup info. If we run systemctl status to check it:

Warning: The unit file, source configuration file or drop-ins of cri-containerd-xxxxx.scope changed on disk. Run 'systemctl daemon-reload' to  reload units.
● cri-containerd-xxx.scope - libcontainer container xxxx
     Loaded: loaded (/run/systemd/transient/cri-containerd-xxxx.scope; transient)
  Transient: yes
    Drop-In: /run/systemd/transient/cri-containerd-xxxxx.scope.d
             └─50-DevicePolicy.conf, 50-DeviceAllow.conf, 50-CPUWeight.conf, 50-CPUQuotaPeriodSec.conf, 50-CPUQuota.conf, 50-AllowedCPUs.conf
     Active: active (running) since Fri 2022-11-25 12:13:33 +08; 1min 47s ago
         IO: 404.0K read, 0B written
      Tasks: 1
     Memory: 528.0K
        CPU: 2.562s
     CGroup: /kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podb6b36d39_ef5b_4eb9_850d_d710bbd06096.slice/cri-containerd-xxx.scope>
             └─61265 sleep infinity

And if we check the content of the file 50-DeviceAllow.conf, we find no GPU device info in there. Then if we run systemctl daemon-reload, systemd regenerates the eBPF cgroup device program from that list, and it blocks access to the GPU devices.

So would you please also take a look at the content of DeviceAllow.conf for the systemd scope of some pod, and see what's in there?

numerical2017 commented 1 year ago

Same issue with 2 x Nvidia 3090 Ti, Ubuntu 22.04.1 LTS, Driver Version: 510.85.02, CUDA Version: 11.6. I adopted the solution proposed by @kevin-bockman, downgrading containerd.io from 1.6.10 to 1.6.6. After running systemctl daemon-reload on the host machine, nvidia-smi within the container still works properly. I will check how long it lasts and I'll keep you updated.

fradsj commented 1 year ago

@panli889 I checked the scope unit with systemctl status, and this message popped up:

Warning: The unit file, source configuration file or drop-ins of cri-containerd-d35333ac42f1e08a33632fccd63028a28443f95f3c126860a8c9da20b6d27102.scope changed on disk. Run 'systemctl daemon-reload' to reload units.

After running systemctl daemon-reload, I get the error on my container:

root@ubuntu:/# nvidia-smi
Failed to initialize NVML: Unknown Error

Here's the content of the 50-DeviceAllow.conf file:

[Scope]
DeviceAllow=
DeviceAllow=char-pts rwm
DeviceAllow=/dev/char/10:200 rwm
DeviceAllow=/dev/char/5:2 rwm
DeviceAllow=/dev/char/5:0 rwm
DeviceAllow=/dev/char/1:9 rwm
DeviceAllow=/dev/char/1:8 rwm
DeviceAllow=/dev/char/1:7 rwm
DeviceAllow=/dev/char/1:5 rwm
DeviceAllow=/dev/char/1:3 rwm
DeviceAllow=char-* m
DeviceAllow=block-* m

There's indeed no reference in there to NVIDIA's device nodes:

crw-rw-rw- 1 root root 195, 254 Nov 29 10:18 nvidia-modeset
crw-rw-rw- 1 root root 234,   0 Nov 29 10:18 nvidia-uvm
crw-rw-rw- 1 root root 234,   1 Nov 29 10:18 nvidia-uvm-tools
crw-rw-rw- 1 root root 195,   0 Nov 29 10:18 nvidia0
crw-rw-rw- 1 root root 195, 255 Nov 29 10:18 nvidiactl

nvidia-caps:
total 0
cr-------- 1 root root 237, 1 Nov 29 10:18 nvidia-cap1
cr--r--r-- 1 root root 237, 2 Nov 29 10:18 nvidia-cap2
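A quick scripted check for the situation above, assuming (as in this listing) that the NVIDIA char devices use majors 195, 234 and 237 on this host. This is a hypothetical helper, not part of any official tooling:

```python
import re

# Majors observed in the /dev listing above: 195 (nvidia*), 234 (nvidia-uvm*),
# 237 (nvidia-caps). They can vary per system; check `ls -l /dev/nvidia*`.
NVIDIA_MAJORS = {195, 234, 237}

def missing_nvidia_allows(device_allow_conf: str, majors=NVIDIA_MAJORS) -> bool:
    """Return True if no DeviceAllow= line references an NVIDIA device major."""
    allowed = set()
    for line in device_allow_conf.splitlines():
        m = re.match(r"DeviceAllow=/dev/char/(\d+):\d+", line.strip())
        if m:
            allowed.add(int(m.group(1)))
    return not (allowed & majors)

conf = """\
[Scope]
DeviceAllow=
DeviceAllow=char-pts rwm
DeviceAllow=/dev/char/10:200 rwm
DeviceAllow=/dev/char/1:3 rwm
"""
print(missing_nvidia_allows(conf))  # True: this scope has lost its GPU entries
```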
panli889 commented 1 year ago

@fradsj thanks for your reply, it seems to be the same problem as ours.

Here is how we solve it, hope it will help:

Navino16 commented 1 year ago

Hi,

Is there any official way to fix this error?

klueska commented 1 year ago

The official way is in the works.

It is based on using a new specification called CDI to do the GPU device injection, rather than relying on a runc hook to do the GPU device injection behind the back of containerd (which is a fundamental, architectural flaw of the existing nvidia-container-runtime, and is the underlying cause of all these problems).

Until a version of both (1) the nvidia-container-runtime and (2) the k8s-device-plugin are released with proper support for CDI, you will need to rely on one of the workarounds described here.

There is no "official" workaround as such, but the workaround described in https://github.com/NVIDIA/nvidia-docker/issues/1671#issuecomment-1330466432 seems like the best one from my perspective. It relies on the already documented use of --pass-device-specs=true in the k8s-device-plugin (which has been the workaround for years until now) combined with downgrading to a version of runc which doesn't trigger the GPUs to be ignored.

gengwg commented 1 year ago

> Hmm, that's interesting, it's quite different from my situation. Would you please share your systemd version?
>
> [...]
>
> So would you please also take a look at the content of DeviceAllow.conf for some systemd scope of pod, what's in there?

@panli889 sorry for the late reply, I was on vacation.

systemd version:

$ systemctl --version
systemd 239 (239-58.el8)

After spinning up a pod on a node:

$ k exec -it gengwg-test-gpu-9 -- nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-3836675c-e987-1f01-7ce7-12da20038909)

I don't see the systemd scope nor the DeviceAllow files.

$ find /etc/systemd/ | grep scope
$ sudo find /etc/ | grep -i DeviceAllow
gengwg commented 1 year ago

Checked those on our env.

> Here is how we solve it, hope it will help:

We didn't use the --pass-device-specs=true option, but we do have allowPrivilegeEscalation: false, which looks like a different thing.

$ k get ds nvidia-device-plugin-daemonset -n kube-system -o yaml
....
    spec:
      containers:
      - args:
        - --fail-on-init-error=false
        image: xxxxx.com/k8s-device-plugin:v0.9.0
        imagePullPolicy: IfNotPresent
        name: nvidia-device-plugin-ctr
        resources: {}
        securityContext:
          allowPrivilegeEscalation: false # <------
          capabilities:
            drop:
            - ALL
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /var/lib/kubelet/device-plugins
          name: device-plugin
      dnsPolicy: ClusterFirst
....

Luckily we are just below 1.1.3. We pinned the version on the repo side through CentOS composes, so this should be safe as long as we do not advance the compose version.

$ runc --version
runc version 1.1.2
commit: v1.1.2-0-ga916309
spec: 1.0.2-dev
go: go1.17.11
libseccomp: 2.5.2
fradsj commented 1 year ago

@panli889 running the device plugin with runc in v1.1.2 seems to fix the situation, as the GPUs are listed in the DeviceAllow file of the cgroup of the container:

[Scope]
DeviceAllow=
DeviceAllow=/dev/char/195:255 rw
DeviceAllow=/dev/char/195:0 rw
DeviceAllow=char-pts rwm
DeviceAllow=/dev/char/10:200 rwm
DeviceAllow=/dev/char/5:2 rwm
DeviceAllow=/dev/char/5:0 rwm
DeviceAllow=/dev/char/1:9 rwm
DeviceAllow=/dev/char/1:8 rwm
DeviceAllow=/dev/char/1:7 rwm
DeviceAllow=/dev/char/1:5 rwm
DeviceAllow=/dev/char/1:3 rwm
DeviceAllow=char-* m
DeviceAllow=block-* m

Thank you very much for your help !

@klueska it's surprising to see that NVIDIA's GPUs are not listed in the /dev/char directory, as that's where runc expects to find them. Do you know if that's expected by NVIDIA's driver developers?

For the CDI, do you know if the Kubernetes community is working with you on this, and if any release cycle has been decided yet?

Thank you very much.

klueska commented 1 year ago

I was able to reproduce this and verify that manually creating symlinks to the various nvidia devices in /dev/char resolves the issue. I need to talk to our driver team to determine why these are not automatically created and how to get them created going forward.

At least we seem to fully understand the problem now, and know what is necessary to resolve it. In the meantime, I would recommend creating these symlinks manually to work around this issue.

KyonP commented 1 year ago

Having almost the same issue with a Quadro RTX 8000 cluster server.

I hope there is a quick solution before the official fix.

I have to keep restarting my Docker container whenever I hit this issue.

superbrothers commented 1 year ago

GPU Operator seems to have had a release that contained a workaround. https://github.com/NVIDIA/gpu-operator/issues/430#issuecomment-1413119124

Since I am not using GPU Operator, I have a small tool that does the same thing. I can confirm that this solves the problem in my environment. https://gist.github.com/superbrothers/5bbb80e15a7f3ad994f789165dce2938
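The core of any such tool is the same mapping: each NVIDIA device node's (major, minor) pair becomes a /dev/char/&lt;major&gt;:&lt;minor&gt; symlink pointing back at the node, which is where runc looks devices up. A minimal sketch of the planning step (hypothetical helper; real tools also stat the actual nodes and handle a custom driver root):

```python
def plan_dev_char_symlinks(device_nodes, char_dir="/dev/char"):
    """Plan /dev/char/<major>:<minor> -> ../<name> symlinks.

    device_nodes: mapping of device name -> (major, minor),
    e.g. {"nvidia0": (195, 0)}. Pure planning; no filesystem writes.
    """
    return {
        f"{char_dir}/{major}:{minor}": f"../{name}"
        for name, (major, minor) in device_nodes.items()
    }

plan = plan_dev_char_symlinks({"nvidia0": (195, 0), "nvidiactl": (195, 255)})
print(plan["/dev/char/195:0"])  # ../nvidia0
```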

klueska commented 1 year ago

A tool will be shipping with the next release of the NVIDIA Container Toolkit later today. I'll update here with instructions (or point at the official documentation if it's ready by then).

saeejithnair commented 1 year ago

> A tool will be shipping with the next release of the nvidia container toolkit later today. I'll update here with instructions (or point at the official documentation if it's ready by then).

Hi @klueska, can you point me to the tool/instructions for resolving this issue? Thanks!

klueska commented 1 year ago
  1. Using the nvidia-ctk utility: The NVIDIA Container Toolkit v1.12.0 includes a utility for creating symlinks in /dev/char for all possible NVIDIA device nodes required for using GPUs in containers. This can be run as follows:
    sudo nvidia-ctk system create-dev-char-symlinks \
    --create-all

    This command should be configured to run at boot on each node where GPUs will be used in containers. It requires that the NVIDIA driver kernel modules have been loaded at the point where it is run.

A simple udev rule to enforce this can be seen below:

# This will create /dev/char symlinks to all device nodes
ACTION=="add", DEVPATH=="/bus/pci/drivers/nvidia", RUN+="/usr/bin/nvidia-ctk system create-dev-char-symlinks --create-all"

A good place to install this rule would be: /lib/udev/rules.d/71-nvidia-dev-char.rules

In cases where the NVIDIA GPU Driver Container is used, the path to the driver installation must be specified. In this case the command should be modified to:

sudo nvidia-ctk system create-dev-char-symlinks \
--create-all \
--driver-root={{NVIDIA_DRIVER_ROOT}}

Where {{NVIDIA_DRIVER_ROOT}} is the path to which the NVIDIA GPU Driver container installs the NVIDIA GPU driver and creates the NVIDIA Device Nodes.

  2. Explicitly disabling systemd cgroup management in Docker: Set the parameter "exec-opts": ["native.cgroupdriver=cgroupfs"] in the /etc/docker/daemon.json file and restart Docker.

  3. Downgrading to docker.io packages where systemd is not the default cgroup manager (and not overriding that, of course).

mbentley commented 1 year ago

I'm going down the route of option 1, using nvidia-ctk, as I am running standalone Docker on Debian 11 (bullseye). I've added a udev rule; I haven't rebooted to see if it runs, but I have manually executed nvidia-ctk system create-dev-char-symlinks --create-all and it created the symlinks. I'm using the driver packages directly from the Debian repos, not the GPU Driver container. If I run systemctl daemon-reload, it still triggers the same behavior as before, where I see Failed to initialize NVML: Unknown Error messages. I've re-created my GPU containers. Is there something I am missing? Does Docker need to be restarted, or does something else specific need to happen, beyond the kernel modules being loaded before creating the symlinks?

klueska commented 1 year ago

Can you show me your docker command?

Note: this does not address the issue where you still need to explicitly pass the device nodes for /dev/nvidia0, /dev/nvidia1, /dev/nvidiactl on the command line (that won’t be fixed until CDI support is added to docker).

This fixes the issue where — even if you do explicitly pass the device nodes — you STILL lose access to the GPUs on a systemctl daemon reload.

mbentley commented 1 year ago

Sure thing:

docker run -d \
  --restart unless-stopped \
  --name nvidia-smi-rest \
  --gpus 'all,"capabilities=utility"' \
  --cpus 1 \
  --memory 1g \
  --memory-swap 1.5g \
  mbentley/nvidia-smi-rest

/etc/docker/daemon.json:

{
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "storage-driver": "overlay2"
}
klueska commented 1 year ago

With systemd cgroup management you must always pass the nvidia device nodes on the docker command line (which you are not doing).

Meaning you would need to run:

docker run -d \
  --restart unless-stopped \
  --name nvidia-smi-rest \
  --gpus 'all,"capabilities=utility"' \
  --device /dev/nvidiactl \
  --device /dev/nvidia0 \
  ...
  --cpus 1 \
  --memory 1g \
  --memory-swap 1.5g \
  mbentley/nvidia-smi-rest

This is due to the way GPU injection currently happens from within a runc hook when the --gpus flag is used. The hook manually sets up the cgroups for the NVIDIA devices behind the back of docker/containerd/runc -- so when a systemd daemon-reload happens, the cgroup access for these devices gets undone (because these runtimes had no way of telling systemd that the devices had been injected by the hook, and the reload triggers it to reevaluate all cgroup rules).

This issue only started to be noticed by most people recently because the latest release of docker flipped to using systemd cgroup management by default (as opposed to cgroupfs).

The good news is, once CDI support is added to docker, this won't be necessary anymore. https://github.com/docker/cli/issues/3864

mbentley commented 1 year ago

Ah, thanks @klueska - makes sense and works as expected. Thanks again!

cdrcnm commented 1 year ago

The fix with the /dev/char symlink creation works fine, thanks. But now we also need to set PASS_DEVICE_SPECS=true, which wasn't the case before. From the documentation it was only needed if we wanted to interoperate with the CPUManager in Kubernetes, and it requires deploying the daemonset with elevated privileges. Why is setting this var needed?

klueska commented 1 year ago

@cdrcnm yes, that is now necessary and the documentation should be updated. It's needed now for the same reasons described in my comment above: https://github.com/NVIDIA/nvidia-docker/issues/1671#issuecomment-1420855027.

Note: this is an unfortunate truth for the moment and will go away once CDI becomes the standard for device injection in containerized environments (and we update the device plugin to support CDI as well). CDI support has already been added to cri-o and containerd and we are in the process of making the nvidia device plugin CDI aware. Once all the pieces are in place we will update our documentation to instruct people on how to use it.

dcarrion87 commented 1 year ago

@klueska ran into this after we fixed a similar containerd/runc issue.

We're running Kubernetes on A100s where the DGXOS distribution doesn't bake in 1.12.X of the ctk.

Are there any other options that don't involve manual char device creation to get people over the line?

We'll probably end up upgrading the GPU Operator, but it's going to be a breaking change between the version we currently run and the version this suggests, so we're thinking about doing the workaround first and planning that out further.

dcarrion87 commented 1 year ago

Hmmm I did a little script to create the device links:

# Create /dev/char/<major>:<minor> symlinks for all NVIDIA char devices
BASE=/dev/char
for d in $( cd "$BASE" && find ../nvidia* -type c ); do
  MAJOR_HEX=$(stat -c %t "$BASE/$d")   # major number, printed in hex
  MINOR_HEX=$(stat -c %T "$BASE/$d")   # minor number, printed in hex
  MAJOR_DEC=$((16#$MAJOR_HEX))
  MINOR_DEC=$((16#$MINOR_HEX))
  ln -sf "$d" "/dev/char/${MAJOR_DEC}:${MINOR_DEC}"   # -f makes reruns idempotent
done
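The hex arithmetic in that loop (stat prints major/minor in hex, while the symlink names are decimal) can be checked with a small Python equivalent:

```python
def char_link_name(major_hex: str, minor_hex: str) -> str:
    """Convert `stat -c %t` / `stat -c %T` hex output to the decimal major:minor link name."""
    return f"{int(major_hex, 16)}:{int(minor_hex, 16)}"

# /dev/nvidiactl is typically major 0xc3 (195), minor 0xff (255)
print(char_link_name("c3", "ff"))  # 195:255
```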

Then bounced k3s / containerd and the container.

But still get this in the container after daemon-reload:

$ nvidia-smi
Failed to initialize NVML: Unknown Error

Our environment is running k3s with containerd and gpu operator 1.11.1. We use the accept-nvidia-visible-devices-as-volume-mounts feature of the container runtime on each host to allow a pod to share devices between containers in the same pod.

dcarrion87 commented 1 year ago

Actually, the symbolic links do work, but only for the container that originally gets the GPU devices.

It just drops out on the sidecar container, which shares the GPU by reading in the GPU devices from a config map that the main container writes to on startup. See here for how we use it: https://github.com/harrison-ai/cobalt-docker-rootless-nvidia-dind

Would I need to manually adjust an allow list so it doesn't drop the GPU devices on the sidecar when there's a daemon-reload? We actually don't care about cgroup control for these devices; it's just about soft-blocking them so users don't trip over each other.

dind container ( main):

cat /sys/fs/cgroup/devices/devices.list
b *:* m
c *:* m
c 1:3 rwm
c 1:5 rwm
c 1:7 rwm
c 1:8 rwm
c 1:9 rwm
c 5:0 rwm
c 5:2 rwm
c 10:200 rwm
c 136:* rwm
c 195:0 rw
c 195:254 rw
c 195:255 rw
c 511:0 rw
c 511:1 rw

workspace container (secondary):

cat /sys/fs/cgroup/devices/devices.list
b *:* m
c *:* m
c 1:3 rwm
c 1:5 rwm
c 1:7 rwm
c 1:8 rwm
c 1:9 rwm
c 5:0 rwm
c 5:2 rwm
c 10:200 rwm
c 136:* rwm

I can manually echo into the cgroup devices.allow and things start working again but that's not ideal.
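The manual re-allow amounts to writing cgroup v1 device rules like the ones missing from the second listing above. A sketch (the cgroup path in the comment is hypothetical; it would be the sidecar's actual cgroup directory):

```shell
# Compose a cgroup v1 devices.allow rule for an NVIDIA char device
MAJOR=195
MINOR=0
ENTRY="c ${MAJOR}:${MINOR} rw"
echo "${ENTRY}"
# On the node, this would then be written into the sidecar's cgroup, e.g.:
#   echo "${ENTRY}" | sudo tee /sys/fs/cgroup/devices/kubepods.slice/<pod>/<ctr>/devices.allow
```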

gaopeiliang commented 1 year ago

We found this case too when we upgraded from Ubuntu 16 (kernel 4.9) to Ubuntu 20 (kernel 5.4)!

docker version 20.10.7, containerd version 1.4.6, runc version rc95, native.cgroupdriver=systemd (Docker and k8s have recommended it for a long time; I think most clusters use it)

Nothing else changed, so why does systemctl daemon-reload make containers lose their GPU devices?

I noticed the systemd version discussed in the latest runc issue https://github.com/opencontainers/runc/issues/3708

Going from Ubuntu 16 (kernel 4.9) to Ubuntu 20 (kernel 5.4), systemd was upgraded from version 229 to 245!

ubuntu 16  (kernel 4.9)    systemd 229    cgroup v1

ubuntu 20  (kernel 5.4)    systemd 245    cgroup v1 (default-hierarchy=hybrid)

So there are 3 main factors:

  1. the device plugin --pass-device-specs option
  2. the runc version
  3. the systemd version

I tested the same case and found that different systemd versions handle the scope's device cgroup config differently on daemon-reload:

  1. systemd has device A in DeviceAllow, device A cannot be found with stat(2), the cgroup allows device A; on systemctl daemon-reload: a. systemd 229 clears cgroup device A, b. systemd 245 does nothing

  2. systemd has device A in DeviceAllow, device A can be found with stat(2), the cgroup allows device A; on systemctl daemon-reload: a. systemd 229 does nothing, b. systemd 245 does nothing

  3. systemd does not have device A in DeviceAllow (whether stat(2) finds it does not matter), the cgroup allows device A; on systemctl daemon-reload: a. systemd 229 does nothing, b. systemd 245 clears cgroup device A
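The three cases can be condensed into a small predicate. This is just my reading of the observations in the list above, not an authoritative model of systemd:

```python
def reload_clears_device(in_device_allow: bool, stat_finds_it: bool,
                         systemd_version: int) -> bool:
    """Does `systemctl daemon-reload` drop the device from the cgroup?

    in_device_allow: device is listed in the scope's DeviceAllow= config
    stat_finds_it:   the device path resolves via stat(2), e.g. /dev/char/X:Y exists
    """
    if in_device_allow and not stat_finds_it:   # case 1
        return systemd_version < 240
    if in_device_allow and stat_finds_it:       # case 2
        return False
    return systemd_version >= 240               # case 3

# systemd 245 with --pass-device-specs=false (device never told to systemd): cleared
print(reload_clears_device(False, False, 245))  # True
```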

Given these systemd differences:

our k8s cluster with --pass-device-specs=false, systemd 229, and runc rc95 hits case 3, so systemctl daemon-reload works fine! But after we upgraded to systemd 245, systemctl daemon-reload breaks the container device list.


Of course, how different runc versions sync devices with systemd makes this issue even more mysterious! E.g.:

  1. before runc rc92, runc did not sync devices with systemd at all
  2. should runc add a non-existent device path to systemd? https://github.com/opencontainers/runc/issues/3671
    (this has been fixed via https://github.com/opencontainers/runc/issues/3708 by checking for systemd version 240; maybe something changed starting with systemd 240)


To make it clearer, I drew a map of it, maybe it helps:

[image]

elezar commented 1 year ago

There is an issue open against runc, discussed here: https://github.com/opencontainers/runc/issues/3708#issuecomment-1523533029, that also covers this. According to the author, fixes were merged into both main and release-1.1. Do your experiments contain these fixes?

I verified them yesterday, although I always passed device nodes in my tests.

gaopeiliang commented 1 year ago

> There is an issue out against runc discussed here opencontainers/runc#3708 (comment) that also discusses this. According to the author there were fixes merged into both main and release-1.1. Do your experiments contain these fixes?
>
> I verified them yesterday, although I always passed device nodes in my tests.

The newly released runc 1.1.7 fixes how /dev/char/xx is handled, whether it exists or not.

With these new fixes:

| `pass-device` | `/dev/char/xx` | systemd version | `daemon-reload` result |
| --- | --- | --- | --- |
| true | not existing | 229 (< 240) | success |
| true | not existing | 245 (>= 240) | success |
| true | existing | 229 (< 240) | success |
| true | existing | 245 (>= 240) | success |

So with the pass-device=true option, there is no need for the NVIDIA GPU driver to create the /dev/char/xx links;

but with pass-device=false and systemd 245 (>= 240), all runc versions (>= rc92) fail on reload!


Updated map for the new runc version (1.1.7):

[image]