NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1 #154

Open Dan-Burns opened 1 year ago

Dan-Burns commented 1 year ago

Hello,

I tried the different combinations of conda and pip packages that people suggest to get TensorFlow running on the RTX 30 series. I thought it was working after exercising the GPU with the Keras tutorial code, but when I moved to a different type of model something apparently broke.

Now I'm trying the Docker route:

docker run --gpus all -it --rm nvcr.io/nvidia/tensorflow:22.11-tf2-py3
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.

There also seem to be a lot of missing libraries.

3. Information to attach (optional if deemed irrelevant)

Device Index:   0
Device Minor:   0
Model:          NVIDIA GeForce RTX 3090 Ti
Brand:          GeForce
GPU UUID:       GPU-ba9fdcdb-8a2b-d2b6-f69c-5f2ac08dde8b
Bus Location:   00000000:01:00.0
Architecture:   8.6
I1202 15:15:34.468151 26518 nvc.c:434] shutting down library context
I1202 15:15:34.468317 26521 rpc.c:95] terminating nvcgo rpc service
I1202 15:15:34.469397 26518 rpc.c:132] nvcgo rpc service terminated successfully
I1202 15:15:34.474156 26520 rpc.c:95] terminating driver rpc service
I1202 15:15:34.474599 26518 rpc.c:132] driver rpc service terminated successfully

Timestamp      : Fri Dec 2 09:17:13 2022
Driver Version : 520.56.06
CUDA Version   : 11.8

Attached GPUs : 1 GPU 00000000:01:00.0 Product Name : NVIDIA GeForce RTX 3090 Ti Product Brand : GeForce Product Architecture : Ampere Display Mode : Enabled Display Active : Enabled Persistence Mode : Enabled MIG Mode Current : N/A Pending : N/A Accounting Mode : Disabled Accounting Mode Buffer Size : 4000 Driver Model Current : N/A Pending : N/A Serial Number : N/A GPU UUID : GPU-ba9fdcdb-8a2b-d2b6-f69c-5f2ac08dde8b Minor Number : 0 VBIOS Version : 94.02.A0.00.2D MultiGPU Board : No Board ID : 0x100 GPU Part Number : N/A Module ID : 0 Inforom Version Image Version : G002.0000.00.03 OEM Object : 2.0 ECC Object : 6.16 Power Management Object : N/A GPU Operation Mode Current : N/A Pending : N/A GSP Firmware Version : N/A GPU Virtualization Mode Virtualization Mode : None Host VGPU Mode : N/A IBMNPU Relaxed Ordering Mode : N/A PCI Bus : 0x01 Device : 0x00 Domain : 0x0000 Device Id : 0x220310DE Bus Id : 00000000:01:00.0 Sub System Id : 0x88701043 GPU Link Info PCIe Generation Max : 4 Current : 1 Link Width Max : 16x Current : 16x Bridge Chip Type : N/A Firmware : N/A Replays Since Reset : 0 Replay Number Rollovers : 0 Tx Throughput : 1000 KB/s Rx Throughput : 0 KB/s Fan Speed : 0 % Performance State : P8 Clocks Throttle Reasons Idle : Active Applications Clocks Setting : Not Active SW Power Cap : Not Active HW Slowdown : Not Active HW Thermal Slowdown : Not Active HW Power Brake Slowdown : Not Active Sync Boost : Not Active SW Thermal Slowdown : Not Active Display Clock Setting : Not Active FB Memory Usage Total : 24564 MiB Reserved : 310 MiB Used : 510 MiB Free : 23742 MiB BAR1 Memory Usage Total : 256 MiB Used : 13 MiB Free : 243 MiB Compute Mode : Default Utilization Gpu : 6 % Memory : 5 % Encoder : 0 % Decoder : 0 % Encoder Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 FBC Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 Ecc Mode Current : Disabled Pending : Disabled ECC Errors Volatile SRAM Correctable : N/A SRAM Uncorrectable : N/A DRAM Correctable : N/A DRAM Uncorrectable : N/A Aggregate SRAM Correctable : N/A SRAM Uncorrectable : N/A DRAM Correctable : N/A DRAM Uncorrectable : N/A Retired Pages Single Bit ECC : N/A Double Bit ECC : N/A Pending Page Blacklist : N/A Remapped Rows Correctable Error : 0 Uncorrectable Error : 0 Pending : No Remapping Failure Occurred : No Bank Remap Availability Histogram Max : 192 bank(s) High : 0 bank(s) Partial : 0 bank(s) Low : 0 bank(s) None : 0 bank(s) Temperature GPU Current Temp : 36 C GPU Shutdown Temp : 97 C GPU Slowdown Temp : 94 C GPU Max Operating Temp : 92 C GPU Target Temperature : 83 C Memory Current Temp : N/A Memory Max Operating Temp : N/A Power Readings Power Management : Supported Power Draw : 32.45 W Power Limit : 480.00 W Default Power Limit : 480.00 W Enforced Power Limit : 480.00 W Min Power Limit : 100.00 W Max Power Limit : 516.00 W Clocks Graphics : 210 MHz SM : 210 MHz Memory : 405 MHz Video : 555 MHz Applications Clocks Graphics : N/A Memory : N/A Default Applications Clocks Graphics : N/A Memory : N/A Max Clocks Graphics : 2115 MHz SM : 2115 MHz Memory : 10501 MHz Video : 1950 MHz Max Customer Boost Clocks Graphics : N/A Clock Policy Auto Boost : N/A Auto Boost Default : N/A Voltage Graphics : 740.000 mV Processes GPU instance ID : N/A Compute instance ID : N/A Process ID : 2283 Type : G Name : /usr/lib/xorg/Xorg Used GPU Memory : 259 MiB GPU instance ID : N/A Compute instance ID : N/A Process ID : 2441 Type : G Name : /usr/bin/gnome-shell Used GPU Memory : 52 MiB GPU instance ID : N/A Compute 
instance ID : N/A Process ID : 3320 Type : G Name : /opt/docker-desktop/Docker Desktop --type=gpu-process --enable-crashpad --enable-crash-reporter=46721d59-e3cc-4241-8f96-57bab71f8674,no_channel --user-data-dir=/home/kanaka/.config/Docker Desktop --gpu-preferences=WAAAAAAAAAAgAAAIAAAAAAAAAAAAAAAAAABgAAAAAAA4AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAIAAAAAAAAAABAAAAAAAAAAgAAAAAAAAACAAAAAAAAAAIAAAAAAAAAA== --shared-files --field-trial-handle=0,i,777493636119283380,17735576311253417080,131072 --disable-features=SpareRendererForSitePerProcess Used GPU Memory : 27 MiB GPU instance ID : N/A Compute instance ID : N/A Process ID : 4402 Type : C+G Name : /opt/google/chrome/chrome --type=gpu-process --enable-crashpad --crashpad-handler-pid=4367 --enable-crash-reporter=, --change-stack-guard-on-fork=enable --gpu-preferences=WAAAAAAAAAAgAAAIAAAAAAAAAAAAAAAAAABgAAEAAAA4AAAAAAAAAAEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAIAAAAAAAAAABAAAAAAAAAAgAAAAAAAAACAAAAAAAAAAIAAAAAAAAAA== --shared-files --field-trial-handle=0,i,1352372760819385498,10632265477078674372,131072 Used GPU Memory : 166 MiB

Server: Docker Desktop 4.15.0 (93002)
 Engine:
  Version:          20.10.21
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.18.7
  Git commit:       3056208
  Built:            Tue Oct 25 18:00:19 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.10
  GitCommit:        770bd0108c32f3fb5c73ae1264f7e503fe7b2661
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

elezar commented 1 year ago

The toolkit explicitly looks for libnvidia-ml.so.1, which should be symlinked to libnvidia-ml.so.<DRIVER_VERSION> after running ldconfig on your host. Since nvidia-smi works (and also uses libnvidia-ml.so.1), I would not expect this to be the case.

How is Docker installed? Could it be that it is installed as a snap and cannot load the system libraries because of this?
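For example, on the host you could check (the library path here assumes a standard Ubuntu driver install):

ldconfig -p | grep libnvidia-ml
ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1

# and check whether Docker came from a snap
snap list docker
which docker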

Dan-Burns commented 1 year ago

I installed docker-desktop after following the "docker engine" link on https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow

johndpope commented 1 year ago

Same problem on Ubuntu 22.04.

Linux msi 5.15.0-56-generic #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

docker desktop

can you unpack this?

The toolkit explicitly looks for libnvidia-ml.so.1, which should be symlinked to libnvidia-ml.so.<DRIVER_VERSION> after running ldconfig on your host. Since nvidia-smi works (and also uses libnvidia-ml.so.1), I would not expect this to be the case.

How is Docker installed? Could it be that it is installed as a snap and cannot load the system libraries because of this?

I installed it with:

sudo apt-get install -y nvidia-docker2

It installed successfully; apt reports that nvidia-docker2 is already the newest version (2.11.0-1).

Mon Dec  5 18:59:03 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.11    Driver Version: 525.60.11    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   57C    P8    29W / 370W |   1010MiB / 24576MiB |      6%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1515      G   /usr/lib/xorg/Xorg                548MiB |
|    0   N/A  N/A      1649      G   /usr/bin/gnome-shell              234MiB |
|    0   N/A  N/A     19695      G   ...RendererForSitePerProcess       32MiB |
|    0   N/A  N/A     19769    C+G   ...192290595412440874,131072      191MiB |
+-----------------------------------------------------------------------------+

nvidia-container-cli -k -d /dev/tty info

-- WARNING, the following logs are for debugging purposes only --

I1205 08:00:00.132727 24945 nvc.c:376] initializing library context (version=1.11.0, build=c8f267be0bac1c654d59ad4ea5df907141149977)
I1205 08:00:00.132797 24945 nvc.c:350] using root /
I1205 08:00:00.132806 24945 nvc.c:351] using ldcache /etc/ld.so.cache
I1205 08:00:00.132819 24945 nvc.c:352] using unprivileged user 29999:29999
I1205 08:00:00.132844 24945 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I1205 08:00:00.133009 24945 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W1205 08:00:00.134346 24946 nvc.c:273] failed to set inheritable capabilities
W1205 08:00:00.134424 24946 nvc.c:274] skipping kernel modules load due to failure
I1205 08:00:00.134891 24947 rpc.c:71] starting driver rpc service
I1205 08:00:00.142782 24948 rpc.c:71] starting nvcgo rpc service
I1205 08:00:00.143811 24945 nvc_info.c:766] requesting driver information with ''
I1205 08:00:00.145644 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.525.60.11
I1205 08:00:00.145731 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.525.60.11
I1205 08:00:00.145778 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.525.60.11
I1205 08:00:00.145821 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.525.60.11
I1205 08:00:00.145877 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.525.60.11
I1205 08:00:00.145930 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.525.60.11
I1205 08:00:00.145970 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.525.60.11
I1205 08:00:00.146007 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.525.60.11
I1205 08:00:00.146066 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.525.60.11
I1205 08:00:00.146105 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.525.60.11
I1205 08:00:00.146144 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.525.60.11
I1205 08:00:00.146183 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.525.60.11
I1205 08:00:00.146236 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.525.60.11
I1205 08:00:00.146288 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.525.60.11
I1205 08:00:00.146325 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.525.60.11
I1205 08:00:00.146366 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.525.60.11
I1205 08:00:00.146418 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.525.60.11
I1205 08:00:00.146475 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.525.60.11
I1205 08:00:00.146752 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcudadebugger.so.525.60.11
I1205 08:00:00.146788 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.525.60.11
I1205 08:00:00.146943 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.525.60.11
I1205 08:00:00.146977 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.525.60.11
I1205 08:00:00.147011 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.525.60.11
I1205 08:00:00.147046 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.525.60.11
I1205 08:00:00.147106 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-tls.so.525.60.11
I1205 08:00:00.147140 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ptxjitcompiler.so.525.60.11
I1205 08:00:00.147186 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-opticalflow.so.525.60.11
I1205 08:00:00.147236 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-opencl.so.525.60.11
I1205 08:00:00.147271 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ml.so.525.60.11
I1205 08:00:00.147319 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glvkspirv.so.525.60.11
I1205 08:00:00.147350 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glsi.so.525.60.11
I1205 08:00:00.147385 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glcore.so.525.60.11
I1205 08:00:00.147417 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-fbc.so.525.60.11
I1205 08:00:00.147465 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-encode.so.525.60.11
I1205 08:00:00.147515 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-eglcore.so.525.60.11
I1205 08:00:00.147547 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-compiler.so.525.60.11
I1205 08:00:00.147582 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvcuvid.so.525.60.11
I1205 08:00:00.147649 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libcuda.so.525.60.11
I1205 08:00:00.147707 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLX_nvidia.so.525.60.11
I1205 08:00:00.147741 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLESv2_nvidia.so.525.60.11
I1205 08:00:00.147775 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLESv1_CM_nvidia.so.525.60.11
I1205 08:00:00.147811 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libEGL_nvidia.so.525.60.11
W1205 08:00:00.147830 24945 nvc_info.c:399] missing library libnvidia-nscq.so
W1205 08:00:00.147836 24945 nvc_info.c:399] missing library libnvidia-fatbinaryloader.so
W1205 08:00:00.147842 24945 nvc_info.c:399] missing library libnvidia-pkcs11.so
W1205 08:00:00.147847 24945 nvc_info.c:399] missing library libvdpau_nvidia.so
W1205 08:00:00.147854 24945 nvc_info.c:399] missing library libnvidia-ifr.so
W1205 08:00:00.147859 24945 nvc_info.c:399] missing library libnvidia-cbl.so
W1205 08:00:00.147867 24945 nvc_info.c:403] missing compat32 library libnvidia-cfg.so
W1205 08:00:00.147873 24945 nvc_info.c:403] missing compat32 library libnvidia-nscq.so
W1205 08:00:00.147878 24945 nvc_info.c:403] missing compat32 library libcudadebugger.so
W1205 08:00:00.147887 24945 nvc_info.c:403] missing compat32 library libnvidia-fatbinaryloader.so
W1205 08:00:00.147893 24945 nvc_info.c:403] missing compat32 library libnvidia-allocator.so
W1205 08:00:00.147899 24945 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so
W1205 08:00:00.147904 24945 nvc_info.c:403] missing compat32 library libnvidia-ngx.so
W1205 08:00:00.147910 24945 nvc_info.c:403] missing compat32 library libvdpau_nvidia.so
W1205 08:00:00.147916 24945 nvc_info.c:403] missing compat32 library libnvidia-ifr.so
W1205 08:00:00.147921 24945 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so
W1205 08:00:00.147926 24945 nvc_info.c:403] missing compat32 library libnvoptix.so
W1205 08:00:00.147932 24945 nvc_info.c:403] missing compat32 library libnvidia-cbl.so
I1205 08:00:00.148532 24945 nvc_info.c:299] selecting /usr/bin/nvidia-smi
I1205 08:00:00.148551 24945 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump
I1205 08:00:00.148569 24945 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced
I1205 08:00:00.148598 24945 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control
I1205 08:00:00.148615 24945 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-server
W1205 08:00:00.148707 24945 nvc_info.c:425] missing binary nv-fabricmanager
W1205 08:00:00.148735 24945 nvc_info.c:349] missing firmware path /lib/firmware/nvidia/525.60.11/gsp.bin
I1205 08:00:00.148762 24945 nvc_info.c:529] listing device /dev/nvidiactl
I1205 08:00:00.148767 24945 nvc_info.c:529] listing device /dev/nvidia-uvm
I1205 08:00:00.148775 24945 nvc_info.c:529] listing device /dev/nvidia-uvm-tools
I1205 08:00:00.148781 24945 nvc_info.c:529] listing device /dev/nvidia-modeset
I1205 08:00:00.148809 24945 nvc_info.c:343] listing ipc path /run/nvidia-persistenced/socket
W1205 08:00:00.148831 24945 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket
W1205 08:00:00.148847 24945 nvc_info.c:349] missing ipc path /tmp/nvidia-mps
I1205 08:00:00.148851 24945 nvc_info.c:822] requesting device information with ''
I1205 08:00:00.155221 24945 nvc_info.c:713] listing device /dev/nvidia0 (GPU-94c5d11e-e574-eefc-2db6-08e204f9e1a4 at 00000000:01:00.0)
NVRM version:   525.60.11
CUDA version:   12.0

Device Index:   0
Device Minor:   0
Model:          NVIDIA GeForce RTX 3090
Brand:          GeForce
GPU UUID:       GPU-94c5d11e-e574-eefc-2db6-08e204f9e1a4
Bus Location:   00000000:01:00.0
Architecture:   8.6
I1205 08:00:00.155235 24945 nvc.c:434] shutting down library context
I1205 08:00:00.155296 24948 rpc.c:95] terminating nvcgo rpc service
I1205 08:00:00.155542 24945 rpc.c:135] nvcgo rpc service terminated successfully
I1205 08:00:00.156623 24947 rpc.c:95] terminating driver rpc service
I1205 08:00:00.156671 24945 rpc.c:135] driver rpc service terminated successfully
johndpope commented 1 year ago

[Screenshot from 2022-12-05 22-35-34] Not sure it helps, but I had originally installed the driver from CUDA 11.8; then when I installed nvidia-docker2 the driver broke, so I reverted back to the system (auto-install) driver.

UPDATE

Reading through the docs at https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html, this command works fine:


sudo docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
Unable to find image 'nvidia/cuda:11.8.0-base-ubuntu22.04' locally
11.8.0-base-ubuntu22.04: Pulling from nvidia/cuda
301a8b74f71f: Already exists 
35985d37d899: Already exists 
5b7513e7876e: Already exists 
bbf319bc026c: Already exists 
da5c9c5d5ac3: Already exists 
Digest: sha256:83493b3f150cc23f91fb0d2509e491204e33f062355d401662389a80a9091b82
Status: Downloaded newer image for nvidia/cuda:11.8.0-base-ubuntu22.04
Mon Dec  5 23:05:46 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.11    Driver Version: 525.60.11    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   44C    P8    25W / 370W |    995MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

OK, so it's basically only a problem when running without sudo:


docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi 
Unable to find image 'nvidia/cuda:11.8.0-base-ubuntu22.04' locally
11.8.0-base-ubuntu22.04: Pulling from nvidia/cuda
301a8b74f71f: Already exists 
35985d37d899: Already exists 
5b7513e7876e: Already exists 
bbf319bc026c: Already exists 
da5c9c5d5ac3: Already exists 
Digest: sha256:83493b3f150cc23f91fb0d2509e491204e33f062355d401662389a80a9091b82
Status: Downloaded newer image for nvidia/cuda:11.8.0-base-ubuntu22.04
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.

UPDATE - FIXED. I don't know if this helps, but on my installation I had cudnn-local-repo-ubuntu2204-8.6.0.163_1.0-1_amd64.deb together with CUDA 11.8, which is an incorrect combination. I was using cog, which didn't surface the error and just assumed everything was working correctly. Updating to the latest cuDNN (cudnn-local-repo-ubuntu2204-8.7.0.84_1.0-1_amd64.deb) resolved my original issue.

groucho64738 commented 1 year ago

I'm having a similar issue on a system I'm using for K8s: no container that requires the NVIDIA drivers can run, failing with the same error (about libnvidia-ml.so.1). I'm not sure what specific steps broke it for me, though. I was able to reproduce the error message on the command line by running the CUDA container directly on our node: docker run --gpus=all --runtime=nvidia nvidia/cuda:11.8.0-base-ubuntu20.04 nvidia-smi

I created a debug log for nvidia-container-toolkit:

I1212 19:31:12.254613 192312 nvc.c:376] initializing library context (version=1.10.0, build=395fd41701117121f1fd04ada01e1d7e006a37ae)
I1212 19:31:12.254655 192312 nvc.c:350] using root /run/nvidia/driver
I1212 19:31:12.254660 192312 nvc.c:351] using ldcache /etc/ld.so.cache
I1212 19:31:12.254665 192312 nvc.c:352] using unprivileged user 65534:65534
I1212 19:31:12.254683 192312 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I1212 19:31:12.254778 192312 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
I1212 19:31:12.267767 192321 nvc.c:278] loading kernel module nvidia
E1212 19:31:12.267872 192321 nvc.c:280] could not load kernel module nvidia
I1212 19:31:12.267883 192321 nvc.c:296] loading kernel module nvidia_uvm
E1212 19:31:12.267904 192321 nvc.c:298] could not load kernel module nvidia_uvm
I1212 19:31:12.267914 192321 nvc.c:305] loading kernel module nvidia_modeset
E1212 19:31:12.267934 192321 nvc.c:307] could not load kernel module nvidia_modeset
I1212 19:31:12.268200 192322 rpc.c:71] starting driver rpc service
I1212 19:31:12.268825 192312 rpc.c:135] driver rpc service terminated with signal 15
I1212 19:31:12.268870 192312 nvc.c:434] shutting down library context

Not a lot of help there. If I run nvidia-container-cli -k -d /dev/tty info I get a list of all of the modules and libraries, so that functions. I've tried running the container in privileged mode as well and still get the same error. Each time, I'm root when trying to kick off the container to eliminate that piece as well.

This is an Ubuntu 20.04 system running Docker 20.10.18. I've followed the installation directions in the install guide (pretty straightforward to follow).

If there are any suggestions of what else to try to debug, I'm willing to give them a try. This has been a real headache.

groucho64738 commented 1 year ago

I actually managed to fix this. At some point in time we had uncommented the option root = "/run/nvidia/driver" in /etc/nvidia-container-runtime/config.toml (must have seen directions on this somewhere). My best guess is that we had updated something on the system that made this no longer a viable option, and after a reboot everything stopped working. I commented out that option and everything popped up.
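For reference, the relevant part of /etc/nvidia-container-runtime/config.toml after the fix looks roughly like this (a sketch only; leave root commented out unless the driver really is installed under that path):

[nvidia-container-cli]
#root = "/run/nvidia/driver"
ldcache = "/etc/ld.so.cache"
load-kmods = true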

To find it, I created a wrapper around nvidia-container-cli:

#!/bin/bash
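# the original binary was renamed to nvidia-container-cli.real; this wrapper
# records the arguments it receives before delegating to the real CLI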

echo "$@" > /var/tmp/debuginfo
/usr/bin/nvidia-container-cli.real "$@"

That showed me the options being passed on a working system and a non-working system.

Not working:

--root=/run/nvidia/driver --load-kmods --debug=/var/log/nvidia-container-toolkit.log configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=11.8 brand=tesla,driver>=450,driver<451 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=unknown,driver>=510,driver<511 brand=nvidia,driver>=510,driver<511 brand=nvidiartx,driver>=510,driver<511 brand=geforce,driver>=510,driver<511 brand=geforcertx,driver>=510,driver<511 brand=quadro,driver>=510,driver<511 brand=quadrortx,driver>=510,driver<511 brand=titan,driver>=510,driver<511 brand=titanrtx,driver>=510,driver<511 brand=unknown,driver>=515,driver<516 brand=nvidia,driver>=515,driver<516 brand=nvidiartx,driver>=515,driver<516 brand=geforce,driver>=515,driver<516 brand=geforcertx,driver>=515,driver<516 brand=quadro,driver>=515,driver<516 brand=quadrortx,driver>=515,driver<516 brand=titan,driver>=515,driver<516 brand=titanrtx,driver>=515,driver<516 --pid=3895576 /var/lib/docker/overlay2/47f7deb4479aa6b8c26f3b6e3ad4a2cd9bd86304736bf9aed68ed4127fbc0d00/merged

Working:

--load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=11.8 brand=tesla,driver>=450,driver<451 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=unknown,driver>=510,driver<511 brand=nvidia,driver>=510,driver<511 brand=nvidiartx,driver>=510,driver<511 brand=geforce,driver>=510,driver<511 brand=geforcertx,driver>=510,driver<511 brand=quadro,driver>=510,driver<511 brand=quadrortx,driver>=510,driver<511 brand=titan,driver>=510,driver<511 brand=titanrtx,driver>=510,driver<511 brand=unknown,driver>=515,driver<516 brand=nvidia,driver>=515,driver<516 brand=nvidiartx,driver>=515,driver<516 brand=geforce,driver>=515,driver<516 brand=geforcertx,driver>=515,driver<516 brand=quadro,driver>=515,driver<516 brand=quadrortx,driver>=515,driver<516 brand=titan,driver>=515,driver<516 brand=titanrtx,driver>=515,driver<516 --pid=2830327 /var/lib/docker/overlay2/59206c16f5a12eadbe2e42287a7ff6aa3559b0666048d7578b29df90e3755d50/merged
johndpope commented 1 year ago

From the look of it, the first line is 470 and the second is 511. It does seem like everything can be working fine and then Ubuntu automatically changes the driver, leaving a broken system. I recommend using timeshift (https://github.com/linuxmint/timeshift) to create a snapshot whenever everything is working (after a new driver / CUDA update etc.); it's trivial to roll back to a working snapshot and you won't lose any personal files.
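For example (assuming timeshift is already installed and configured for your disk layout):

sudo timeshift --create --comments "working nvidia driver + cuda + docker"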

ThatCooperLewis commented 1 year ago

Trying to build containers on Arch here, installed Docker through docker-desktop originally, but I've also installed nvidia-docker, cuda, cuda-tools, cudnn, and nvidia-container-toolkit on the host machine in an attempt to resolve this.

The only workaround I've found so far is to run docker as root. That resolves this specific issue but, of course, I'd rather not be forced to run all my docker commands via sudo (also Docker Desktop fails to recognize those containers/images).

Some relevant outputs from host machine:

$ ldconfig -p | grep cuda         

        libicudata.so.72 (libc6,x86-64) => /usr/lib/libicudata.so.72
        libicudata.so.72 (ELF) => /usr/lib32/libicudata.so.72
        libicudata.so (libc6,x86-64) => /usr/lib/libicudata.so
        libicudata.so (ELF) => /usr/lib32/libicudata.so
        libcuda.so.1 (libc6,x86-64) => /usr/lib/libcuda.so.1
        libcuda.so.1 (libc6) => /usr/lib32/libcuda.so.1
        libcuda.so (libc6,x86-64) => /usr/lib/libcuda.so
        libcuda.so (libc6) => /usr/lib32/libcuda.so
$ nvidia-smi
Wed Dec 14 15:20:49 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.11    Driver Version: 525.60.11    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:0C:00.0  On |                  N/A |
|  0%   32C    P8    30W / 370W |    937MiB / 10240MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
$ uname -a
6.0.12-arch1-1 #1 SMP PREEMPT_DYNAMIC Thu, 08 Dec 2022 11:03:38 +0000 x86_64 GNU/Linux
ThatCooperLewis commented 1 year ago

I changed distros and still had a very similar issue with Docker Desktop + nvidia-docker together. But adding this workaround to the nvidia runtime config seemed to fix things for me. [UPDATE: It does not]

$ vi /etc/nvidia-container-runtime/config.toml

no-cgroups = true

Unsure of whether this was the cause in my old distro (EndeavorOS), but I will try to confirm later.

shiwakant commented 1 year ago

All instructions were helpful, but I had to start docker, docker build, and docker run with root privileges to make it work. Even after repeated attempts, I was unable to run with user-level permissions.

pochoi commented 1 year ago

All instructions were helpful, but I had to start docker, docker build, and docker run with root privileges to make it work. Even after repeated attempts, I was unable to run with user-level permissions.

I have the same error [nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1](https://github.com/NVIDIA/nvidia-container-toolkit/issues/154) when running docker without sudo.

Is there a way to get this working without sudo?

ThatCooperLewis commented 1 year ago

@shiwakant @pochoi You can get this working by avoiding Docker Desktop and instead setting up Docker Rootless Mode.
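Roughly, the setup looks like this (package names assume Debian/Ubuntu; see the Docker rootless docs for the full procedure, and note the toolkit usually also needs no-cgroups = true in its config when used rootless):

sudo apt-get install -y uidmap docker-ce-rootless-extras
dockerd-rootless-setuptool.sh install
export DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi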

lbadi commented 1 year ago

I know this might be a dumb answer, but I was having the same issues and they got fixed after I logged in with docker login ghcr.io -u *** --password-stdin.

turboazot commented 1 year ago

From my side I used:

sudo ldconfig

That worked for me. But in case you are using a Docker image with dind and nvidia-docker integration in it, execute this in the entrypoint script, otherwise it may not work.
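A minimal sketch of such an entrypoint, assuming your dind image's original entrypoint is dockerd-entrypoint.sh:

#!/bin/sh
# refresh the linker cache so libnvidia-ml.so.1 resolves before the daemon starts
ldconfig
exec dockerd-entrypoint.sh "$@"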

justmiles commented 1 year ago

I ran into this as well and was simply missing the nvidia-driver-<version> and nvidia-dkms-<version> packages. It would be worth double-checking that the actual NVIDIA drivers are installed.
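For example, on Ubuntu/Debian something like:

# check that the driver packages are installed and the kernel module is loaded
dpkg -l | grep -E 'nvidia-(driver|dkms)'
lsmod | grep nvidia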

pfcouto commented 1 year ago

Hello, can I bring up this topic again?

1. Issue or feature description

Upon running the command docker run --privileged --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi I get the error:

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.

2. Steps to reproduce the issue

I installed the NVIDIA driver following the Fedora docs, not NVIDIA's. So, for example, nvcc --version outputs an error saying that the nvcc command is not recognized, but on my host machine I can run nvidia-smi.

The commands I used to install nvidia are the following:

sudo dnf install akmod-nvidia
sudo dnf install xorg-x11-drv-nvidia-cuda

And as visible in the following image, I am able to run the command nvidia-smi on my host machine:

[screenshot of nvidia-smi output on the host]

I followed this guide on how to install nvidia-docker and did the following:

curl -s -L https://nvidia.github.io/libnvidia-container/centos8/libnvidia-container.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
##############################
sudo dnf install nvidia-docker2
# Edit /etc/nvidia-container-runtime/config.toml and disable cgroups:
no-cgroups = true

sudo reboot
##############################
sudo systemctl start docker.service
##############################
docker run --privileged --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

and upon running this docker command I get the error shown in section 1 above.

The thing is, I have the file that it says it is missing (check the following image), so maybe it is looking for it in a different directory?

[screenshot showing the library present on the host]

3. Information to attach (optional if deemed irrelevant)

uname -a:

Linux fedora 6.2.10-200.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Apr  6 23:30:41 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

docker version

Client: Docker Engine - Community
 Cloud integration: v1.0.31
 Version:           23.0.3
 API version:       1.41 (downgraded from 1.42)
 Go version:        go1.19.7
 Git commit:        3e7cbfd
 Built:             Tue Apr  4 22:10:33 2023
 OS/Arch:           linux/amd64
 Context:           desktop-linux

Server: Docker Desktop 4.18.0 (104112)
 Engine:
  Version:          20.10.24
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.19.7
  Git commit:       5d6db84
  Built:            Tue Apr  4 18:18:42 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.18
  GitCommit:        2456e983eb9e37e47538f59ea18f2043c9a73640
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

rpm -qa '*nvidia*'

 nvidia-gpu-firmware-20230310-148.fc37.noarch
xorg-x11-drv-nvidia-kmodsrc-530.41.03-1.fc37.x86_64
xorg-x11-drv-nvidia-cuda-libs-530.41.03-1.fc37.x86_64
xorg-x11-drv-nvidia-libs-530.41.03-1.fc37.x86_64
nvidia-settings-530.41.03-1.fc37.x86_64
xorg-x11-drv-nvidia-power-530.41.03-1.fc37.x86_64
xorg-x11-drv-nvidia-530.41.03-1.fc37.x86_64
akmod-nvidia-530.41.03-1.fc37.x86_64
kmod-nvidia-6.2.9-200.fc37.x86_64-530.41.03-1.fc37.x86_64
nvidia-persistenced-530.41.03-1.fc37.x86_64
xorg-x11-drv-nvidia-cuda-530.41.03-1.fc37.x86_64
xorg-x11-drv-nvidia-libs-530.41.03-1.fc37.i686
xorg-x11-drv-nvidia-cuda-libs-530.41.03-1.fc37.i686
kmod-nvidia-6.2.10-200.fc37.x86_64-530.41.03-1.fc37.x86_64
nvidia-container-toolkit-base-1.13.0-1.x86_64
libnvidia-container1-1.13.0-1.x86_64
libnvidia-container-tools-1.13.0-1.x86_64
nvidia-container-toolkit-1.13.0-1.x86_64
nvidia-docker2-2.13.0-1.noarch

nvidia-container-cli -V

cli-version: 1.13.0
lib-version: 1.13.0
build date: 2023-03-31T13:12+00:00
build revision: 20823911e978a50b33823a5783f92b6e345b241a
build compiler: gcc 8.5.0 20210514 (Red Hat 8.5.0-18)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

Thanks for your help!

elezar commented 1 year ago

@pfcouto and others that show this behaviour. Please enable debug logging for the nvidia-container-cli in the /etc/nvidia-container-toolkit/config.toml by uncommenting the #debug = line in that section.

Running a container should then generate a log at /var/log/nvidia-container-toolkit.log which may help to further debug this.
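For example, the relevant excerpt would look roughly like this (on many installs the file is at /etc/nvidia-container-runtime/config.toml instead):

[nvidia-container-cli]
debug = "/var/log/nvidia-container-toolkit.log"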

Note that the NVIDIA Container CLI needs to load libnvidia-ml.so.1 to retrieve the required information about the GPUs in the system. We have seen this behaviour when Docker Desktop is used, for example, since the hook is then executed in a VM that does not have access to the libraries and devices on the host. How is docker installed in this case?

Note, if you're able to install a recent version of podman, this could be an alternative as a CDI specification could be generated instead of relying on the nvidia-container-cli-based injection.
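A rough sketch of that route (nvidia-ctk ships with recent versions of the toolkit):

sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
podman run --rm --security-opt=label=disable --device nvidia.com/gpu=all ubuntu nvidia-smi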

pfcouto commented 1 year ago

Hello @elezar, I will do what you said. Thanks! I think one of the issues is that since I installed the NVIDIA drivers through RPM Fusion, the file is not in the default location: nvidia-docker is looking for the file in one location, and I have it in another. How can I change the docker image to access my file in a different location?

As shown in the picture, when I installed the Nvidia-Drivers through RPM it installed a flatpak, but the file is present. The thing is the file is not where nvidia-docker expects it to be.

Can I create my own Dockerfile, something like:

FROM nvidia/docker
COPY (my lib file location) (Docker image location)

Or just change the default location in local machine to the correct location just to test if it works.

[screenshot of the driver library locations]

elezar commented 1 year ago

@pfcouto if the drivers are at a different location than expected, you could look at setting the root option in the config.toml. We use this setting when running the toolkit using our driver container. In this case we install the driver (and device nodes) to /run/nvidia/driver and root = /run/nvidia/driver is specified in the config.
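That is, something like this in the [nvidia-container-cli] section (only if your driver libraries and device nodes really live under that prefix):

[nvidia-container-cli]
root = "/run/nvidia/driver"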

pfcouto commented 1 year ago

Hello @elezar. I do not have the folder /etc/nvidia-container-toolkit/ that you mentioned. However, I do have a folder nvidia-container-runtime which has a config.toml file, as shown in the picture. Is it OK for me to change in this file what you said to change in the other? Thanks!

[screenshot of /etc/nvidia-container-runtime/config.toml]

pfcouto commented 1 year ago

I did change the file as visible in the first picture: I uncommented the debug line and changed root to a directory where I have the libnvidia-ml.so.1 file (I don't know if I should have changed this, but I did). I ran the command docker run --privileged --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi and it output the same error:

Unable to find image 'nvidia/cuda:11.0.3-base-ubuntu20.04' locally
11.0.3-base-ubuntu20.04: Pulling from nvidia/cuda
d7bfe07ed847: Pull complete 
75eccf561042: Pull complete 
191419884744: Pull complete 
a17a942db7e1: Pull complete 
16156c70987f: Pull complete 
Digest: sha256:57455121f3393b7ed9e5a0bc2b046f57ee7187ea9ec562a7d17bf8c97174040d
Status: Downloaded newer image for nvidia/cuda:11.0.3-base-ubuntu20.04
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
ERRO[0003] error waiting for container:

Then I went ahead and tried to look into the log file but it was not created... cat /var/log/nvidia-container-toolkit.log:

cat: /var/log/nvidia-container-toolkit.log: No such file or directory

[screenshot of the terminal output]

pfcouto commented 1 year ago

Hi again @elezar, also, I don't have the folder /run/nvidia/driver

[screenshot]

lishoulong commented 1 year ago

Maybe CUDA and the NVIDIA container toolkit are just not installed?

elezar commented 1 year ago

Hi again @elezar, also, I don't have the folder /run/nvidia/driver

Sorry for the lack of clarity. I was using /run/nvidia/driver as an example of a path we use when installing the driver using our driver container. The NVIDIA Container Toolkit considers the root setting when looking for libnvidia-ml.so.1 (the standard lib paths are prepended), and if your installation has these libraries at a non-standard location this will help to locate them.

Since your output in one of your comments does show /usr/lib64/libnvidia-ml.so.1, could you confirm where this symlink points? (Your output also shows some flatpak location.)
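For example:

ls -l /usr/lib64/libnvidia-ml.so.1
readlink -f /usr/lib64/libnvidia-ml.so.1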

Could you link the Fedora docs you used to install the driver?

AskAlice commented 1 year ago

I have this issue unless I run as root. Using docker-desktop

❯ stat /usr/lib/libnvidia-ml.*
  File: /usr/lib/libnvidia-ml.so -> libnvidia-ml.so.1
  Size: 17              Blocks: 8          IO Block: 4096   symbolic link
Device: 0,26    Inode: 1753370     Links: 1
Access: (0777/lrwxrwxrwx)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2023-09-01 13:46:31.971672402 -0600
Modify: 2023-08-22 11:37:15.000000000 -0600
Change: 2023-08-26 22:24:35.978217529 -0600
 Birth: 2023-08-26 22:24:35.978217529 -0600
  File: /usr/lib/libnvidia-ml.so.1 -> libnvidia-ml.so.535.104.05
  Size: 26              Blocks: 8          IO Block: 4096   symbolic link
Device: 0,26    Inode: 1753371     Links: 1
Access: (0777/lrwxrwxrwx)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2023-09-01 13:46:31.971672402 -0600
Modify: 2023-08-22 11:37:15.000000000 -0600
Change: 2023-08-26 22:24:35.978217529 -0600
 Birth: 2023-08-26 22:24:35.978217529 -0600
  File: /usr/lib/libnvidia-ml.so.535.104.05
  Size: 1815872         Blocks: 3552       IO Block: 4096   regular file
Device: 0,26    Inode: 1753372     Links: 1
Access: (0777/-rwxrwxrwx)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2023-09-01 14:25:25.495551694 -0600
Modify: 2023-08-22 11:37:15.000000000 -0600
Change: 2023-09-01 14:24:06.427728737 -0600
 Birth: 2023-08-26 22:24:35.978217529 -0600

It also seems it is reproducible with the PKGBUILD I created here: https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/issues/17#note_1530784413

Here is my config.toml:

disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false

[nvidia-container-cli]
#root = "/run/nvidia/driver"
path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
ldcache = "/etc/ld.so.cache"
load-kmods = true
no-cgroups = false
#user = "root:video"
ldconfig = "/sbin/ldconfig"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"

# Specify the runtimes to consider. This list is processed in order and the PATH
# searched for matching executables unless the entry is an absolute path.
runtimes = [
    "docker-runc",
    "runc",
]

mode = "auto"

    [nvidia-container-runtime.modes.csv]

    mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
bkocis commented 1 year ago

I had the same issue. For me a reinstall of docker fixed the issue:

I ran this as a bash script:

sudo apt-get update

sudo apt install apt-transport-https ca-certificates curl software-properties-common

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu focal stable"

apt-cache policy docker-ce

sudo apt install docker-ce
RobQuistNL commented 11 months ago

Looks like this just doesn't work with docker desktop.

When you run the script that @bkocis shared, you're installing docker-ce, most likely alongside Docker Desktop. So the sudo version of docker runs the CE engine, and the regular one will use your Docker Desktop engine.

At least, this is what happens for me :)
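You can check which engine each invocation is talking to with docker contexts, for example:

docker context ls
sudo docker context ls
# 'docker context use default' points the non-sudo client back at the engine socket /var/run/docker.sock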

$ docker run --privileged --gpus all nvidia/cuda:12.2.2-runtime-ubuntu22.04 nvidia-smi
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
ERRO[0000] error waiting for container:                 
$ sudo docker run --privileged --gpus all nvidia/cuda:12.2.2-runtime-ubuntu22.04 nvidia-smi

==========
== CUDA ==
==========

CUDA Version 12.2.2

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Wed Oct 25 18:34:56 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  Off |
|  0%   51C    P0    72W / 450W |   1493MiB / 24564MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Before installing docker-ce, you'd get this error:

docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?.
See 'docker run --help'.
JosephKuchar commented 11 months ago

Hi all,

I'm having this same issue. It's perplexing because everything was working as of a few weeks ago, but since we had to reboot the machine, Docker's ability to access the GPU has somehow broken. In my case, running docker with sudo does not make a difference.

sudo docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.

The output of nvidia-smi is the following:

Wed Oct 25 16:30:24 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Quadro RTX 5000                 On | 00000000:B3:00.0 Off |                  Off |
| 33%   28C    P8               13W / 230W|     71MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2463      G   /usr/lib/xorg/Xorg                           63MiB |
|    0   N/A  N/A      2957      G   /usr/bin/gnome-shell                          5MiB |
+---------------------------------------------------------------------------------------+

I also edited the /etc/nvidia-container-toolkit/config.toml by uncommenting the #debug = line in that section. The error suggests it's not able to find the nvidia devices:

I1026 12:51:01.432712 1726142 nvc.c:376] initializing library context (version=1.12.0, build=7678e1af094d865441d0bc1b97...)
I1026 12:51:01.432857 1726142 nvc.c:350] using root /
I1026 12:51:01.432876 1726142 nvc.c:351] using ldcache /etc/ld.so.cache
I1026 12:51:01.432891 1726142 nvc.c:352] using unprivileged user 65534:65534
I1026 12:51:01.432931 1726142 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I1026 12:51:01.433368 1726142 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W1026 12:51:01.436341 1726142 nvc.c:258] failed to detect NVIDIA devices
I1026 12:51:01.436817 1726149 nvc.c:278] loading kernel module nvidia
I1026 12:51:01.437436 1726149 nvc.c:282] running mknod for /dev/nvidiactl
I1026 12:51:01.437525 1726149 nvc.c:290] running mknod for all nvcaps in /dev/nvidia-caps
I1026 12:51:01.453626 1726149 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabi...
I1026 12:51:01.453787 1726149 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabi...
I1026 12:51:01.456764 1726149 nvc.c:296] loading kernel module nvidia_uvm
I1026 12:51:01.456957 1726149 nvc.c:300] running mknod for /dev/nvidia-uvm
I1026 12:51:01.457046 1726149 nvc.c:305] loading kernel module nvidia_modeset
I1026 12:51:01.457288 1726149 nvc.c:309] running mknod for /dev/nvidia-modeset
I1026 12:51:01.458019 1726150 rpc.c:71] starting driver rpc service
I1026 12:51:01.459088 1726142 rpc.c:132] driver rpc service terminated with signal 15
I1026 12:51:01.459205 1726142 nvc.c:434] shutting down library context

As I said, this was working a few weeks ago, so I'm not sure what's changed. We haven't updated any drivers or anything of that nature that I'm aware of. Any help appreciated!

bkocis commented 11 months ago

@JosephKuchar try reinstalling docker. I had a similar problem, with the issue being the missing runtime (see docker info); the solution for me was to reinstall docker: https://github.com/NVIDIA/nvidia-docker/issues/1648#issuecomment-1785033393
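For example, a quick check that the nvidia runtime is registered with the engine:

docker info | grep -i runtime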

destefy commented 11 months ago

Thanks @bkocis! Worked like a charm!

archenroot commented 10 months ago

Hi guys,

I hit the same issue on Ubuntu 22.04 LTS. I followed the instructions to reinstall as below (note I had also installed docker-desktop initially):

sudo apt-get purge docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin docker-ce-rootless-extras
sudo rm -rf /var/lib/docker
sudo rm -rf /var/lib/containerd
for pkg in docker.io docker-doc docker-compose docker-compose-v2 podman-docker containerd runc; do sudo apt-get remove $pkg; done
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo systemctl restart docker

After that I was able to run the sample from the comment above (docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark) and I was finally able to run RAPIDS: docker run --gpus all --pull always --rm -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p 8888:8888 -p 8787:8787 -p 8786:8786 rapidsai/notebooks:23.12a-cuda11.2-py3.10

At the moment docker-desktop is uninstalled. I will try to install it again and run the tests.

archenroot commented 10 months ago

I additionally installed docker-desktop again and rebooted, and GPU in containers still works.

goldwater668 commented 10 months ago

@johndpope I installed docker-desktop on Windows 10. The graphics card driver is 546.01 and the following error is reported. How should I solve it?

docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.

archenroot commented 10 months ago

@goldwater668 It seems that reinstalling the Docker engine helps on Linux machines, and as per my understanding the root cause is probably somewhere in docker-desktop, but I didn't find the root cause myself.

goldwater668 commented 10 months ago

@elezar nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.

johndpope commented 10 months ago

@goldwater668 I see you're on Windows; try running things as administrator.

Starting Docker image cog-cog-svd-base and running setup()...
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
ⅹ Failed to start container: exit status 127

Running with sudo fixed this for me.

lihkinVerma commented 9 months ago

(quoting @archenroot's reinstall instructions from the comment above)

This worked for me. Thank you so much.

combofish commented 9 months ago

(quoting @archenroot's Docker reinstall steps above)

This worked for me. Thank you so much.

huangpan2507 commented 9 months ago

(quoting @archenroot's Docker reinstall steps above)

Thanks, I had the same issue and this solution helped me! But please note: all Docker images will be deleted because of sudo rm -rf /var/lib/docker!
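If you need to keep any images before wiping /var/lib/docker, you can export and re-import them; a minimal sketch (the image name is just an example):

# before the purge: save the images you care about to a tarball
docker save -o saved-images.tar rapidsai/notebooks:23.12a-cuda11.2-py3.10
# after Docker is reinstalled: load them back
docker load -i saved-images.tar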

zebin-huang commented 8 months ago

I had the same issue. For me, a reinstall of Docker fixed it. I ran the following as a bash script:

sudo apt-get update
sudo apt install apt-transport-https ca-certificates curl software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu focal stable"
apt-cache policy docker-ce
sudo apt install docker-ce

This works for me!

sjmach commented 8 months ago

(quoting @zebin-huang's Docker reinstall script above)

This works for me too. Please substitute the appropriate Ubuntu codename in the fourth line (the add-apt-repository command); you can find it by running lsb_release -a in your terminal first.
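To avoid hard-coding the codename, the repository line can also pick it up automatically via lsb_release -cs; a sketch for an Ubuntu host:

sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"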

clh15683 commented 7 months ago

If you encounter this with Docker Desktop, make sure that you enable the WSL integration for your distribution under Settings -> Resources -> WSL Integration. It seems that Docker Desktop occasionally forgets this setting on updates.
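A quick way to verify that the integration is back, run inside the WSL distribution selected on that settings page (this assumes the Windows NVIDIA driver already exposes the GPU to WSL):

# the driver should be visible inside the WSL distro
nvidia-smi
# and containers started from inside WSL should see the GPU
docker run --rm --gpus all ubuntu nvidia-smi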

GabrielDornelles commented 7 months ago

I followed this conversation last night, turned off the PC, and everything was good. Today I went back to work and it wasn't working anymore, with the same error about not finding libnvidia-ml.so.1.

I don't really know how to solve this; it's a packaging issue, as others have pointed out. What I had to do to make it work again was:

sudo snap remove --purge docker

to remove the snap-installed Docker (the previous long shell sequence is what worked before, but now it doesn't),

and then reinstall everything from the official Docker instructions:

# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

gerardo8a commented 7 months ago

I had the issue with the NVIDIA library as well. After looking at one of my working nodes, the /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml was different: the following two values were set differently, and that was the root of the error.

working

....
[nvidia-container-cli]
  environment = []
  ldconfig = "@/sbin/ldconfig.real"
  load-kmods = true
  path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
  root = "/"
...

Not working

...
[nvidia-container-cli]
  environment = []
  ldconfig = "@/run/nvidia/driver/sbin/ldconfig.real"
  load-kmods = true
  path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
  root = "/run/nvidia/driver"
...

elezar commented 7 months ago

@gerardo8a how were your NVIDIA Container Toolkit and the NVIDIA driver installed? Your non-working config seems to reference a containerized driver installation (usually under the GPU Operator), whereas your working config references a driver installation on the host (note the root and ldconfig values).
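For a host-installed driver, one way to bring a broken config back in line with the working one above is to rewrite those two values. This is only a rough sketch: back up the file first, and do not apply it on GPU Operator-managed nodes where the containerized driver paths are intentional.

# inspect the values mentioned above
grep -E 'ldconfig|root' /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml

# point them back at the host driver
sudo sed -i \
  -e 's|@/run/nvidia/driver/sbin/ldconfig.real|@/sbin/ldconfig.real|' \
  -e 's|root = "/run/nvidia/driver"|root = "/"|' \
  /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml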

iganev commented 5 months ago

(quoting @GabrielDornelles' snap removal and Docker reinstall steps above)

As Gabe describes here, if reinstalling Docker (apt reinstall docker-ce) only fixes your issue temporarily, make sure you don't have another Docker installed through snap (snap list).
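A quick sketch for spotting a conflicting snap-installed Docker next to the apt one:

# which docker binary is actually first on PATH?
which docker
# is there a snap copy installed alongside the apt packages?
snap list | grep -i docker
dpkg -l | grep -E 'docker-ce|containerd'
# if both exist, removing the snap copy keeps the apt install intact
sudo snap remove --purge docker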

zhangxianwei2015 commented 5 months ago

In my case, configuring the container runtime for Docker running in [Rootless mode](https://docs.docker.com/engine/security/rootless/) worked for me.
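For reference, the rootless-mode configuration from the toolkit's install guide looks roughly like this (a sketch; paths assume the default rootless Docker setup):

# register the nvidia runtime with the per-user Docker daemon config
nvidia-ctk runtime configure --runtime=docker --config=$HOME/.config/docker/daemon.json
systemctl --user restart docker

# allow nvidia-container-cli to run without host cgroup access
sudo nvidia-ctk config --set nvidia-container-cli.no-cgroups --in-place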

Robotgir commented 2 months ago

"apt-get upgrade" helped me resolving the issue