NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

Couldn't find libnvidia-ml.so library in your system #299

Closed: PhilipDeegan closed this issue 10 months ago

PhilipDeegan commented 4 years ago

System: Debian 10 buster-backports

See: https://github.com/NVIDIA/nvidia-docker/issues/854

The comment solves it: https://github.com/NVIDIA/nvidia-docker/issues/854#issuecomment-435420781

klueska commented 4 years ago

Can you run nvidia-container-cli -k -d /dev/tty info and provide the output?

PhilipDeegan commented 4 years ago

nvidia-container-cli -k -d /dev/tty info

-- WARNING, the following logs are for debugging purposes only --

I0605 14:23:04.356552 3849 nvc.c:281] initializing library context (version=1.1.1, build=e5d6156aba457559979597c8e3d22c5d8d0622db)
I0605 14:23:04.356672 3849 nvc.c:255] using root /
I0605 14:23:04.356703 3849 nvc.c:256] using ldcache /etc/ld.so.cache
I0605 14:23:04.356735 3849 nvc.c:257] using unprivileged user 1000:1000
W0605 14:23:04.485360 3850 nvc.c:186] failed to set inheritable capabilities
W0605 14:23:04.485463 3850 nvc.c:187] skipping kernel modules load due to failure
I0605 14:23:04.486029 3851 driver.c:101] starting driver service
I0605 14:23:04.491408 3849 nvc_info.c:541] requesting driver information with ''
I0605 14:23:04.492365 3849 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.440.82
I0605 14:23:04.492403 3849 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.440.82
I0605 14:23:04.492457 3849 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.440.82
I0605 14:23:04.492506 3849 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.440.82
I0605 14:23:04.492534 3849 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.440.82
I0605 14:23:04.492562 3849 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.440.82
I0605 14:23:04.492590 3849 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.440.82
I0605 14:23:04.492621 3849 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.440.82
I0605 14:23:04.492648 3849 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.440.82
I0605 14:23:04.492716 3849 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-cfg.so.440.82
I0605 14:23:04.492743 3849 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cbl.so.440.82
I0605 14:23:04.492908 3849 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.440.82
I0605 14:23:04.493076 3849 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libGLX_nvidia.so.440.82
I0605 14:23:04.493160 3849 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libGLESv2_nvidia.so.440.82
I0605 14:23:04.493239 3849 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libGLESv1_CM_nvidia.so.440.82
I0605 14:23:04.493321 3849 nvc_info.c:155] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libEGL_nvidia.so.440.82
W0605 14:23:04.493366 3849 nvc_info.c:306] missing library libnvidia-opencl.so
W0605 14:23:04.493384 3849 nvc_info.c:306] missing library libnvidia-allocator.so
W0605 14:23:04.493387 3849 nvc_info.c:306] missing library libnvidia-compiler.so
W0605 14:23:04.493390 3849 nvc_info.c:306] missing library libvdpau_nvidia.so
W0605 14:23:04.493412 3849 nvc_info.c:306] missing library libnvidia-encode.so
W0605 14:23:04.493417 3849 nvc_info.c:306] missing library libnvidia-opticalflow.so
W0605 14:23:04.493422 3849 nvc_info.c:306] missing library libnvcuvid.so
W0605 14:23:04.493428 3849 nvc_info.c:306] missing library libnvidia-fbc.so
W0605 14:23:04.493434 3849 nvc_info.c:306] missing library libnvidia-ifr.so
W0605 14:23:04.493440 3849 nvc_info.c:306] missing library libnvoptix.so
W0605 14:23:04.493446 3849 nvc_info.c:310] missing compat32 library libnvidia-ml.so
W0605 14:23:04.493452 3849 nvc_info.c:310] missing compat32 library libnvidia-cfg.so
W0605 14:23:04.493458 3849 nvc_info.c:310] missing compat32 library libcuda.so
W0605 14:23:04.493463 3849 nvc_info.c:310] missing compat32 library libnvidia-opencl.so
W0605 14:23:04.493469 3849 nvc_info.c:310] missing compat32 library libnvidia-ptxjitcompiler.so
W0605 14:23:04.493475 3849 nvc_info.c:310] missing compat32 library libnvidia-fatbinaryloader.so
W0605 14:23:04.493480 3849 nvc_info.c:310] missing compat32 library libnvidia-allocator.so
W0605 14:23:04.493486 3849 nvc_info.c:310] missing compat32 library libnvidia-compiler.so
W0605 14:23:04.493491 3849 nvc_info.c:310] missing compat32 library libvdpau_nvidia.so
W0605 14:23:04.493497 3849 nvc_info.c:310] missing compat32 library libnvidia-encode.so
W0605 14:23:04.493503 3849 nvc_info.c:310] missing compat32 library libnvidia-opticalflow.so
W0605 14:23:04.493508 3849 nvc_info.c:310] missing compat32 library libnvcuvid.so
W0605 14:23:04.493514 3849 nvc_info.c:310] missing compat32 library libnvidia-eglcore.so
W0605 14:23:04.493520 3849 nvc_info.c:310] missing compat32 library libnvidia-glcore.so
W0605 14:23:04.493526 3849 nvc_info.c:310] missing compat32 library libnvidia-tls.so
W0605 14:23:04.493531 3849 nvc_info.c:310] missing compat32 library libnvidia-glsi.so
W0605 14:23:04.493537 3849 nvc_info.c:310] missing compat32 library libnvidia-fbc.so
W0605 14:23:04.493543 3849 nvc_info.c:310] missing compat32 library libnvidia-ifr.so
W0605 14:23:04.493548 3849 nvc_info.c:310] missing compat32 library libnvidia-rtcore.so
W0605 14:23:04.493554 3849 nvc_info.c:310] missing compat32 library libnvoptix.so
W0605 14:23:04.493560 3849 nvc_info.c:310] missing compat32 library libGLX_nvidia.so
W0605 14:23:04.493566 3849 nvc_info.c:310] missing compat32 library libEGL_nvidia.so
W0605 14:23:04.493572 3849 nvc_info.c:310] missing compat32 library libGLESv2_nvidia.so
W0605 14:23:04.493577 3849 nvc_info.c:310] missing compat32 library libGLESv1_CM_nvidia.so
W0605 14:23:04.493583 3849 nvc_info.c:310] missing compat32 library libnvidia-glvkspirv.so
W0605 14:23:04.493589 3849 nvc_info.c:310] missing compat32 library libnvidia-cbl.so
I0605 14:23:04.494053 3849 nvc_info.c:236] selecting /usr/lib/nvidia/current/nvidia-smi
I0605 14:23:04.494088 3849 nvc_info.c:236] selecting /usr/lib/nvidia/current/nvidia-debugdump
I0605 14:23:04.494104 3849 nvc_info.c:236] selecting /usr/bin/nvidia-persistenced
W0605 14:23:04.494180 3849 nvc_info.c:332] missing binary nvidia-cuda-mps-control
W0605 14:23:04.494184 3849 nvc_info.c:332] missing binary nvidia-cuda-mps-server
I0605 14:23:04.494230 3849 nvc_info.c:373] listing device /dev/nvidiactl
I0605 14:23:04.494235 3849 nvc_info.c:373] listing device /dev/nvidia-uvm
I0605 14:23:04.494240 3849 nvc_info.c:373] listing device /dev/nvidia-uvm-tools
I0605 14:23:04.494245 3849 nvc_info.c:373] listing device /dev/nvidia-modeset
I0605 14:23:04.494269 3849 nvc_info.c:277] listing ipc /run/nvidia-persistenced/socket
W0605 14:23:04.494282 3849 nvc_info.c:281] missing ipc /tmp/nvidia-mps
I0605 14:23:04.494287 3849 nvc_info.c:598] requesting device information with ''
I0605 14:23:04.501948 3849 nvc_info.c:637] listing device /dev/nvidia0 (GPU-7f3a0163-e7e5-79f9-edde-fd270af77272 at 00000000:01:00.0)
NVRM version:   440.82
CUDA version:   10.2

Device Index:   0
Device Minor:   0
Model:          GeForce GTX 1080 with Max-Q Design
Brand:          GeForce
GPU UUID:       GPU-7f3a0163-e7e5-79f9-edde-fd270af77272
Bus Location:   00000000:01:00.0
Architecture:   6.1
I0605 14:23:04.502072 3849 nvc.c:318] shutting down library context
I0605 14:23:04.502734 3851 driver.c:156] terminating driver service
I0605 14:23:04.503508 3849 driver.c:196] driver service terminated successfully

klueska commented 4 years ago

It seems that ldconfig is not being triggered properly by libnvidia-container on your system. I'm not 100% familiar with buster-backports. I know that for Debian 9 and 10 (i.e. without backports) no indirection through ldconfig.real is necessary to get access to the "real" ldconfig (unlike on Debian 8 and prior, as well as all Ubuntu-based systems).

There is a configuration in /etc/nvidia-container-runtime/config.toml that lets nvidia-docker know what your "real" ldconfig is. If for some reason buster-backports has reintroduced an ldconfig.real file, then this configuration file will need to be updated to point to it.

Can you try and tab-complete on the name ldconfig and see if an ldconfig.real file shows up? If so, that is your issue, and you need to customize /etc/nvidia-container-runtime/config.toml appropriately.
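A minimal sketch of that check, assuming the two candidate locations named in this thread (/sbin/ldconfig.real and /sbin/ldconfig). The function name suggest_ldconfig and its optional root argument are hypothetical; it only prints a candidate line for config.toml and changes nothing on the system.

```shell
# Sketch: print a candidate "ldconfig" line for
# /etc/nvidia-container-runtime/config.toml.
# $1 (optional, hypothetical): a root directory to inspect instead of /.
suggest_ldconfig() (
    root="${1:-}"
    # Prefer ldconfig.real when it exists, as on Ubuntu-based systems.
    for candidate in /sbin/ldconfig.real /sbin/ldconfig; do
        if [ -x "${root}${candidate}" ]; then
            printf 'ldconfig = "@%s"\n' "$candidate"
            exit 0
        fi
    done
    echo "no ldconfig found under ${root}/sbin" >&2
    exit 1
)
```

Calling suggest_ldconfig with no argument inspects the running host; compare the printed line with the ldconfig entry already in config.toml before editing anything.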

harperreed commented 4 years ago

Running into this same issue.

System: Debian 10, buster-backports enabled.

I can run nvidia-smi outside of a container with no issue. Once in a container it fails like @Dekken's example.

When I tab-complete ldconfig, it is just ldconfig (not ldconfig.real).

As in other reports of this problem, if I use

docker run --gpus=all --rm nvidia/cuda bash -c "ldconfig;nvidia-smi"

then nvidia-smi works. If I use

docker run --gpus=all --rm nvidia/cuda bash -c "nvidia-smi"

then I get the above error.

regzon commented 4 years ago

The same issue here.

System: Debian Testing (bullseye)

ldconfig is available at path /sbin/ldconfig

Note: /sbin is a symlink to /usr/sbin

Log of nvidia-container-toolkit:


-- WARNING, the following logs are for debugging purposes only --

I0916 13:21:42.294324 22019 nvc.c:282] initializing library context (version=1.2.0, build=d22237acaea94aa5ad5de70aac903534ed598819)
I0916 13:21:42.294442 22019 nvc.c:256] using root /
I0916 13:21:42.294461 22019 nvc.c:257] using ldcache /etc/ld.so.cache
I0916 13:21:42.294477 22019 nvc.c:258] using unprivileged user 65534:65534
I0916 13:21:42.294514 22019 nvc.c:299] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0916 13:21:42.294810 22019 nvc.c:301] dxcore initialization failed, continuing assuming a non-WSL environment
I0916 13:21:42.298069 22023 nvc.c:192] loading kernel module nvidia
I0916 13:21:42.298488 22023 nvc.c:204] loading kernel module nvidia_uvm
I0916 13:21:42.298656 22023 nvc.c:212] loading kernel module nvidia_modeset
I0916 13:21:42.299161 22024 driver.c:101] starting driver service
I0916 13:21:42.303174 22019 nvc_container.c:364] configuring container with 'compute utility supervised'
I0916 13:21:42.303437 22019 nvc_container.c:212] selecting /var/lib/docker/overlay2/fc0032cdaed5f3807bf66ecbf3ea00d728b44a4d66206ec4bf15b06a10ea49a7/merged/usr/local/cuda-10.1/compat/libcuda.so.418.152.00
I0916 13:21:42.303510 22019 nvc_container.c:212] selecting /var/lib/docker/overlay2/fc0032cdaed5f3807bf66ecbf3ea00d728b44a4d66206ec4bf15b06a10ea49a7/merged/usr/local/cuda-10.1/compat/libnvidia-fatbinaryloader.so.418.152.00
I0916 13:21:42.303555 22019 nvc_container.c:212] selecting /var/lib/docker/overlay2/fc0032cdaed5f3807bf66ecbf3ea00d728b44a4d66206ec4bf15b06a10ea49a7/merged/usr/local/cuda-10.1/compat/libnvidia-ptxjitcompiler.so.418.152.00
I0916 13:21:42.303733 22019 nvc_container.c:384] setting pid to 21995
I0916 13:21:42.303745 22019 nvc_container.c:385] setting rootfs to /var/lib/docker/overlay2/fc0032cdaed5f3807bf66ecbf3ea00d728b44a4d66206ec4bf15b06a10ea49a7/merged
I0916 13:21:42.303755 22019 nvc_container.c:386] setting owner to 0:0
I0916 13:21:42.303765 22019 nvc_container.c:387] setting bins directory to /usr/bin
I0916 13:21:42.303775 22019 nvc_container.c:388] setting libs directory to /usr/lib/x86_64-linux-gnu
I0916 13:21:42.303784 22019 nvc_container.c:389] setting libs32 directory to /usr/lib/i386-linux-gnu
I0916 13:21:42.303794 22019 nvc_container.c:390] setting cudart directory to /usr/local/cuda
I0916 13:21:42.303803 22019 nvc_container.c:391] setting ldconfig to @/sbin/ldconfig (host relative)
I0916 13:21:42.303813 22019 nvc_container.c:392] setting mount namespace to /proc/21995/ns/mnt
I0916 13:21:42.303823 22019 nvc_container.c:394] setting devices cgroup to /sys/fs/cgroup/devices/docker/8324917e27bd8d74b848f4d2a73fcc0f580e272562c48686c838786c9f31f6b7
I0916 13:21:42.303838 22019 nvc_info.c:679] requesting driver information with ''
I0916 13:21:42.305835 22019 nvc_info.c:168] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.450.66
I0916 13:21:42.305905 22019 nvc_info.c:168] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.450.66
I0916 13:21:42.306012 22019 nvc_info.c:168] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.450.66
I0916 13:21:42.306110 22019 nvc_info.c:168] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.450.66
I0916 13:21:42.306166 22019 nvc_info.c:168] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.450.66
I0916 13:21:42.306218 22019 nvc_info.c:168] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.450.66
I0916 13:21:42.306273 22019 nvc_info.c:168] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.450.66
I0916 13:21:42.306325 22019 nvc_info.c:168] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.450.66
I0916 13:21:42.306451 22019 nvc_info.c:168] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-cfg.so.450.66
I0916 13:21:42.306506 22019 nvc_info.c:168] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cbl.so.450.66
I0916 13:21:42.306877 22019 nvc_info.c:168] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.450.66
I0916 13:21:42.307172 22019 nvc_info.c:168] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libGLX_nvidia.so.450.66
I0916 13:21:42.307282 22019 nvc_info.c:168] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libGLESv2_nvidia.so.450.66
I0916 13:21:42.307380 22019 nvc_info.c:168] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libGLESv1_CM_nvidia.so.450.66
I0916 13:21:42.307476 22019 nvc_info.c:168] selecting /usr/lib/x86_64-linux-gnu/nvidia/current/libEGL_nvidia.so.450.66
W0916 13:21:42.307538 22019 nvc_info.c:349] missing library libnvidia-opencl.so
W0916 13:21:42.307549 22019 nvc_info.c:349] missing library libnvidia-fatbinaryloader.so
W0916 13:21:42.307559 22019 nvc_info.c:349] missing library libnvidia-allocator.so
W0916 13:21:42.307568 22019 nvc_info.c:349] missing library libnvidia-compiler.so
W0916 13:21:42.307578 22019 nvc_info.c:349] missing library libnvidia-ngx.so
W0916 13:21:42.307587 22019 nvc_info.c:349] missing library libvdpau_nvidia.so
W0916 13:21:42.307597 22019 nvc_info.c:349] missing library libnvidia-encode.so
W0916 13:21:42.307607 22019 nvc_info.c:349] missing library libnvidia-opticalflow.so
W0916 13:21:42.307616 22019 nvc_info.c:349] missing library libnvcuvid.so
W0916 13:21:42.307626 22019 nvc_info.c:349] missing library libnvidia-fbc.so
W0916 13:21:42.307635 22019 nvc_info.c:349] missing library libnvidia-ifr.so
W0916 13:21:42.307645 22019 nvc_info.c:349] missing library libnvoptix.so
W0916 13:21:42.307654 22019 nvc_info.c:353] missing compat32 library libnvidia-ml.so
W0916 13:21:42.307664 22019 nvc_info.c:353] missing compat32 library libnvidia-cfg.so
W0916 13:21:42.307674 22019 nvc_info.c:353] missing compat32 library libcuda.so
W0916 13:21:42.307683 22019 nvc_info.c:353] missing compat32 library libnvidia-opencl.so
W0916 13:21:42.307693 22019 nvc_info.c:353] missing compat32 library libnvidia-ptxjitcompiler.so
W0916 13:21:42.307702 22019 nvc_info.c:353] missing compat32 library libnvidia-fatbinaryloader.so
W0916 13:21:42.307712 22019 nvc_info.c:353] missing compat32 library libnvidia-allocator.so
W0916 13:21:42.307722 22019 nvc_info.c:353] missing compat32 library libnvidia-compiler.so
W0916 13:21:42.307731 22019 nvc_info.c:353] missing compat32 library libnvidia-ngx.so
W0916 13:21:42.307741 22019 nvc_info.c:353] missing compat32 library libvdpau_nvidia.so
W0916 13:21:42.307750 22019 nvc_info.c:353] missing compat32 library libnvidia-encode.so
W0916 13:21:42.307760 22019 nvc_info.c:353] missing compat32 library libnvidia-opticalflow.so
W0916 13:21:42.307769 22019 nvc_info.c:353] missing compat32 library libnvcuvid.so
W0916 13:21:42.307779 22019 nvc_info.c:353] missing compat32 library libnvidia-eglcore.so
W0916 13:21:42.307788 22019 nvc_info.c:353] missing compat32 library libnvidia-glcore.so
W0916 13:21:42.307798 22019 nvc_info.c:353] missing compat32 library libnvidia-tls.so
W0916 13:21:42.307808 22019 nvc_info.c:353] missing compat32 library libnvidia-glsi.so
W0916 13:21:42.307817 22019 nvc_info.c:353] missing compat32 library libnvidia-fbc.so
W0916 13:21:42.307827 22019 nvc_info.c:353] missing compat32 library libnvidia-ifr.so
W0916 13:21:42.307836 22019 nvc_info.c:353] missing compat32 library libnvidia-rtcore.so
W0916 13:21:42.307846 22019 nvc_info.c:353] missing compat32 library libnvoptix.so
W0916 13:21:42.307855 22019 nvc_info.c:353] missing compat32 library libGLX_nvidia.so
W0916 13:21:42.307865 22019 nvc_info.c:353] missing compat32 library libEGL_nvidia.so
W0916 13:21:42.307874 22019 nvc_info.c:353] missing compat32 library libGLESv2_nvidia.so
W0916 13:21:42.307884 22019 nvc_info.c:353] missing compat32 library libGLESv1_CM_nvidia.so
W0916 13:21:42.307893 22019 nvc_info.c:353] missing compat32 library libnvidia-glvkspirv.so
W0916 13:21:42.307903 22019 nvc_info.c:353] missing compat32 library libnvidia-cbl.so
I0916 13:21:42.308377 22019 nvc_info.c:275] selecting /usr/lib/nvidia/current/nvidia-smi
I0916 13:21:42.308445 22019 nvc_info.c:275] selecting /usr/lib/nvidia/current/nvidia-debugdump
I0916 13:21:42.308476 22019 nvc_info.c:275] selecting /usr/bin/nvidia-persistenced
W0916 13:21:42.308822 22019 nvc_info.c:375] missing binary nvidia-cuda-mps-control
W0916 13:21:42.308837 22019 nvc_info.c:375] missing binary nvidia-cuda-mps-server
I0916 13:21:42.308878 22019 nvc_info.c:437] listing device /dev/nvidiactl
I0916 13:21:42.308888 22019 nvc_info.c:437] listing device /dev/nvidia-uvm
I0916 13:21:42.308898 22019 nvc_info.c:437] listing device /dev/nvidia-uvm-tools
I0916 13:21:42.308907 22019 nvc_info.c:437] listing device /dev/nvidia-modeset
I0916 13:21:42.308948 22019 nvc_info.c:316] listing ipc /run/nvidia-persistenced/socket
W0916 13:21:42.308973 22019 nvc_info.c:320] missing ipc /tmp/nvidia-mps
I0916 13:21:42.308984 22019 nvc_info.c:744] requesting device information with ''
I0916 13:21:42.316877 22019 nvc_info.c:627] listing device /dev/nvidia0 (GPU-6064a007-a943-7f11-1ad7-12ac87046652 at 00000000:01:00.0)
I0916 13:21:42.317002 22019 nvc_mount.c:309] mounting tmpfs at /var/lib/docker/overlay2/fc0032cdaed5f3807bf66ecbf3ea00d728b44a4d66206ec4bf15b06a10ea49a7/merged/proc/driver/nvidia
I0916 13:21:42.317443 22019 nvc_mount.c:77] mounting /usr/lib/nvidia/current/nvidia-smi at /var/lib/docker/overlay2/fc0032cdaed5f3807bf66ecbf3ea00d728b44a4d66206ec4bf15b06a10ea49a7/merged/usr/bin/nvidia-smi
I0916 13:21:42.317518 22019 nvc_mount.c:77] mounting /usr/lib/nvidia/current/nvidia-debugdump at /var/lib/docker/overlay2/fc0032cdaed5f3807bf66ecbf3ea00d728b44a4d66206ec4bf15b06a10ea49a7/merged/usr/bin/nvidia-debugdump
I0916 13:21:42.317583 22019 nvc_mount.c:77] mounting /usr/bin/nvidia-persistenced at /var/lib/docker/overlay2/fc0032cdaed5f3807bf66ecbf3ea00d728b44a4d66206ec4bf15b06a10ea49a7/merged/usr/bin/nvidia-persistenced
I0916 13:21:42.317771 22019 nvc_mount.c:77] mounting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.450.66 at /var/lib/docker/overlay2/fc0032cdaed5f3807bf66ecbf3ea00d728b44a4d66206ec4bf15b06a10ea49a7/merged/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.450.66
I0916 13:21:42.317842 22019 nvc_mount.c:77] mounting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-cfg.so.450.66 at /var/lib/docker/overlay2/fc0032cdaed5f3807bf66ecbf3ea00d728b44a4d66206ec4bf15b06a10ea49a7/merged/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.450.66
I0916 13:21:42.317908 22019 nvc_mount.c:77] mounting /usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.450.66 at /var/lib/docker/overlay2/fc0032cdaed5f3807bf66ecbf3ea00d728b44a4d66206ec4bf15b06a10ea49a7/merged/usr/lib/x86_64-linux-gnu/libcuda.so.450.66
I0916 13:21:42.317972 22019 nvc_mount.c:77] mounting /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.450.66 at /var/lib/docker/overlay2/fc0032cdaed5f3807bf66ecbf3ea00d728b44a4d66206ec4bf15b06a10ea49a7/merged/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.450.66
I0916 13:21:42.317999 22019 nvc_mount.c:489] creating symlink /var/lib/docker/overlay2/fc0032cdaed5f3807bf66ecbf3ea00d728b44a4d66206ec4bf15b06a10ea49a7/merged/usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
I0916 13:21:42.318135 22019 nvc_mount.c:77] mounting /var/lib/docker/overlay2/fc0032cdaed5f3807bf66ecbf3ea00d728b44a4d66206ec4bf15b06a10ea49a7/merged/usr/local/cuda-10.1/compat/libcuda.so.418.152.00 at /var/lib/docker/overlay2/fc0032cdaed5f3807bf66ecbf3ea00d728b44a4d66206ec4bf15b06a10ea49a7/merged/usr/lib/x86_64-linux-gnu/libcuda.so.418.152.00
I0916 13:21:42.318203 22019 nvc_mount.c:77] mounting /var/lib/docker/overlay2/fc0032cdaed5f3807bf66ecbf3ea00d728b44a4d66206ec4bf15b06a10ea49a7/merged/usr/local/cuda-10.1/compat/libnvidia-fatbinaryloader.so.418.152.00 at /var/lib/docker/overlay2/fc0032cdaed5f3807bf66ecbf3ea00d728b44a4d66206ec4bf15b06a10ea49a7/merged/usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.418.152.00
I0916 13:21:42.318269 22019 nvc_mount.c:77] mounting /var/lib/docker/overlay2/fc0032cdaed5f3807bf66ecbf3ea00d728b44a4d66206ec4bf15b06a10ea49a7/merged/usr/local/cuda-10.1/compat/libnvidia-ptxjitcompiler.so.418.152.00 at /var/lib/docker/overlay2/fc0032cdaed5f3807bf66ecbf3ea00d728b44a4d66206ec4bf15b06a10ea49a7/merged/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.418.152.00
I0916 13:21:42.318421 22019 nvc_mount.c:204] mounting /run/nvidia-persistenced/socket at /var/lib/docker/overlay2/fc0032cdaed5f3807bf66ecbf3ea00d728b44a4d66206ec4bf15b06a10ea49a7/merged/run/nvidia-persistenced/socket
I0916 13:21:42.318497 22019 nvc_mount.c:173] mounting /dev/nvidiactl at /var/lib/docker/overlay2/fc0032cdaed5f3807bf66ecbf3ea00d728b44a4d66206ec4bf15b06a10ea49a7/merged/dev/nvidiactl
I0916 13:21:42.318530 22019 nvc_mount.c:464] whitelisting device node 195:255
I0916 13:21:42.318606 22019 nvc_mount.c:173] mounting /dev/nvidia-uvm at /var/lib/docker/overlay2/fc0032cdaed5f3807bf66ecbf3ea00d728b44a4d66206ec4bf15b06a10ea49a7/merged/dev/nvidia-uvm
I0916 13:21:42.318632 22019 nvc_mount.c:464] whitelisting device node 241:0
I0916 13:21:42.318690 22019 nvc_mount.c:173] mounting /dev/nvidia-uvm-tools at /var/lib/docker/overlay2/fc0032cdaed5f3807bf66ecbf3ea00d728b44a4d66206ec4bf15b06a10ea49a7/merged/dev/nvidia-uvm-tools
I0916 13:21:42.318715 22019 nvc_mount.c:464] whitelisting device node 241:1
I0916 13:21:42.318787 22019 nvc_mount.c:173] mounting /dev/nvidia0 at /var/lib/docker/overlay2/fc0032cdaed5f3807bf66ecbf3ea00d728b44a4d66206ec4bf15b06a10ea49a7/merged/dev/nvidia0
I0916 13:21:42.318895 22019 nvc_mount.c:377] mounting /proc/driver/nvidia/gpus/0000:01:00.0 at /var/lib/docker/overlay2/fc0032cdaed5f3807bf66ecbf3ea00d728b44a4d66206ec4bf15b06a10ea49a7/merged/proc/driver/nvidia/gpus/0000:01:00.0
I0916 13:21:42.318924 22019 nvc_mount.c:464] whitelisting device node 195:0
I0916 13:21:42.318953 22019 nvc_ldcache.c:359] executing /sbin/ldconfig from host at /var/lib/docker/overlay2/fc0032cdaed5f3807bf66ecbf3ea00d728b44a4d66206ec4bf15b06a10ea49a7/merged
E0916 13:21:42.320363 1 nvc_ldcache.c:390] could not start /sbin/ldconfig: process execution failed: no such file or directory
I0916 13:21:42.320579 22019 nvc.c:337] shutting down library context
I0916 13:21:42.321308 22024 driver.c:156] terminating driver service
I0916 13:21:42.321735 22019 driver.c:196] driver service terminated successfully

The interesting part is: could not start /sbin/ldconfig: process execution failed: no such file or directory

Location of ldconfig is set correctly:

ls -l /sbin/ | grep ldconfig

-rwxr-xr-x 1 root root    950056 Aug  4 18:02 ldconfig

Running ldconfig in the container manually helps. Removing the @ also helps, but not with every image. For example, running nvidia/cuda:11.0-runtime-ubuntu20.04 gives an error:
nvidia-container-cli: ldcache error: process /usr/sbin/ldconfig failed with error code: 127
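The log lines posted in this thread start with a severity letter (I, W or E) followed by the date, so the error can be isolated from the many harmless "missing library" warnings. A small sketch, assuming that line format and assuming the log is written to /var/log/nvidia-container-toolkit.log (the debug path in the config.toml posted later in this thread); show_errors is a hypothetical helper name.

```shell
# Sketch: print only error-severity lines from a libnvidia-container debug log.
# $1 (optional): log file; the default path is an assumption taken from the
# "debug" entry of the config.toml shown in this thread.
show_errors() (
    log="${1:-/var/log/nvidia-container-toolkit.log}"
    # Error lines look like: E0916 13:21:42.320363 1 nvc_ldcache.c:390] ...
    grep -E '^E[0-9]{4} ' "$log"
)
```

On the log above this would print only the nvc_ldcache.c line about /sbin/ldconfig. Note that grep exits with status 1 when no error lines are present.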

klueska commented 4 years ago

As mentioned above in: https://github.com/NVIDIA/nvidia-container-toolkit/issues/299

Does your /etc/nvidia-container-runtime/config.toml file point to the full absolute path of ldconfig? If not, try updating it and see if that fixes things.

regzon commented 4 years ago

Thank you for your reply. I've tested this already and got the same result:
could not start /usr/sbin/ldconfig: process execution failed: no such file or directory

regzon commented 4 years ago

I believe this is the ldconfig path you're asking about.

Some additional info:

> which ldconfig
/usr/sbin/ldconfig

> ls -l /usr/sbin/ | grep ldconfig
-rwxr-xr-x 1 root root    950056 Aug  4 18:02 ldconfig

klueska commented 4 years ago

Is ldconfig under /usr/sbin or just /sbin? I know you said that one is a symlink to the other, but make sure that the actual location of ldconfig is the one in the config file.
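One way to check this is to compare the configured path with its symlink-free location. A minimal sketch, assuming readlink -f is available (GNU coreutils or busybox); check_ldconfig_setting is a hypothetical helper name, and the argument is the value of the ldconfig entry in config.toml.

```shell
# Sketch: report whether a configured ldconfig path passes through a symlink.
# $1: the configured value, e.g. @/sbin/ldconfig
check_ldconfig_setting() (
    setting="$1"
    path="${setting#@}"        # drop the leading @ if present
    real="$(readlink -f "$path")" || exit 1
    if [ "$real" = "$path" ]; then
        echo "ok: $path"
    else
        echo "mismatch: $path resolves to $real"
    fi
)
```

For example, check_ldconfig_setting @/sbin/ldconfig would report a mismatch on a system where /sbin is a symlink to /usr/sbin.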

regzon commented 4 years ago

I've spent some time searching for other possible ldconfig binaries and found nothing. The one in /usr/sbin is not a symlink, and neither is any of its parent directories. /usr/sbin/ldconfig is the location used by the system (as which ldconfig reports). Also, there are no ldconfig.real binaries present (neither in the PATH nor in /usr/sbin).

ls output (showing there are no symlinks):

> ls -l / | grep usr
drwxr-xr-x  14 root root  4096 Sep 21  2019 usr

> ls -l /usr/ | grep sbin
drwxr-xr-x   2 root root 20480 Sep 13 15:32 sbin

> ls -l /usr/sbin/ | grep ldconfig
-rwxr-xr-x 1 root root    950056 Aug  4 18:02 ldconfig

The configuration file:

> cat /etc/nvidia-container-runtime/config.toml
disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
#no-cgroups = false
#user = "root:video"
ldconfig = "@/usr/sbin/ldconfig"

[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"

I hope I'm missing something obvious :)
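For comparing setups across machines, the active ldconfig entry can be pulled out of the file with sed. This is only a sketch: read_ldconfig_setting is a hypothetical helper, and it assumes the entry is written on one line as ldconfig = "..." like in the file above.

```shell
# Sketch: print the value of the uncommented ldconfig entry in config.toml.
# $1 (optional): path to the config file.
read_ldconfig_setting() (
    config="${1:-/etc/nvidia-container-runtime/config.toml}"
    # Commented-out entries start with '#' and therefore do not match.
    sed -n 's/^[[:space:]]*ldconfig[[:space:]]*=[[:space:]]*"\(.*\)".*$/\1/p' "$config"
)
```

With the configuration shown above, the helper would print @/usr/sbin/ldconfig.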

deric commented 2 years ago

It seems to be working for me with the full path pointing to ldconfig.real:

ldconfig = "/sbin/ldconfig.real"

@klueska what is the point of the @ prefix?

ldconfig = "@/sbin/ldconfig"

I'm using nvidia-driver from bullseye-backports/non-free at version 470.94-1~bpo11+1

klueska commented 2 years ago

The @ prefix means: look for the following path on the host and run ldconfig from there. Without the @ prefix, it looks for the path inside the container and executes it.
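The rule can be illustrated with a small sketch. This is not the libnvidia-container implementation; describe_ldconfig and its rootfs argument are hypothetical and only show which file each form of the setting refers to.

```shell
# Sketch: show where an ldconfig setting is looked up.
# $1: value of the ldconfig entry; $2: container rootfs on the host
#     (hypothetical parameter, e.g. the overlay2 "merged" directory).
describe_ldconfig() (
    setting="$1"
    rootfs="$2"
    case "$setting" in
        @*) printf 'host:%s\n' "${setting#@}" ;;            # path on the host
        *)  printf 'container:%s%s\n' "$rootfs" "$setting" ;; # path inside the container
    esac
)
```

This is consistent with the reports above: "/sbin/ldconfig.real" without the @ only works if the container image itself ships that file, which is why removing the @ helps with some images and not others.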

klueska commented 2 years ago

The newest version of nvidia-docker should resolve these issues with ldconfig not properly setting up the library search path on Debian systems before a container gets launched.

Specifically this change in libnvidia-container fixes the issue and is included as part of the latest release: https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/141

The latest release packages for the full nvidia-docker stack:

libnvidia-container1-1.9.0
libnvidia-container-tools-1.9.0
nvidia-container-toolkit-1.9.0
nvidia-container-runtime-3.9.0
nvidia-docker-2.10.0
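To see whether an installed libnvidia-container is at least 1.9.0, the version strings can be compared. A sketch assuming sort -V (version sort, available in GNU coreutils) is present; version_at_least is a hypothetical helper name.

```shell
# Sketch: succeed if version $1 is greater than or equal to version $2.
version_at_least() (
    have="$1"
    want="$2"
    # After a version sort, the lowest of the two must be the wanted one.
    lowest="$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)"
    [ "$lowest" = "$want" ]
)
```

For the versions seen in the logs above, version_at_least 1.2.0 1.9.0 fails, while version_at_least 1.9.0 1.9.0 succeeds. On a Debian system the installed version could be read with dpkg-query -W -f='${Version}' libnvidia-container1.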