NVIDIA / libnvidia-container

NVIDIA container runtime library
Apache License 2.0

[ERROR] nvidia-container-cli: ldcache error: process /usr/sbin/ldconfig failed with error code: 1 #177

Open SolenoidWGT opened 2 years ago

SolenoidWGT commented 2 years ago

Hi, when I follow the documentation to test:

# enroot create --name cuda nvidia+cuda+10.0-base.sqsh
# enroot start --root --rw cuda sh -c 'pwd'

enroot reported an ERROR:

nvidia-container-cli: ldcache error: process /usr/sbin/ldconfig failed with error code: 1
[ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1

The error seems to be GPU related: if I start an ubuntu image there is no error. I'm not sure what is causing it. Looking forward to your reply.

Below is some configuration of my system:

enroot version is 3.4.0
OS is CentOS Linux release 7.6.1810 (Core)

./enroot-check_*.run --verify

Kernel version:

Linux version 3.10.0-957.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-36) (GCC) ) #1 SMP Thu Nov 8 23:39:32 UTC 2018

Kernel configuration:

CONFIG_NAMESPACES : OK
CONFIG_USER_NS : OK
CONFIG_SECCOMP_FILTER : OK
CONFIG_OVERLAY_FS : OK (module)
CONFIG_X86_VSYSCALL_EMULATION : KO (required if glibc <= 2.13)
CONFIG_VSYSCALL_EMULATE : KO (required if glibc <= 2.13)
CONFIG_VSYSCALL_NATIVE : KO (required if glibc <= 2.13)

Kernel command line:

namespace.unpriv_enable=1 : OK
user_namespace.enable=1 : OK
vsyscall=native : KO (required if glibc <= 2.13)
vsyscall=emulate : KO (required if glibc <= 2.13)

Kernel parameters:

user.max_user_namespaces : OK
user.max_mnt_namespaces : OK

Extra packages:

nvidia-container-cli : OK

nvidia-container-cli -V

cli-version: 1.10.0
lib-version: 1.10.0
build date: 2022-06-13T11:19+0000
build revision: 395fd41701117121f1fd04ada01e1d7e006a37ae
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-44)
build platform: x86_64

Here is the nvidia-container-cli output:

# nvidia-container-cli -k -d /dev/tty info

-- WARNING, the following logs are for debugging purposes only --

I0708 09:39:49.685557 216911 nvc.c:376] initializing library context (version=1.10.0, build=395fd41701117121f1fd04ada01e1d7e006a37ae) I0708 09:39:49.685655 216911 nvc.c:350] using root / I0708 09:39:49.685663 216911 nvc.c:351] using ldcache /etc/ld.so.cache I0708 09:39:49.685671 216911 nvc.c:352] using unprivileged user 65534:65534 I0708 09:39:49.685692 216911 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL) I0708 09:39:49.685813 216911 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment I0708 09:39:49.712905 216912 nvc.c:278] loading kernel module nvidia I0708 09:39:49.713415 216912 nvc.c:282] running mknod for /dev/nvidiactl I0708 09:39:49.713482 216912 nvc.c:286] running mknod for /dev/nvidia0 I0708 09:39:49.713528 216912 nvc.c:286] running mknod for /dev/nvidia1 I0708 09:39:49.713568 216912 nvc.c:286] running mknod for /dev/nvidia2 I0708 09:39:49.713607 216912 nvc.c:286] running mknod for /dev/nvidia3 I0708 09:39:49.713645 216912 nvc.c:286] running mknod for /dev/nvidia4 I0708 09:39:49.713683 216912 nvc.c:286] running mknod for /dev/nvidia5 I0708 09:39:49.713721 216912 nvc.c:286] running mknod for /dev/nvidia6 I0708 09:39:49.713760 216912 nvc.c:286] running mknod for /dev/nvidia7 I0708 09:39:49.713798 216912 nvc.c:290] running mknod for all nvcaps in /dev/nvidia-caps I0708 09:39:49.713818 216912 nvc.c:296] loading kernel module nvidia_uvm I0708 09:39:49.713962 216912 nvc.c:300] running mknod for /dev/nvidia-uvm I0708 09:39:49.714065 216912 nvc.c:305] loading kernel module nvidia_modeset I0708 09:39:49.714307 216912 nvc.c:309] running mknod for /dev/nvidia-modeset I0708 09:39:49.714611 216913 rpc.c:71] starting driver rpc service I0708 09:39:49.719789 216917 rpc.c:71] starting nvcgo rpc service I0708 09:39:49.721147 216911 nvc_info.c:766] requesting driver information with '' I0708 09:39:49.723229 216911 nvc_info.c:173] selecting /usr/lib64/vdpau/libvdpau_nvidia.so.418.67 I0708 
09:39:49.723493 216911 nvc_info.c:173] selecting /usr/lib64/libnvidia-ptxjitcompiler.so.418.67 I0708 09:39:49.723590 216911 nvc_info.c:173] selecting /usr/lib64/libnvidia-opticalflow.so.418.67 I0708 09:39:49.723688 216911 nvc_info.c:173] selecting /usr/lib64/libnvidia-opencl.so.418.67 I0708 09:39:49.723749 216911 nvc_info.c:173] selecting /usr/lib64/libnvidia-ml.so.418.67 I0708 09:39:49.723832 216911 nvc_info.c:173] selecting /usr/lib64/libnvidia-ifr.so.418.67 I0708 09:39:49.723918 216911 nvc_info.c:173] selecting /usr/lib64/libnvidia-fbc.so.418.67 I0708 09:39:49.723997 216911 nvc_info.c:173] selecting /usr/lib64/libnvidia-fatbinaryloader.so.418.67 I0708 09:39:49.724057 216911 nvc_info.c:173] selecting /usr/lib64/libnvidia-encode.so.418.67 I0708 09:39:49.724138 216911 nvc_info.c:173] selecting /usr/lib64/libnvidia-compiler.so.418.67 I0708 09:39:49.724203 216911 nvc_info.c:173] selecting /usr/lib64/libnvidia-cfg.so.418.67 I0708 09:39:49.724289 216911 nvc_info.c:173] selecting /usr/lib64/libnvcuvid.so.418.67 I0708 09:39:49.724622 216911 nvc_info.c:173] selecting /usr/lib64/libcuda.so.418.67 I0708 09:39:49.724882 216911 nvc_info.c:173] selecting /usr/lib/vdpau/libvdpau_nvidia.so.418.67 I0708 09:39:49.724946 216911 nvc_info.c:173] selecting /usr/lib/libnvidia-ptxjitcompiler.so.418.67 I0708 09:39:49.725030 216911 nvc_info.c:173] selecting /usr/lib/libnvidia-opticalflow.so.418.67 I0708 09:39:49.725116 216911 nvc_info.c:173] selecting /usr/lib/libnvidia-opencl.so.418.67 I0708 09:39:49.725176 216911 nvc_info.c:173] selecting /usr/lib/libnvidia-ml.so.418.67 I0708 09:39:49.725271 216911 nvc_info.c:173] selecting /usr/lib/libnvidia-ifr.so.418.67 I0708 09:39:49.725352 216911 nvc_info.c:173] selecting /usr/lib/libnvidia-fbc.so.418.67 I0708 09:39:49.725433 216911 nvc_info.c:173] selecting /usr/lib/libnvidia-fatbinaryloader.so.418.67 I0708 09:39:49.725493 216911 nvc_info.c:173] selecting /usr/lib/libnvidia-encode.so.418.67 I0708 09:39:49.725574 216911 nvc_info.c:173] selecting 
/usr/lib/libnvidia-compiler.so.418.67 I0708 09:39:49.725634 216911 nvc_info.c:173] selecting /usr/lib/libnvcuvid.so.418.67 I0708 09:39:49.725715 216911 nvc_info.c:173] selecting /usr/lib/libcuda.so.418.67 W0708 09:39:49.725773 216911 nvc_info.c:399] missing library libnvidia-nscq.so W0708 09:39:49.725798 216911 nvc_info.c:399] missing library libcudadebugger.so W0708 09:39:49.725821 216911 nvc_info.c:399] missing library libnvidia-allocator.so W0708 09:39:49.725844 216911 nvc_info.c:399] missing library libnvidia-pkcs11.so W0708 09:39:49.725863 216911 nvc_info.c:399] missing library libnvidia-ngx.so W0708 09:39:49.725882 216911 nvc_info.c:399] missing library libnvidia-eglcore.so W0708 09:39:49.725905 216911 nvc_info.c:399] missing library libnvidia-glcore.so W0708 09:39:49.725924 216911 nvc_info.c:399] missing library libnvidia-tls.so W0708 09:39:49.725944 216911 nvc_info.c:399] missing library libnvidia-glsi.so W0708 09:39:49.725965 216911 nvc_info.c:399] missing library libnvidia-rtcore.so W0708 09:39:49.725982 216911 nvc_info.c:399] missing library libnvoptix.so W0708 09:39:49.726001 216911 nvc_info.c:399] missing library libGLX_nvidia.so W0708 09:39:49.726019 216911 nvc_info.c:399] missing library libEGL_nvidia.so W0708 09:39:49.726039 216911 nvc_info.c:399] missing library libGLESv2_nvidia.so W0708 09:39:49.726056 216911 nvc_info.c:399] missing library libGLESv1_CM_nvidia.so W0708 09:39:49.726079 216911 nvc_info.c:399] missing library libnvidia-glvkspirv.so W0708 09:39:49.726094 216911 nvc_info.c:399] missing library libnvidia-cbl.so W0708 09:39:49.726112 216911 nvc_info.c:403] missing compat32 library libnvidia-cfg.so W0708 09:39:49.726131 216911 nvc_info.c:403] missing compat32 library libnvidia-nscq.so W0708 09:39:49.726150 216911 nvc_info.c:403] missing compat32 library libcudadebugger.so W0708 09:39:49.726168 216911 nvc_info.c:403] missing compat32 library libnvidia-allocator.so W0708 09:39:49.726188 216911 nvc_info.c:403] missing compat32 library 
libnvidia-pkcs11.so W0708 09:39:49.726215 216911 nvc_info.c:403] missing compat32 library libnvidia-ngx.so W0708 09:39:49.726236 216911 nvc_info.c:403] missing compat32 library libnvidia-eglcore.so W0708 09:39:49.726257 216911 nvc_info.c:403] missing compat32 library libnvidia-glcore.so W0708 09:39:49.726274 216911 nvc_info.c:403] missing compat32 library libnvidia-tls.so W0708 09:39:49.726293 216911 nvc_info.c:403] missing compat32 library libnvidia-glsi.so W0708 09:39:49.726315 216911 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so W0708 09:39:49.726335 216911 nvc_info.c:403] missing compat32 library libnvoptix.so W0708 09:39:49.726356 216911 nvc_info.c:403] missing compat32 library libGLX_nvidia.so W0708 09:39:49.726374 216911 nvc_info.c:403] missing compat32 library libEGL_nvidia.so W0708 09:39:49.726393 216911 nvc_info.c:403] missing compat32 library libGLESv2_nvidia.so W0708 09:39:49.726412 216911 nvc_info.c:403] missing compat32 library libGLESv1_CM_nvidia.so W0708 09:39:49.726431 216911 nvc_info.c:403] missing compat32 library libnvidia-glvkspirv.so W0708 09:39:49.726451 216911 nvc_info.c:403] missing compat32 library libnvidia-cbl.so I0708 09:39:49.727264 216911 nvc_info.c:299] selecting /usr/bin/nvidia-smi I0708 09:39:49.727310 216911 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump I0708 09:39:49.727351 216911 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced I0708 09:39:49.727412 216911 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control I0708 09:39:49.727457 216911 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-server W0708 09:39:49.727547 216911 nvc_info.c:425] missing binary nv-fabricmanager W0708 09:39:49.727605 216911 nvc_info.c:349] missing firmware path /lib/firmware/nvidia/418.67/gsp.bin I0708 09:39:49.727657 216911 nvc_info.c:529] listing device /dev/nvidiactl I0708 09:39:49.727680 216911 nvc_info.c:529] listing device /dev/nvidia-uvm I0708 09:39:49.727707 216911 nvc_info.c:529] listing device 
/dev/nvidia-uvm-tools I0708 09:39:49.727725 216911 nvc_info.c:529] listing device /dev/nvidia-modeset W0708 09:39:49.727771 216911 nvc_info.c:349] missing ipc path /var/run/nvidia-persistenced/socket W0708 09:39:49.727817 216911 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket W0708 09:39:49.727855 216911 nvc_info.c:349] missing ipc path /tmp/nvidia-mps I0708 09:39:49.727877 216911 nvc_info.c:822] requesting device information with '' I0708 09:39:49.735809 216911 nvc_info.c:713] listing device /dev/nvidia0 (GPU-aa4958ab-ed48-381a-3a93-5e784dfdcbe3 at 00000000:3d:00.0) I0708 09:39:49.743712 216911 nvc_info.c:713] listing device /dev/nvidia1 (GPU-9027028f-6573-348a-b8d0-1b02c77e77ee at 00000000:3e:00.0) I0708 09:39:49.751827 216911 nvc_info.c:713] listing device /dev/nvidia2 (GPU-58b36068-2a19-94bc-3523-76716a618e1d at 00000000:3f:00.0) I0708 09:39:49.760136 216911 nvc_info.c:713] listing device /dev/nvidia3 (GPU-c0f16144-d313-62f4-51f0-80c2cad365ff at 00000000:40:00.0) I0708 09:39:49.768680 216911 nvc_info.c:713] listing device /dev/nvidia4 (GPU-633ee798-8ade-266f-64c4-f6acbc1d077f at 00000000:88:00.0) I0708 09:39:49.777572 216911 nvc_info.c:713] listing device /dev/nvidia5 (GPU-7388b983-d4b9-13dd-aa82-4703f809178b at 00000000:89:00.0) I0708 09:39:49.786559 216911 nvc_info.c:713] listing device /dev/nvidia6 (GPU-a7492dc8-5ef5-4554-ce0d-0e38299cade8 at 00000000:8a:00.0) I0708 09:39:49.795745 216911 nvc_info.c:713] listing device /dev/nvidia7 (GPU-3f299160-84f3-f5ab-a82d-1a95e8d41c7b at 00000000:8b:00.0) NVRM version: 418.67 CUDA version: 10.1

Device Index: 0
Device Minor: 0
Model: GeForce GTX 1080 Ti
Brand: GeForce
GPU UUID: GPU-aa4958ab-ed48-381a-3a93-5e784dfdcbe3
Bus Location: 00000000:3d:00.0
Architecture: 6.1

Device Index: 1
Device Minor: 1
Model: GeForce GTX 1080 Ti
Brand: GeForce
GPU UUID: GPU-9027028f-6573-348a-b8d0-1b02c77e77ee
Bus Location: 00000000:3e:00.0
Architecture: 6.1

Device Index: 2
Device Minor: 2
Model: GeForce GTX 1080 Ti
Brand: GeForce
GPU UUID: GPU-58b36068-2a19-94bc-3523-76716a618e1d
Bus Location: 00000000:3f:00.0
Architecture: 6.1

Device Index: 3
Device Minor: 3
Model: GeForce GTX 1080 Ti
Brand: GeForce
GPU UUID: GPU-c0f16144-d313-62f4-51f0-80c2cad365ff
Bus Location: 00000000:40:00.0
Architecture: 6.1

Device Index: 4
Device Minor: 4
Model: GeForce GTX 1080 Ti
Brand: GeForce
GPU UUID: GPU-633ee798-8ade-266f-64c4-f6acbc1d077f
Bus Location: 00000000:88:00.0
Architecture: 6.1

Device Index: 5
Device Minor: 5
Model: GeForce GTX 1080 Ti
Brand: GeForce
GPU UUID: GPU-7388b983-d4b9-13dd-aa82-4703f809178b
Bus Location: 00000000:89:00.0
Architecture: 6.1

Device Index: 6
Device Minor: 6
Model: GeForce GTX 1080 Ti
Brand: GeForce
GPU UUID: GPU-a7492dc8-5ef5-4554-ce0d-0e38299cade8
Bus Location: 00000000:8a:00.0
Architecture: 6.1

Device Index: 7
Device Minor: 7
Model: GeForce GTX 1080 Ti
Brand: GeForce
GPU UUID: GPU-3f299160-84f3-f5ab-a82d-1a95e8d41c7b
Bus Location: 00000000:8b:00.0
Architecture: 6.1

I0708 09:39:49.796464 216911 nvc.c:434] shutting down library context
I0708 09:39:49.796526 216917 rpc.c:95] terminating nvcgo rpc service
I0708 09:39:49.797025 216911 rpc.c:135] nvcgo rpc service terminated successfully
I0708 09:39:49.800502 216913 rpc.c:95] terminating driver rpc service
I0708 09:39:49.800676 216911 rpc.c:135] driver rpc service terminated successfully

Thanks

flx42 commented 2 years ago

This looks like an issue with libnvidia-container.

Did you try with docker and GPU support? If you can't test that, perhaps try the low-level example in the README of libnvidia-container (https://github.com/nvidia/libnvidia-container#command-line-example). If it fails, you should file an issue against libnvidia-container directly.

3XX0 commented 2 years ago

You can probably see what's happening if you export NVIDIA_DEBUG_LOG=1 before starting the container. But as @flx42 mentioned, it's most likely an issue with libnvidia-container rather than enroot.
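As a rough sketch of that suggestion: the full debug log is long, so it can help to filter it down to the error entries. The `NVIDIA_DEBUG_LOG` variable and the `enroot start` command are taken from this thread. The `errors_only` helper is hypothetical, and it assumes the log uses the glog-style prefix seen in this thread (severity letter followed by MMDD) and is written to the terminal's stdout or stderr.

```shell
# Hypothetical helper: keep only glog-style error entries ("E" + MMDD prefix)
# and the final nvidia-container-cli error message; drop info/warning lines.
errors_only() {
    grep -E '^(E[0-9]{4} |nvidia-container-cli: )'
}

# Usage (commands as given in this thread; assumes one log entry per line):
#   export NVIDIA_DEBUG_LOG=1
#   enroot start --root --rw cuda sh -c 'pwd' 2>&1 | errors_only
```

Warnings such as "missing library" are dropped by this filter on purpose; widen the pattern to `^[EW][0-9]{4} ` if those are of interest.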

SolenoidWGT commented 2 years ago

Thanks @flx42

> This looks like an issue with libnvidia-container.

> Did you try with docker and GPU support?

Yes, I can start the container with docker normally and execute nvidia-smi.

> If you can't test that, perhaps test the low-level example in the README of libnvidia-container: https://github.com/nvidia/libnvidia-container#command-line-example If it fails, you should file an issue against libnvidia-container directly.

nvidia-container-cli executes the following command normally:

nvidia-container-cli --load-kmods configure --ldconfig=/usr/bin/ldconfig --no-cgroups --utility --device 0 $(pwd)

But when I execute the next command, I encounter an error:

pivot_root . mnt
pivot_root: failed to change root from '.' to 'mnt': Device or resource busy

SolenoidWGT commented 2 years ago

Thanks @3XX0

> You can probably see what's happening if you export NVIDIA_DEBUG_LOG=1 (see here) before starting the container. But as @flx42 mentioned it's most likely an issue with libnvidia-container rather than enroot

When I export NVIDIA_DEBUG_LOG=1, the fatal error is as follows:

I0711 07:13:39.545694 40071 nvc_ldcache.c:372] executing /usr/sbin/ldconfig from host at /root/.local/share/enroot/cuda
E0711 07:13:39.552177 1 nvc_ldcache.c:403] could not start /usr/sbin/ldconfig: mount operation failed: /dev: operation not permitted

The full output is as follows:

enroot start --root --rw cuda sh -c 'pwd'

-- WARNING, the following logs are for debugging purposes only --

I0711 07:13:39.464482 40071 nvc.c:376] initializing library context (version=1.10.0, build=395fd41701117121f1fd04ada01e1d7e006a37ae) I0711 07:13:39.464557 40071 nvc.c:350] using root / I0711 07:13:39.464579 40071 nvc.c:351] using ldcache /etc/ld.so.cache I0711 07:13:39.464598 40071 nvc.c:352] using unprivileged user 0:0 I0711 07:13:39.464633 40071 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL) I0711 07:13:39.464751 40071 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment I0711 07:13:39.464870 40081 rpc.c:71] starting driver rpc service I0711 07:13:39.469141 40085 rpc.c:71] starting nvcgo rpc service I0711 07:13:39.470253 40071 nvc_container.c:240] configuring container with 'no-cgroups compute utility standalone' I0711 07:13:39.470403 40071 nvc_container.c:88] selecting /root/.local/share/enroot/cuda/usr/local/cuda-10.0/compat/libcuda.so.410.129 I0711 07:13:39.470470 40071 nvc_container.c:88] selecting /root/.local/share/enroot/cuda/usr/local/cuda-10.0/compat/libnvidia-fatbinaryloader.so.410.129 I0711 07:13:39.470511 40071 nvc_container.c:88] selecting /root/.local/share/enroot/cuda/usr/local/cuda-10.0/compat/libnvidia-ptxjitcompiler.so.410.129 I0711 07:13:39.470552 40071 nvc_container.c:262] setting pid to 39993 I0711 07:13:39.470567 40071 nvc_container.c:263] setting rootfs to /root/.local/share/enroot/cuda I0711 07:13:39.470583 40071 nvc_container.c:264] setting owner to 0:0 I0711 07:13:39.470597 40071 nvc_container.c:265] setting bins directory to /usr/bin I0711 07:13:39.470610 40071 nvc_container.c:266] setting libs directory to /usr/lib/x86_64-linux-gnu I0711 07:13:39.470626 40071 nvc_container.c:267] setting libs32 directory to /usr/lib/i386-linux-gnu I0711 07:13:39.470642 40071 nvc_container.c:268] setting cudart directory to /usr/local/cuda I0711 07:13:39.470658 40071 nvc_container.c:269] setting ldconfig to @/usr/sbin/ldconfig (host relative) I0711 07:13:39.470673 40071 
nvc_container.c:270] setting mount namespace to /root/.local/share/enroot/cuda/proc/39993/ns/mnt I0711 07:13:39.470694 40071 nvc_info.c:766] requesting driver information with '' I0711 07:13:39.472332 40071 nvc_info.c:173] selecting /usr/lib64/vdpau/libvdpau_nvidia.so.418.67 I0711 07:13:39.472556 40071 nvc_info.c:173] selecting /usr/lib64/libnvidia-ptxjitcompiler.so.418.67 I0711 07:13:39.472636 40071 nvc_info.c:173] selecting /usr/lib64/libnvidia-opticalflow.so.418.67 I0711 07:13:39.472709 40071 nvc_info.c:173] selecting /usr/lib64/libnvidia-opencl.so.418.67 I0711 07:13:39.472760 40071 nvc_info.c:173] selecting /usr/lib64/libnvidia-ml.so.418.67 I0711 07:13:39.472833 40071 nvc_info.c:173] selecting /usr/lib64/libnvidia-ifr.so.418.67 I0711 07:13:39.472908 40071 nvc_info.c:173] selecting /usr/lib64/libnvidia-fbc.so.418.67 I0711 07:13:39.472984 40071 nvc_info.c:173] selecting /usr/lib64/libnvidia-fatbinaryloader.so.418.67 I0711 07:13:39.473046 40071 nvc_info.c:173] selecting /usr/lib64/libnvidia-encode.so.418.67 I0711 07:13:39.473117 40071 nvc_info.c:173] selecting /usr/lib64/libnvidia-compiler.so.418.67 I0711 07:13:39.473169 40071 nvc_info.c:173] selecting /usr/lib64/libnvidia-cfg.so.418.67 I0711 07:13:39.473239 40071 nvc_info.c:173] selecting /usr/lib64/libnvcuvid.so.418.67 I0711 07:13:39.473533 40071 nvc_info.c:173] selecting /usr/lib64/libcuda.so.418.67 I0711 07:13:39.473766 40071 nvc_info.c:173] selecting /usr/lib/vdpau/libvdpau_nvidia.so.418.67 I0711 07:13:39.473819 40071 nvc_info.c:173] selecting /usr/lib/libnvidia-ptxjitcompiler.so.418.67 I0711 07:13:39.473892 40071 nvc_info.c:173] selecting /usr/lib/libnvidia-opticalflow.so.418.67 I0711 07:13:39.473968 40071 nvc_info.c:173] selecting /usr/lib/libnvidia-opencl.so.418.67 I0711 07:13:39.474021 40071 nvc_info.c:173] selecting /usr/lib/libnvidia-ml.so.418.67 I0711 07:13:39.474094 40071 nvc_info.c:173] selecting /usr/lib/libnvidia-ifr.so.418.67 I0711 07:13:39.474162 40071 nvc_info.c:173] selecting 
/usr/lib/libnvidia-fbc.so.418.67 I0711 07:13:39.474229 40071 nvc_info.c:173] selecting /usr/lib/libnvidia-fatbinaryloader.so.418.67 I0711 07:13:39.474278 40071 nvc_info.c:173] selecting /usr/lib/libnvidia-encode.so.418.67 I0711 07:13:39.474346 40071 nvc_info.c:173] selecting /usr/lib/libnvidia-compiler.so.418.67 I0711 07:13:39.474397 40071 nvc_info.c:173] selecting /usr/lib/libnvcuvid.so.418.67 I0711 07:13:39.474468 40071 nvc_info.c:173] selecting /usr/lib/libcuda.so.418.67 W0711 07:13:39.474520 40071 nvc_info.c:399] missing library libnvidia-nscq.so W0711 07:13:39.474538 40071 nvc_info.c:399] missing library libcudadebugger.so W0711 07:13:39.474552 40071 nvc_info.c:399] missing library libnvidia-allocator.so W0711 07:13:39.474567 40071 nvc_info.c:399] missing library libnvidia-pkcs11.so W0711 07:13:39.474580 40071 nvc_info.c:399] missing library libnvidia-ngx.so W0711 07:13:39.474595 40071 nvc_info.c:399] missing library libnvidia-eglcore.so W0711 07:13:39.474610 40071 nvc_info.c:399] missing library libnvidia-glcore.so W0711 07:13:39.474625 40071 nvc_info.c:399] missing library libnvidia-tls.so W0711 07:13:39.474640 40071 nvc_info.c:399] missing library libnvidia-glsi.so W0711 07:13:39.474655 40071 nvc_info.c:399] missing library libnvidia-rtcore.so W0711 07:13:39.474667 40071 nvc_info.c:399] missing library libnvoptix.so W0711 07:13:39.474681 40071 nvc_info.c:399] missing library libGLX_nvidia.so W0711 07:13:39.474693 40071 nvc_info.c:399] missing library libEGL_nvidia.so W0711 07:13:39.474708 40071 nvc_info.c:399] missing library libGLESv2_nvidia.so W0711 07:13:39.474723 40071 nvc_info.c:399] missing library libGLESv1_CM_nvidia.so W0711 07:13:39.474738 40071 nvc_info.c:399] missing library libnvidia-glvkspirv.so W0711 07:13:39.474751 40071 nvc_info.c:399] missing library libnvidia-cbl.so W0711 07:13:39.474766 40071 nvc_info.c:403] missing compat32 library libnvidia-cfg.so W0711 07:13:39.474780 40071 nvc_info.c:403] missing compat32 library libnvidia-nscq.so 
W0711 07:13:39.474794 40071 nvc_info.c:403] missing compat32 library libcudadebugger.so W0711 07:13:39.474807 40071 nvc_info.c:403] missing compat32 library libnvidia-allocator.so W0711 07:13:39.474823 40071 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so W0711 07:13:39.474838 40071 nvc_info.c:403] missing compat32 library libnvidia-ngx.so W0711 07:13:39.474853 40071 nvc_info.c:403] missing compat32 library libnvidia-eglcore.so W0711 07:13:39.474866 40071 nvc_info.c:403] missing compat32 library libnvidia-glcore.so W0711 07:13:39.474879 40071 nvc_info.c:403] missing compat32 library libnvidia-tls.so W0711 07:13:39.474893 40071 nvc_info.c:403] missing compat32 library libnvidia-glsi.so W0711 07:13:39.474908 40071 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so W0711 07:13:39.474927 40071 nvc_info.c:403] missing compat32 library libnvoptix.so W0711 07:13:39.474939 40071 nvc_info.c:403] missing compat32 library libGLX_nvidia.so W0711 07:13:39.474954 40071 nvc_info.c:403] missing compat32 library libEGL_nvidia.so W0711 07:13:39.474968 40071 nvc_info.c:403] missing compat32 library libGLESv2_nvidia.so W0711 07:13:39.474982 40071 nvc_info.c:403] missing compat32 library libGLESv1_CM_nvidia.so W0711 07:13:39.474995 40071 nvc_info.c:403] missing compat32 library libnvidia-glvkspirv.so W0711 07:13:39.475010 40071 nvc_info.c:403] missing compat32 library libnvidia-cbl.so I0711 07:13:39.475481 40071 nvc_info.c:299] selecting /usr/bin/nvidia-smi I0711 07:13:39.475518 40071 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump I0711 07:13:39.475553 40071 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced I0711 07:13:39.475606 40071 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control I0711 07:13:39.475642 40071 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-server W0711 07:13:39.475821 40071 nvc_info.c:425] missing binary nv-fabricmanager W0711 07:13:39.475867 40071 nvc_info.c:349] missing firmware path /lib/firmware/nvidia/418.67/gsp.bin 
I0711 07:13:39.475911 40071 nvc_info.c:529] listing device /dev/nvidiactl I0711 07:13:39.475932 40071 nvc_info.c:529] listing device /dev/nvidia-uvm I0711 07:13:39.475949 40071 nvc_info.c:529] listing device /dev/nvidia-uvm-tools I0711 07:13:39.475965 40071 nvc_info.c:529] listing device /dev/nvidia-modeset W0711 07:13:39.476006 40071 nvc_info.c:349] missing ipc path /var/run/nvidia-persistenced/socket W0711 07:13:39.476045 40071 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket W0711 07:13:39.476075 40071 nvc_info.c:349] missing ipc path /tmp/nvidia-mps I0711 07:13:39.476091 40071 nvc_info.c:822] requesting device information with '' I0711 07:13:39.483806 40071 nvc_info.c:713] listing device /dev/nvidia0 (GPU-076c32dc-9eda-53e9-07f2-78f4e0122817 at 00000000:04:00.0) I0711 07:13:39.491560 40071 nvc_info.c:713] listing device /dev/nvidia1 (GPU-65e5bffa-5410-a25b-fd6f-fd7b6cde9561 at 00000000:05:00.0) I0711 07:13:39.499501 40071 nvc_info.c:713] listing device /dev/nvidia2 (GPU-fec537e9-7eab-0769-8324-3a86feadc202 at 00000000:08:00.0) I0711 07:13:39.507685 40071 nvc_info.c:713] listing device /dev/nvidia3 (GPU-829c70e3-2700-6168-ab2f-390c29ae60f2 at 00000000:09:00.0) I0711 07:13:39.516084 40071 nvc_info.c:713] listing device /dev/nvidia4 (GPU-100098b3-3b99-bb2a-4ab3-39eed30672f0 at 00000000:86:00.0) I0711 07:13:39.524789 40071 nvc_info.c:713] listing device /dev/nvidia5 (GPU-be9ae89e-c64c-3869-2ceb-964e548c79f9 at 00000000:87:00.0) I0711 07:13:39.533548 40071 nvc_info.c:713] listing device /dev/nvidia6 (GPU-a91590f5-a416-95c9-5052-836bbd33a2c0 at 00000000:8a:00.0) I0711 07:13:39.542496 40071 nvc_info.c:713] listing device /dev/nvidia7 (GPU-f76a63fc-6750-1bd5-ddf2-fd956a81f9dc at 00000000:8b:00.0) I0711 07:13:39.542583 40071 nvc_mount.c:366] mounting tmpfs at /root/.local/share/enroot/cuda/proc/driver/nvidia I0711 07:13:39.542965 40071 nvc_mount.c:134] mounting /usr/bin/nvidia-smi at /root/.local/share/enroot/cuda/usr/bin/nvidia-smi I0711 
07:13:39.543023 40071 nvc_mount.c:134] mounting /usr/bin/nvidia-debugdump at /root/.local/share/enroot/cuda/usr/bin/nvidia-debugdump I0711 07:13:39.543074 40071 nvc_mount.c:134] mounting /usr/bin/nvidia-persistenced at /root/.local/share/enroot/cuda/usr/bin/nvidia-persistenced I0711 07:13:39.543130 40071 nvc_mount.c:134] mounting /usr/bin/nvidia-cuda-mps-control at /root/.local/share/enroot/cuda/usr/bin/nvidia-cuda-mps-control I0711 07:13:39.543180 40071 nvc_mount.c:134] mounting /usr/bin/nvidia-cuda-mps-server at /root/.local/share/enroot/cuda/usr/bin/nvidia-cuda-mps-server I0711 07:13:39.543269 40071 nvc_mount.c:134] mounting /usr/lib64/libnvidia-ml.so.418.67 at /root/.local/share/enroot/cuda/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.418.67 I0711 07:13:39.543321 40071 nvc_mount.c:134] mounting /usr/lib64/libnvidia-cfg.so.418.67 at /root/.local/share/enroot/cuda/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.418.67 I0711 07:13:39.543380 40071 nvc_mount.c:134] mounting /usr/lib64/libcuda.so.418.67 at /root/.local/share/enroot/cuda/usr/lib/x86_64-linux-gnu/libcuda.so.418.67 I0711 07:13:39.543431 40071 nvc_mount.c:134] mounting /usr/lib64/libnvidia-opencl.so.418.67 at /root/.local/share/enroot/cuda/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.418.67 I0711 07:13:39.543482 40071 nvc_mount.c:134] mounting /usr/lib64/libnvidia-ptxjitcompiler.so.418.67 at /root/.local/share/enroot/cuda/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.418.67 I0711 07:13:39.543534 40071 nvc_mount.c:134] mounting /usr/lib64/libnvidia-fatbinaryloader.so.418.67 at /root/.local/share/enroot/cuda/usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.418.67 I0711 07:13:39.543585 40071 nvc_mount.c:134] mounting /usr/lib64/libnvidia-compiler.so.418.67 at /root/.local/share/enroot/cuda/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.418.67 I0711 07:13:39.543614 40071 nvc_mount.c:527] creating symlink /root/.local/share/enroot/cuda/usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1 I0711 
07:13:39.543717 40071 nvc_mount.c:134] mounting /root/.local/share/enroot/cuda/usr/local/cuda-10.0/compat/libcuda.so.410.129 at /root/.local/share/enroot/cuda/usr/lib/x86_64-linux-gnu/libcuda.so.410.129 I0711 07:13:39.543772 40071 nvc_mount.c:134] mounting /root/.local/share/enroot/cuda/usr/local/cuda-10.0/compat/libnvidia-fatbinaryloader.so.410.129 at /root/.local/share/enroot/cuda/usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.410.129 I0711 07:13:39.543826 40071 nvc_mount.c:134] mounting /root/.local/share/enroot/cuda/usr/local/cuda-10.0/compat/libnvidia-ptxjitcompiler.so.410.129 at /root/.local/share/enroot/cuda/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.410.129 I0711 07:13:39.543905 40071 nvc_mount.c:230] mounting /dev/nvidiactl at /root/.local/share/enroot/cuda/dev/nvidiactl I0711 07:13:39.544034 40071 nvc_mount.c:230] mounting /dev/nvidia-uvm at /root/.local/share/enroot/cuda/dev/nvidia-uvm I0711 07:13:39.544100 40071 nvc_mount.c:230] mounting /dev/nvidia-uvm-tools at /root/.local/share/enroot/cuda/dev/nvidia-uvm-tools I0711 07:13:39.544195 40071 nvc_mount.c:230] mounting /dev/nvidia0 at /root/.local/share/enroot/cuda/dev/nvidia0 I0711 07:13:39.544303 40071 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:04:00.0 at /root/.local/share/enroot/cuda/proc/driver/nvidia/gpus/0000:04:00.0 I0711 07:13:39.544397 40071 nvc_mount.c:230] mounting /dev/nvidia1 at /root/.local/share/enroot/cuda/dev/nvidia1 I0711 07:13:39.544498 40071 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:05:00.0 at /root/.local/share/enroot/cuda/proc/driver/nvidia/gpus/0000:05:00.0 I0711 07:13:39.544587 40071 nvc_mount.c:230] mounting /dev/nvidia2 at /root/.local/share/enroot/cuda/dev/nvidia2 I0711 07:13:39.544694 40071 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:08:00.0 at /root/.local/share/enroot/cuda/proc/driver/nvidia/gpus/0000:08:00.0 I0711 07:13:39.544784 40071 nvc_mount.c:230] mounting /dev/nvidia3 at /root/.local/share/enroot/cuda/dev/nvidia3 
I0711 07:13:39.544884 40071 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:09:00.0 at /root/.local/share/enroot/cuda/proc/driver/nvidia/gpus/0000:09:00.0 I0711 07:13:39.544981 40071 nvc_mount.c:230] mounting /dev/nvidia4 at /root/.local/share/enroot/cuda/dev/nvidia4 I0711 07:13:39.545083 40071 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:86:00.0 at /root/.local/share/enroot/cuda/proc/driver/nvidia/gpus/0000:86:00.0 I0711 07:13:39.545174 40071 nvc_mount.c:230] mounting /dev/nvidia5 at /root/.local/share/enroot/cuda/dev/nvidia5 I0711 07:13:39.545274 40071 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:87:00.0 at /root/.local/share/enroot/cuda/proc/driver/nvidia/gpus/0000:87:00.0 I0711 07:13:39.545363 40071 nvc_mount.c:230] mounting /dev/nvidia6 at /root/.local/share/enroot/cuda/dev/nvidia6 I0711 07:13:39.545470 40071 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:8a:00.0 at /root/.local/share/enroot/cuda/proc/driver/nvidia/gpus/0000:8a:00.0 I0711 07:13:39.545559 40071 nvc_mount.c:230] mounting /dev/nvidia7 at /root/.local/share/enroot/cuda/dev/nvidia7 I0711 07:13:39.545659 40071 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:8b:00.0 at /root/.local/share/enroot/cuda/proc/driver/nvidia/gpus/0000:8b:00.0 I0711 07:13:39.545694 40071 nvc_ldcache.c:372] executing /usr/sbin/ldconfig from host at /root/.local/share/enroot/cuda E0711 07:13:39.552177 1 nvc_ldcache.c:403] could not start /usr/sbin/ldconfig: mount operation failed: /dev: operation not permitted nvidia-container-cli: ldcache error: process /usr/sbin/ldconfig failed with error code: 1 I0711 07:13:39.563032 40071 nvc.c:434] shutting down library context I0711 07:13:39.563099 40085 rpc.c:95] terminating nvcgo rpc service I0711 07:13:39.563417 40071 rpc.c:135] nvcgo rpc service terminated successfully I0711 07:13:39.570717 40081 rpc.c:95] terminating driver rpc service I0711 07:13:39.570829 40071 rpc.c:135] driver rpc service terminated successfully

elezar commented 2 years ago

@SolenoidWGT would you be able to check whether the NVIDIA Container CLI (libnvidia-container-tools and libnvidia-container1) v1.9.0 shows the same behaviour to determine whether this is a regression in 1.10.0?

Also, could you check the permissions on the /dev/nvidia* device nodes on the host?
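One possible way to collect that information, as a minimal sketch: the `show_dev_perms` helper name is hypothetical, and it assumes GNU coreutils `stat` (the `-c` format option), which is what CentOS ships.

```shell
# Hypothetical helper: print mode, owner:group, hex major,minor device numbers,
# and the path for each argument, using GNU stat format specifiers.
show_dev_perms() {
    stat -c '%A %U:%G %t,%T %n' "$@"
}

# Usage on the host (not inside the container):
#   show_dev_perms /dev/nvidia*
```

A plain `ls -l /dev/nvidia*` shows the same fields; the helper just keeps the output to one compact line per node, which is easier to paste into a reply.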

elezar commented 2 years ago

@klueska this may be caused by https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/merge_requests/141 which was already included in the v1.9.0 release.

klueska commented 2 years ago

Meaning this fix broke something for @SolenoidWGT

SolenoidWGT commented 2 years ago

Sorry for some mistakes in my reply, I will update later.

SolenoidWGT commented 2 years ago

Thanks @elezar

> @SolenoidWGT would you be able to check whether the NVIDIA Container CLI (libnvidia-container-tools and libnvidia-container1) v1.9.0 shows the same behaviour to determine whether this is a regression in 1.10.0?
>
> Also, could you check the permissions on the /dev/nvidia* device nodes on the host?

Yes, I installed version 1.9.0 of the NVIDIA Container CLI by compiling it from source, but the same error is still reported.

All the following commands are executed as root.

enroot start --root --rw cuda sh -c 'pwd'

-- WARNING, the following logs are for debugging purposes only --

I0713 12:22:17.206368 116125 nvc.c:376] initializing library context (version=1.9.0, build=5e135c17d6dbae861ec343e9a8d3a0d2af758a4f) I0713 12:22:17.206478 116125 nvc.c:350] using root / I0713 12:22:17.206511 116125 nvc.c:351] using ldcache /etc/ld.so.cache I0713 12:22:17.206532 116125 nvc.c:352] using unprivileged user 0:0 I0713 12:22:17.206565 116125 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL) I0713 12:22:17.206700 116125 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment I0713 12:22:17.206881 116135 rpc.c:71] starting driver rpc service I0713 12:22:17.212438 116139 rpc.c:71] starting nvcgo rpc service I0713 12:22:17.214007 116125 nvc_container.c:240] configuring container with 'no-cgroups compute utility standalone' I0713 12:22:17.214237 116125 nvc_container.c:88] selecting /root/.local/share/enroot/cuda/usr/local/cuda-10.0/compat/libcuda.so.410.129 I0713 12:22:17.214329 116125 nvc_container.c:88] selecting /root/.local/share/enroot/cuda/usr/local/cuda-10.0/compat/libnvidia-fatbinaryloader.so.410.129 I0713 12:22:17.214387 116125 nvc_container.c:88] selecting /root/.local/share/enroot/cuda/usr/local/cuda-10.0/compat/libnvidia-ptxjitcompiler.so.410.129 I0713 12:22:17.214443 116125 nvc_container.c:262] setting pid to 116047 I0713 12:22:17.214465 116125 nvc_container.c:263] setting rootfs to /root/.local/share/enroot/cuda I0713 12:22:17.214482 116125 nvc_container.c:264] setting owner to 0:0 I0713 12:22:17.214502 116125 nvc_container.c:265] setting bins directory to /usr/bin I0713 12:22:17.214522 116125 nvc_container.c:266] setting libs directory to /usr/lib/x86_64-linux-gnu I0713 12:22:17.214543 116125 nvc_container.c:267] setting libs32 directory to /usr/lib/i386-linux-gnu I0713 12:22:17.214563 116125 nvc_container.c:268] setting cudart directory to /usr/local/cuda I0713 12:22:17.214582 116125 nvc_container.c:269] setting ldconfig to @/usr/sbin/ldconfig.real (host relative) 
I0713 12:22:17.214600 116125 nvc_container.c:270] setting mount namespace to /root/.local/share/enroot/cuda/proc/116047/ns/mnt I0713 12:22:17.214629 116125 nvc_info.c:765] requesting driver information with '' I0713 12:22:17.216767 116125 nvc_info.c:172] selecting /usr/lib64/vdpau/libvdpau_nvidia.so.418.67 I0713 12:22:17.217015 116125 nvc_info.c:172] selecting /usr/lib64/libnvidia-ptxjitcompiler.so.418.67 I0713 12:22:17.217116 116125 nvc_info.c:172] selecting /usr/lib64/libnvidia-opticalflow.so.418.67 I0713 12:22:17.217224 116125 nvc_info.c:172] selecting /usr/lib64/libnvidia-opencl.so.418.67 I0713 12:22:17.217292 116125 nvc_info.c:172] selecting /usr/lib64/libnvidia-ml.so.418.67 I0713 12:22:17.217581 116125 nvc_info.c:172] selecting /usr/lib64/libnvidia-ifr.so.418.67 I0713 12:22:17.217680 116125 nvc_info.c:172] selecting /usr/lib64/libnvidia-fbc.so.418.67 I0713 12:22:17.217773 116125 nvc_info.c:172] selecting /usr/lib64/libnvidia-fatbinaryloader.so.418.67 I0713 12:22:17.217840 116125 nvc_info.c:172] selecting /usr/lib64/libnvidia-encode.so.418.67 I0713 12:22:17.217938 116125 nvc_info.c:172] selecting /usr/lib64/libnvidia-compiler.so.418.67 I0713 12:22:17.218006 116125 nvc_info.c:172] selecting /usr/lib64/libnvidia-cfg.so.418.67 I0713 12:22:17.218103 116125 nvc_info.c:172] selecting /usr/lib64/libnvcuvid.so.418.67 I0713 12:22:17.218428 116125 nvc_info.c:172] selecting /usr/lib64/libcuda.so.418.67 I0713 12:22:17.218683 116125 nvc_info.c:172] selecting /usr/lib/vdpau/libvdpau_nvidia.so.418.67 I0713 12:22:17.218753 116125 nvc_info.c:172] selecting /usr/lib/libnvidia-ptxjitcompiler.so.418.67 I0713 12:22:17.218852 116125 nvc_info.c:172] selecting /usr/lib/libnvidia-opticalflow.so.418.67 I0713 12:22:17.218947 116125 nvc_info.c:172] selecting /usr/lib/libnvidia-opencl.so.418.67 I0713 12:22:17.219014 116125 nvc_info.c:172] selecting /usr/lib/libnvidia-ml.so.418.67 I0713 12:22:17.219111 116125 nvc_info.c:172] selecting /usr/lib/libnvidia-ifr.so.418.67 I0713 12:22:17.219214 
116125 nvc_info.c:172] selecting /usr/lib/libnvidia-fbc.so.418.67 I0713 12:22:17.219308 116125 nvc_info.c:172] selecting /usr/lib/libnvidia-fatbinaryloader.so.418.67 I0713 12:22:17.219373 116125 nvc_info.c:172] selecting /usr/lib/libnvidia-encode.so.418.67 I0713 12:22:17.219465 116125 nvc_info.c:172] selecting /usr/lib/libnvidia-compiler.so.418.67 I0713 12:22:17.219532 116125 nvc_info.c:172] selecting /usr/lib/libnvcuvid.so.418.67 I0713 12:22:17.219629 116125 nvc_info.c:172] selecting /usr/lib/libcuda.so.418.67 W0713 12:22:17.219698 116125 nvc_info.c:398] missing library libnvidia-nscq.so W0713 12:22:17.219719 116125 nvc_info.c:398] missing library libnvidia-allocator.so W0713 12:22:17.219743 116125 nvc_info.c:398] missing library libnvidia-pkcs11.so W0713 12:22:17.219764 116125 nvc_info.c:398] missing library libnvidia-ngx.so W0713 12:22:17.219782 116125 nvc_info.c:398] missing library libnvidia-eglcore.so W0713 12:22:17.219800 116125 nvc_info.c:398] missing library libnvidia-glcore.so W0713 12:22:17.219819 116125 nvc_info.c:398] missing library libnvidia-tls.so W0713 12:22:17.219839 116125 nvc_info.c:398] missing library libnvidia-glsi.so W0713 12:22:17.219857 116125 nvc_info.c:398] missing library libnvidia-rtcore.so W0713 12:22:17.219876 116125 nvc_info.c:398] missing library libnvoptix.so W0713 12:22:17.219896 116125 nvc_info.c:398] missing library libGLX_nvidia.so W0713 12:22:17.219915 116125 nvc_info.c:398] missing library libEGL_nvidia.so W0713 12:22:17.219933 116125 nvc_info.c:398] missing library libGLESv2_nvidia.so W0713 12:22:17.219951 116125 nvc_info.c:398] missing library libGLESv1_CM_nvidia.so W0713 12:22:17.219971 116125 nvc_info.c:398] missing library libnvidia-glvkspirv.so W0713 12:22:17.219991 116125 nvc_info.c:398] missing library libnvidia-cbl.so W0713 12:22:17.220010 116125 nvc_info.c:402] missing compat32 library libnvidia-cfg.so W0713 12:22:17.220028 116125 nvc_info.c:402] missing compat32 library libnvidia-nscq.so W0713 12:22:17.220048 
116125 nvc_info.c:402] missing compat32 library libnvidia-allocator.so W0713 12:22:17.220066 116125 nvc_info.c:402] missing compat32 library libnvidia-pkcs11.so W0713 12:22:17.220088 116125 nvc_info.c:402] missing compat32 library libnvidia-ngx.so W0713 12:22:17.220107 116125 nvc_info.c:402] missing compat32 library libnvidia-eglcore.so W0713 12:22:17.220126 116125 nvc_info.c:402] missing compat32 library libnvidia-glcore.so W0713 12:22:17.220144 116125 nvc_info.c:402] missing compat32 library libnvidia-tls.so W0713 12:22:17.220163 116125 nvc_info.c:402] missing compat32 library libnvidia-glsi.so W0713 12:22:17.220180 116125 nvc_info.c:402] missing compat32 library libnvidia-rtcore.so W0713 12:22:17.220205 116125 nvc_info.c:402] missing compat32 library libnvoptix.so W0713 12:22:17.220225 116125 nvc_info.c:402] missing compat32 library libGLX_nvidia.so W0713 12:22:17.220242 116125 nvc_info.c:402] missing compat32 library libEGL_nvidia.so W0713 12:22:17.220262 116125 nvc_info.c:402] missing compat32 library libGLESv2_nvidia.so W0713 12:22:17.220280 116125 nvc_info.c:402] missing compat32 library libGLESv1_CM_nvidia.so W0713 12:22:17.220300 116125 nvc_info.c:402] missing compat32 library libnvidia-glvkspirv.so W0713 12:22:17.220320 116125 nvc_info.c:402] missing compat32 library libnvidia-cbl.so I0713 12:22:17.221602 116125 nvc_info.c:298] selecting /usr/bin/nvidia-smi I0713 12:22:17.221654 116125 nvc_info.c:298] selecting /usr/bin/nvidia-debugdump I0713 12:22:17.221702 116125 nvc_info.c:298] selecting /usr/bin/nvidia-persistenced I0713 12:22:17.221776 116125 nvc_info.c:298] selecting /usr/bin/nvidia-cuda-mps-control I0713 12:22:17.221824 116125 nvc_info.c:298] selecting /usr/bin/nvidia-cuda-mps-server W0713 12:22:17.222096 116125 nvc_info.c:424] missing binary nv-fabricmanager W0713 12:22:17.222162 116125 nvc_info.c:348] missing firmware path /lib/firmware/nvidia/418.67/gsp.bin I0713 12:22:17.222225 116125 nvc_info.c:528] listing device /dev/nvidiactl I0713 
12:22:17.222245 116125 nvc_info.c:528] listing device /dev/nvidia-uvm I0713 12:22:17.222266 116125 nvc_info.c:528] listing device /dev/nvidia-uvm-tools I0713 12:22:17.222283 116125 nvc_info.c:528] listing device /dev/nvidia-modeset W0713 12:22:17.222340 116125 nvc_info.c:348] missing ipc path /var/run/nvidia-persistenced/socket W0713 12:22:17.222392 116125 nvc_info.c:348] missing ipc path /var/run/nvidia-fabricmanager/socket W0713 12:22:17.222432 116125 nvc_info.c:348] missing ipc path /tmp/nvidia-mps I0713 12:22:17.222451 116125 nvc_info.c:821] requesting device information with '' I0713 12:22:17.230379 116125 nvc_info.c:712] listing device /dev/nvidia0 (GPU-aa4958ab-ed48-381a-3a93-5e784dfdcbe3 at 00000000:3d:00.0) I0713 12:22:17.238298 116125 nvc_info.c:712] listing device /dev/nvidia1 (GPU-9027028f-6573-348a-b8d0-1b02c77e77ee at 00000000:3e:00.0) I0713 12:22:17.246433 116125 nvc_info.c:712] listing device /dev/nvidia2 (GPU-58b36068-2a19-94bc-3523-76716a618e1d at 00000000:3f:00.0) I0713 12:22:17.254773 116125 nvc_info.c:712] listing device /dev/nvidia3 (GPU-c0f16144-d313-62f4-51f0-80c2cad365ff at 00000000:40:00.0) I0713 12:22:17.263360 116125 nvc_info.c:712] listing device /dev/nvidia4 (GPU-633ee798-8ade-266f-64c4-f6acbc1d077f at 00000000:88:00.0) I0713 12:22:17.272155 116125 nvc_info.c:712] listing device /dev/nvidia5 (GPU-7388b983-d4b9-13dd-aa82-4703f809178b at 00000000:89:00.0) I0713 12:22:17.281165 116125 nvc_info.c:712] listing device /dev/nvidia6 (GPU-a7492dc8-5ef5-4554-ce0d-0e38299cade8 at 00000000:8a:00.0) I0713 12:22:17.290396 116125 nvc_info.c:712] listing device /dev/nvidia7 (GPU-3f299160-84f3-f5ab-a82d-1a95e8d41c7b at 00000000:8b:00.0) I0713 12:22:17.290498 116125 nvc_mount.c:366] mounting tmpfs at /root/.local/share/enroot/cuda/proc/driver/nvidia I0713 12:22:17.291007 116125 nvc_mount.c:134] mounting /usr/bin/nvidia-smi at /root/.local/share/enroot/cuda/usr/bin/nvidia-smi I0713 12:22:17.291092 116125 nvc_mount.c:134] mounting 
/usr/bin/nvidia-debugdump at /root/.local/share/enroot/cuda/usr/bin/nvidia-debugdump I0713 12:22:17.291172 116125 nvc_mount.c:134] mounting /usr/bin/nvidia-persistenced at /root/.local/share/enroot/cuda/usr/bin/nvidia-persistenced I0713 12:22:17.291258 116125 nvc_mount.c:134] mounting /usr/bin/nvidia-cuda-mps-control at /root/.local/share/enroot/cuda/usr/bin/nvidia-cuda-mps-control I0713 12:22:17.291337 116125 nvc_mount.c:134] mounting /usr/bin/nvidia-cuda-mps-server at /root/.local/share/enroot/cuda/usr/bin/nvidia-cuda-mps-server I0713 12:22:17.291472 116125 nvc_mount.c:134] mounting /usr/lib64/libnvidia-ml.so.418.67 at /root/.local/share/enroot/cuda/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.418.67 I0713 12:22:17.291551 116125 nvc_mount.c:134] mounting /usr/lib64/libnvidia-cfg.so.418.67 at /root/.local/share/enroot/cuda/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.418.67 I0713 12:22:17.291629 116125 nvc_mount.c:134] mounting /usr/lib64/libcuda.so.418.67 at /root/.local/share/enroot/cuda/usr/lib/x86_64-linux-gnu/libcuda.so.418.67 I0713 12:22:17.291709 116125 nvc_mount.c:134] mounting /usr/lib64/libnvidia-opencl.so.418.67 at /root/.local/share/enroot/cuda/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.418.67 I0713 12:22:17.291788 116125 nvc_mount.c:134] mounting /usr/lib64/libnvidia-ptxjitcompiler.so.418.67 at /root/.local/share/enroot/cuda/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.418.67 I0713 12:22:17.291866 116125 nvc_mount.c:134] mounting /usr/lib64/libnvidia-fatbinaryloader.so.418.67 at /root/.local/share/enroot/cuda/usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.418.67 I0713 12:22:17.291944 116125 nvc_mount.c:134] mounting /usr/lib64/libnvidia-compiler.so.418.67 at /root/.local/share/enroot/cuda/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.418.67 I0713 12:22:17.291987 116125 nvc_mount.c:527] creating symlink /root/.local/share/enroot/cuda/usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1 I0713 12:22:17.292137 116125 nvc_mount.c:134] mounting 
/root/.local/share/enroot/cuda/usr/local/cuda-10.0/compat/libcuda.so.410.129 at /root/.local/share/enroot/cuda/usr/lib/x86_64-linux-gnu/libcuda.so.410.129 I0713 12:22:17.292224 116125 nvc_mount.c:134] mounting /root/.local/share/enroot/cuda/usr/local/cuda-10.0/compat/libnvidia-fatbinaryloader.so.410.129 at /root/.local/share/enroot/cuda/usr/lib/x86_64-linux-gnu/libnvidia-fatbinaryloader.so.410.129 I0713 12:22:17.292303 116125 nvc_mount.c:134] mounting /root/.local/share/enroot/cuda/usr/local/cuda-10.0/compat/libnvidia-ptxjitcompiler.so.410.129 at /root/.local/share/enroot/cuda/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.410.129 I0713 12:22:17.292409 116125 nvc_mount.c:230] mounting /dev/nvidiactl at /root/.local/share/enroot/cuda/dev/nvidiactl I0713 12:22:17.292570 116125 nvc_mount.c:230] mounting /dev/nvidia-uvm at /root/.local/share/enroot/cuda/dev/nvidia-uvm I0713 12:22:17.292665 116125 nvc_mount.c:230] mounting /dev/nvidia-uvm-tools at /root/.local/share/enroot/cuda/dev/nvidia-uvm-tools I0713 12:22:17.292795 116125 nvc_mount.c:230] mounting /dev/nvidia0 at /root/.local/share/enroot/cuda/dev/nvidia0 I0713 12:22:17.292951 116125 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:3d:00.0 at /root/.local/share/enroot/cuda/proc/driver/nvidia/gpus/0000:3d:00.0 I0713 12:22:17.293079 116125 nvc_mount.c:230] mounting /dev/nvidia1 at /root/.local/share/enroot/cuda/dev/nvidia1 I0713 12:22:17.293231 116125 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:3e:00.0 at /root/.local/share/enroot/cuda/proc/driver/nvidia/gpus/0000:3e:00.0 I0713 12:22:17.293356 116125 nvc_mount.c:230] mounting /dev/nvidia2 at /root/.local/share/enroot/cuda/dev/nvidia2 I0713 12:22:17.293502 116125 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:3f:00.0 at /root/.local/share/enroot/cuda/proc/driver/nvidia/gpus/0000:3f:00.0 I0713 12:22:17.293625 116125 nvc_mount.c:230] mounting /dev/nvidia3 at /root/.local/share/enroot/cuda/dev/nvidia3 I0713 12:22:17.293767 116125 
nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:40:00.0 at /root/.local/share/enroot/cuda/proc/driver/nvidia/gpus/0000:40:00.0
I0713 12:22:17.293892 116125 nvc_mount.c:230] mounting /dev/nvidia4 at /root/.local/share/enroot/cuda/dev/nvidia4
I0713 12:22:17.294036 116125 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:88:00.0 at /root/.local/share/enroot/cuda/proc/driver/nvidia/gpus/0000:88:00.0
I0713 12:22:17.294160 116125 nvc_mount.c:230] mounting /dev/nvidia5 at /root/.local/share/enroot/cuda/dev/nvidia5
I0713 12:22:17.294308 116125 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:89:00.0 at /root/.local/share/enroot/cuda/proc/driver/nvidia/gpus/0000:89:00.0
I0713 12:22:17.294433 116125 nvc_mount.c:230] mounting /dev/nvidia6 at /root/.local/share/enroot/cuda/dev/nvidia6
I0713 12:22:17.294577 116125 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:8a:00.0 at /root/.local/share/enroot/cuda/proc/driver/nvidia/gpus/0000:8a:00.0
I0713 12:22:17.294701 116125 nvc_mount.c:230] mounting /dev/nvidia7 at /root/.local/share/enroot/cuda/dev/nvidia7
I0713 12:22:17.294844 116125 nvc_mount.c:440] mounting /proc/driver/nvidia/gpus/0000:8b:00.0 at /root/.local/share/enroot/cuda/proc/driver/nvidia/gpus/0000:8b:00.0
I0713 12:22:17.294899 116125 nvc_ldcache.c:372] executing /usr/sbin/ldconfig.real from host at /root/.local/share/enroot/cuda
E0713 12:22:17.302952 1 nvc_ldcache.c:403] could not start /usr/sbin/ldconfig.real: mount operation failed: /dev: operation not permitted
nvidia-container-cli: ldcache error: process /usr/sbin/ldconfig.real failed with error code: 1
I0713 12:22:17.317543 116125 nvc.c:430] shutting down library context
I0713 12:22:17.317631 116139 rpc.c:95] terminating nvcgo rpc service
I0713 12:22:17.318291 116125 rpc.c:135] nvcgo rpc service terminated successfully
I0713 12:22:17.327402 116135 rpc.c:95] terminating driver rpc service
I0713 12:22:17.327577 116125 rpc.c:135] driver rpc service terminated successfully
[ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
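Most of the debug output is informational; the entries that matter are the `E`-prefixed line and the final `nvidia-container-cli:` summary. As a minimal sketch, assuming the log has been saved to a file with one entry per line (the function name `nvc_log_errors` and the file name in the usage note are hypothetical, not part of any NVIDIA tool), they can be pulled out like this:

```shell
#!/bin/sh
# Hypothetical helper: print only the error-level (E) entries of a
# glog-style log, plus the "nvidia-container-cli:" summary line.
# Assumes one log entry per line; the log file is passed as $1.
nvc_log_errors() {
    # Entries start with a severity letter followed by MMDD, e.g. "E0713 ".
    grep -E '^(E[0-9]{4} |nvidia-container-cli: )' "$1"
}
```

For example, `nvc_log_errors nvc-debug.log` on the log above would leave just the `could not start /usr/sbin/ldconfig.real` entry and the `ldcache error` line.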

I also checked the permissions on the /dev/nvidia* device nodes; everything seems to be OK.

ll /dev/nvidia*

crw-rw-rw- 1 root root 195, 0 Jun 9 16:40 /dev/nvidia0
crw-rw-rw- 1 root root 195, 1 Jun 9 16:40 /dev/nvidia1
crw-rw-rw- 1 root root 195, 2 Jun 9 16:40 /dev/nvidia2
crw-rw-rw- 1 root root 195, 3 Jun 9 16:40 /dev/nvidia3
crw-rw-rw- 1 root root 195, 4 Jun 9 16:40 /dev/nvidia4
crw-rw-rw- 1 root root 195, 5 Jun 9 16:40 /dev/nvidia5
crw-rw-rw- 1 root root 195, 6 Jun 9 16:40 /dev/nvidia6
crw-rw-rw- 1 root root 195, 7 Jun 9 16:40 /dev/nvidia7
crw-rw-rw- 1 root root 195, 255 Jun 9 16:40 /dev/nvidiactl
crw-rw-rw- 1 root root 195, 254 Jul 8 17:39 /dev/nvidia-modeset
crw-rw-rw- 1 root root 230, 0 Jun 24 15:59 /dev/nvidia-uvm
crw-rw-rw- 1 root root 230, 1 Jun 24 15:59 /dev/nvidia-uvm-tools

nvidia-smi

Wed Jul 13 16:26:15 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  On   | 00000000:3D:00.0 Off |                  N/A |
| 29%   28C    P8     9W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  On   | 00000000:3E:00.0 Off |                  N/A |
| 29%   28C    P8     9W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX 108...  On   | 00000000:3F:00.0 Off |                  N/A |
| 29%   28C    P8     8W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX 108...  On   | 00000000:40:00.0 Off |                  N/A |
| 29%   30C    P8     8W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  GeForce GTX 108...  On   | 00000000:88:00.0 Off |                  N/A |
| 29%   28C    P8     8W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  GeForce GTX 108...  On   | 00000000:89:00.0 Off |                  N/A |
| 29%   29C    P8     8W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  GeForce GTX 108...  On   | 00000000:8A:00.0 Off |                  N/A |
| 29%   28C    P8     7W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  GeForce GTX 108...  On   | 00000000:8B:00.0 Off |                  N/A |
| 29%   30C    P8     9W / 250W |      0MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Here are the versions of libnvidia-container-tools and libnvidia-container1.

Although I compiled and installed nvidia-container-runtime from container-toolkit at tag v1.9.0, its version is still displayed as 1.1.2. Is this a misunderstanding on my part?

# nvidia-container-runtime --version

runc version 1.1.2
commit: v1.1.2-0-ga916309
spec: 1.0.2-dev
go: go1.17.11
libseccomp: 2.3.1

# nvidia-container-cli --version

cli-version: 1.9.0
lib-version: 1.9.0
build date: 2022-07-12T13:02+0000
build revision: 5e135c17d6dbae861ec343e9a8d3a0d2af758a4f
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-44)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
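The `1.1.2` shown by `nvidia-container-runtime --version` appears to be the runc version (the output begins with `runc version`), so the values relevant to this issue are the `cli-version` and `lib-version` fields printed by `nvidia-container-cli --version`. A minimal sketch for extracting a single field from that `key: value` output, assuming the format shown above (the function name `nvc_version_field` is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: read "key: value" lines on stdin and print the
# value of the field named in $1 (e.g. cli-version or lib-version).
nvc_version_field() {
    # Split on ": " so values that contain a bare ":" (such as the
    # build date timestamp) are kept intact.
    awk -F': ' -v key="$1" '$1 == key { print $2; exit }'
}
```

Usage would look like `nvidia-container-cli --version | nvc_version_field lib-version`, which for the output above should yield `1.9.0`.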

elezar commented 2 years ago

@SolenoidWGT I have not yet been able to reproduce this locally. Our current hypothesis is that this is caused by the code here, which is triggered when changing the root and was added in the 1.9.0 release to address an issue on some Debian systems.

Would you be able to:

  1. Repeat your test with v1.8.1, or with a later version built without the remount of /dev indicated by the linked code.
  2. Provide the contents of the folder which is to be the container root. (For example, are there any /dev/ nodes there already?)
  3. Provide any additional information (e.g. strace output or SELinux logs if applicable).
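The information requested in items 2 and 3 could be collected roughly as follows. This is only a sketch: the function name `list_rootfs_dev` is hypothetical, it relies on GNU `find` (for `-printf`), and the rootfs path in the usage note is taken from the logs above.

```shell
#!/bin/sh
# Hypothetical helper: list everything already present under <rootfs>/dev,
# one entry per line as "<type> <path relative to dev>".
# Type letters come from GNU find's %y: c = character device,
# b = block device, d = directory, f = regular file, l = symlink.
list_rootfs_dev() {
    rootfs=$1
    if [ ! -d "$rootfs/dev" ]; then
        echo "no dev directory under $rootfs" >&2
        return 1
    fi
    find "$rootfs/dev" -mindepth 1 -printf '%y %P\n' | LC_ALL=C sort
}
```

For the container in this report that would be `list_rootfs_dev /root/.local/share/enroot/cuda`. For item 3, a trace of the failing start can be captured with `strace -f -o enroot.strace enroot start --root --rw cuda sh -c 'pwd'` (the output file name is arbitrary).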

@flx42 @3XX0 would you be able to provide access to a CentOS 7 system so that I can debug this further?