NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

nvidia-container-cli: mount error: namespace association failed #296

Open CodesFarmer opened 3 years ago

CodesFarmer commented 3 years ago

I started a container with a command like

nvidia-docker run -d --name xxx -v /path/to/data:/container/path -it xxx:latest /bin/bash

The container started successfully from the image the first time, and I could use the GPU inside it. Then I tried to enter the container with

docker exec -it container_name /bin/bash

and Docker reported:

nvidia-container-cli: mount error: namespace association failed: /proc/18331/ns/mnt: function not implemented

Then I stopped the container and tried to start a new one with exactly the same command as above, but I still get the error. The command

docker run -d --name xxx -v /path/to/data:/container/path -it xxx:latest /bin/bash

starts a container successfully, but the GPU is not available in it.

I searched for every keyword I could think of to solve the problem, but nothing worked.

My version of nvidia-docker is 17.06.2-ce. The version of nvidia-container-cli is:

version: 1.0.0
build date: 2018-01-11T00:23+0000
build revision: 4a618459e8ba522d834bb2b4c665847fae8ce0ad
build compiler: gcc 4.8.5 20150623 (Red Hat 4.8.5-16)
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

The error when starting a container with nvidia-docker is:

docker: Error response from daemon: oci runtime error: container_linux.go:296: starting container process caused "process_linux.go:398: container init caused \"process_linux.go:381: running prestart hook 1 caused \\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig --device=all --compute --utility --require=cuda>=9.0 --pid=18331 /home/work/docker/devicemapper/mnt/9d23ec20f7616ee3d9b07bbdf8f411165f801cb062864530f4aa755064a950b2/rootfs]\\nnvidia-container-cli: mount error: namespace association failed: /proc/18331/ns/mnt: function not implemented\\n\\"\""

The output of the command nvidia-container-cli -k -d /dev/tty info is (only part of it):

I1118 09:37:52.660599 77880 nvc_info.c:491] listing device /dev/nvidia7 (GPU-284781f8-551f-360d-bdcc-692edf6533f2 at 00000000:00:0f.0)
NVRM version: 440.33.01
CUDA version: 10.2

Device Index: 0
Model: Tesla V100-SXM2-32GB
GPU UUID: GPU-ce2743d2-b6e3-e574-2e44-5ba47e7f0627
Bus Location: 00000000:00:08.0
Architecture: 7.0

Any advice or solutions?
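
A note for readers who hit the same message: "function not implemented" is the standard text for the ENOSYS error, and here it is reported while associating with the mount namespace handle /proc/18331/ns/mnt. Those handles are documented as available since Linux 3.8, so confirming that the host exposes one is a cheap thing to rule out. The sketch below is only a diagnostic under that assumption and not a confirmed cause of this issue; the function name mnt_ns_id and the PROC_ROOT variable are hypothetical.

```shell
# Hypothetical helper; not part of nvidia-container-toolkit.
# Prints the mount namespace identifier for PID $1 (for example "mnt:[4026531840]")
# and fails if the handle does not exist.
# PROC_ROOT is an assumption added so the lookup can be pointed at another tree.
mnt_ns_id() {
    readlink "${PROC_ROOT:-/proc}/$1/ns/mnt"
}

# Example usage on the host, checking the current shell's own PID:
#   mnt_ns_id "$$" || echo "no mount namespace handle; kernel is $(uname -r)"
```

If the handle is missing, the kernel release printed by uname -r is worth including in the report.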

klueska commented 3 years ago

Can you enable logging for the nvidia-container-cli in /etc/nvidia-container-runtime/config.toml and show me its output after (1) starting the first container and (2) attempting the exec?

I.e.:

rm -rf /var/log/nvidia-container-toolkit.log
nvidia-docker run -d --name xxx -v /path/to/data:/container/path -it xxx:latest /bin/bash
docker exec -it container_name /bin/bash
cat /var/log/nvidia-container-toolkit.log
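
For readers unsure how to enable this logging: the packaged config.toml typically ships the nvidia-container-cli debug setting commented out, as a line of the form #debug = "/var/log/nvidia-container-toolkit.log" under the [nvidia-container-cli] section. A minimal sketch for uncommenting it, assuming that layout (check your own file first; the function name enable_cli_debug is hypothetical):

```shell
# Hypothetical helper: uncomment the nvidia-container-cli debug line in a config file.
# Assumes the file contains a line exactly like:
#   #debug = "/var/log/nvidia-container-toolkit.log"
enable_cli_debug() {
    config_file="$1"
    # Keep a backup, then strip the leading '#' from the matching line only.
    cp "$config_file" "$config_file.bak" &&
    sed 's|^#\(debug = "/var/log/nvidia-container-toolkit\.log"\)|\1|' \
        "$config_file.bak" > "$config_file"
}

# Example usage (needs root to write the file):
#   enable_cli_debug /etc/nvidia-container-runtime/config.toml
```

The pattern is anchored to the toolkit log path so that a similar commented-out debug line in another section is left untouched.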

klueska commented 3 years ago

Ah, one thing I just noticed:

My version of nvidia-docker is 17.06.2-ce

We only support version 18.09+. https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#container-runtimes

Could be part of the issue.
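
To check a host against that minimum, one option is to compare the Docker server version with 18.09. A rough sketch, assuming GNU sort with the -V (version sort) option is available; version_ge is a hypothetical helper, not a Docker or NVIDIA command:

```shell
# Hypothetical helper: succeeds if version $1 is greater than or equal to version $2.
# Relies on GNU sort -V to order the two version strings.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example usage:
#   server_version=$(docker version --format '{{.Server.Version}}')
#   if version_ge "$server_version" 18.09; then
#       echo "Docker $server_version meets the 18.09 minimum"
#   else
#       echo "Docker $server_version is older than 18.09"
#   fi
```

With the version quoted above, 17.06.2-ce, the comparison against 18.09 fails.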

TomorrowIsAnOtherDay commented 2 years ago

I have the same problem. Has anyone found a solution?

TomorrowIsAnOtherDay commented 2 years ago

@klueska Hi, this is the log. Can you have a look at it? Any suggestions?

-- WARNING, the following logs are for debugging purposes only --

I1104 16:00:48.626793 212076 nvc.c:372] initializing library context (version=1.5.1, build=4afad130c4c253abd3b2db563ffe9331594bda41)
I1104 16:00:48.626860 212076 nvc.c:346] using root /
I1104 16:00:48.626870 212076 nvc.c:347] using ldcache /etc/ld.so.cache
I1104 16:00:48.626874 212076 nvc.c:348] using unprivileged user 65534:65534
I1104 16:00:48.626895 212076 nvc.c:389] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I1104 16:00:48.626944 212076 nvc.c:391] dxcore initialization failed, continuing assuming a non-WSL environment
I1104 16:00:48.632335 212089 nvc.c:274] loading kernel module nvidia
I1104 16:00:48.633285 212089 nvc.c:278] running mknod for /dev/nvidiactl
I1104 16:00:48.633338 212089 nvc.c:282] running mknod for /dev/nvidia0
I1104 16:00:48.633370 212089 nvc.c:282] running mknod for /dev/nvidia1
I1104 16:00:48.633410 212089 nvc.c:282] running mknod for /dev/nvidia2
I1104 16:00:48.633437 212089 nvc.c:282] running mknod for /dev/nvidia3
I1104 16:00:48.633462 212089 nvc.c:282] running mknod for /dev/nvidia4
I1104 16:00:48.633487 212089 nvc.c:282] running mknod for /dev/nvidia5
I1104 16:00:48.633511 212089 nvc.c:282] running mknod for /dev/nvidia6
I1104 16:00:48.633536 212089 nvc.c:282] running mknod for /dev/nvidia7
I1104 16:00:48.633561 212089 nvc.c:286] running mknod for all nvcaps in /dev/nvidia-caps
I1104 16:00:48.633572 212089 nvc.c:292] loading kernel module nvidia_uvm
I1104 16:00:48.633789 212089 nvc.c:296] running mknod for /dev/nvidia-uvm
I1104 16:00:48.633854 212089 nvc.c:301] loading kernel module nvidia_modeset
I1104 16:00:48.634062 212089 nvc.c:305] running mknod for /dev/nvidia-modeset
I1104 16:00:48.634325 212117 driver.c:101] starting driver service
I1104 16:00:48.638835 212076 nvc_container.c:388] configuring container with 'compute utility supervised'
I1104 16:00:48.639292 212076 nvc_container.c:236] selecting /home/work/docker/overlay2/916b001d09030101a5ec069e0e6676d51a521b806dfc2a3b90ae177b67dc125b/merged/usr/local/cuda-10.1/compat/libcuda.so.418.87.01
I1104 16:00:48.639399 212076 nvc_container.c:236] selecting /home/work/docker/overlay2/916b001d09030101a5ec069e0e6676d51a521b806dfc2a3b90ae177b67dc125b/merged/usr/local/cuda-10.1/compat/libnvidia-fatbinaryloader.so.418.87.01
I1104 16:00:48.639453 212076 nvc_container.c:236] selecting /home/work/docker/overlay2/916b001d09030101a5ec069e0e6676d51a521b806dfc2a3b90ae177b67dc125b/merged/usr/local/cuda-10.1/compat/libnvidia-ptxjitcompiler.so.418.87.01
I1104 16:00:48.639660 212076 nvc_container.c:408] setting pid to 212056
I1104 16:00:48.639664 212076 nvc_container.c:409] setting rootfs to /home/work/docker/overlay2/916b001d09030101a5ec069e0e6676d51a521b806dfc2a3b90ae177b67dc125b/merged
I1104 16:00:48.639669 212076 nvc_container.c:410] setting owner to 0:0
I1104 16:00:48.639672 212076 nvc_container.c:411] setting bins directory to /usr/bin
I1104 16:00:48.639676 212076 nvc_container.c:412] setting libs directory to /usr/lib/x86_64-linux-gnu
I1104 16:00:48.639680 212076 nvc_container.c:413] setting libs32 directory to /usr/lib/i386-linux-gnu
I1104 16:00:48.639684 212076 nvc_container.c:414] setting cudart directory to /usr/local/cuda
I1104 16:00:48.639687 212076 nvc_container.c:415] setting ldconfig to @/sbin/ldconfig (host relative)
I1104 16:00:48.639691 212076 nvc_container.c:416] setting mount namespace to /proc/212056/ns/mnt
I1104 16:00:48.639694 212076 nvc_container.c:418] setting devices cgroup to /cgroups/devices/docker/deb042ef469527cdbb60ad9ebfa7aed96e3c34d86c6edc4a6c3ce52198d6a320
I1104 16:00:48.639702 212076 nvc_info.c:758] requesting driver information with ''
I1104 16:00:48.641215 212076 nvc_info.c:171] selecting /usr/lib64/vdpau/libvdpau_nvidia.so.440.33.01
I1104 16:00:48.641375 212076 nvc_info.c:171] selecting /usr/lib64/libnvoptix.so.440.33.01
I1104 16:00:48.641423 212076 nvc_info.c:171] selecting /usr/lib64/libnvidia-tls.so.440.33.01
I1104 16:00:48.641450 212076 nvc_info.c:171] selecting /usr/lib64/libnvidia-rtcore.so.440.33.01
I1104 16:00:48.641477 212076 nvc_info.c:171] selecting /usr/lib64/libnvidia-ptxjitcompiler.so.440.33.01
I1104 16:00:48.641514 212076 nvc_info.c:171] selecting /usr/lib64/libnvidia-opticalflow.so.440.33.01
I1104 16:00:48.641550 212076 nvc_info.c:171] selecting /usr/lib64/libnvidia-opencl.so.440.33.01
I1104 16:00:48.641573 212076 nvc_info.c:171] selecting /usr/lib64/libnvidia-ml.so.440.33.01
I1104 16:00:48.641611 212076 nvc_info.c:171] selecting /usr/lib64/libnvidia-ifr.so.440.33.01
I1104 16:00:48.641649 212076 nvc_info.c:171] selecting /usr/lib64/libnvidia-glvkspirv.so.440.33.01
I1104 16:00:48.641677 212076 nvc_info.c:171] selecting /usr/lib64/libnvidia-glsi.so.440.33.01
I1104 16:00:48.641702 212076 nvc_info.c:171] selecting /usr/lib64/libnvidia-glcore.so.440.33.01
I1104 16:00:48.641727 212076 nvc_info.c:171] selecting /usr/lib64/libnvidia-fbc.so.440.33.01
I1104 16:00:48.641764 212076 nvc_info.c:171] selecting /usr/lib64/libnvidia-fatbinaryloader.so.440.33.01
I1104 16:00:48.641800 212076 nvc_info.c:171] selecting /usr/lib64/libnvidia-encode.so.440.33.01
I1104 16:00:48.641837 212076 nvc_info.c:171] selecting /usr/lib64/libnvidia-eglcore.so.440.33.01
I1104 16:00:48.641863 212076 nvc_info.c:171] selecting /usr/lib64/libnvidia-compiler.so.440.33.01
I1104 16:00:48.641896 212076 nvc_info.c:171] selecting /usr/lib64/libnvidia-cfg.so.440.33.01
I1104 16:00:48.641930 212076 nvc_info.c:171] selecting /usr/lib64/libnvidia-cbl.so.440.33.01
I1104 16:00:48.641954 212076 nvc_info.c:171] selecting /usr/lib64/libnvidia-allocator.so.440.33.01
I1104 16:00:48.641989 212076 nvc_info.c:171] selecting /usr/lib64/libnvcuvid.so.440.33.01
I1104 16:00:48.642256 212076 nvc_info.c:171] selecting /usr/lib64/libcuda.so.440.33.01
I1104 16:00:48.642438 212076 nvc_info.c:171] selecting /usr/lib64/libGLX_nvidia.so.440.33.01
I1104 16:00:48.642465 212076 nvc_info.c:171] selecting /usr/lib64/libGLESv2_nvidia.so.440.33.01
I1104 16:00:48.642489 212076 nvc_info.c:171] selecting /usr/lib64/libGLESv1_CM_nvidia.so.440.33.01
I1104 16:00:48.642517 212076 nvc_info.c:171] selecting /usr/lib64/libEGL_nvidia.so.440.33.01
I1104 16:00:48.642550 212076 nvc_info.c:171] selecting /usr/lib/vdpau/libvdpau_nvidia.so.440.33.01
I1104 16:00:48.642593 212076 nvc_info.c:171] selecting /usr/lib/libnvidia-tls.so.440.33.01
I1104 16:00:48.642621 212076 nvc_info.c:171] selecting /usr/lib/libnvidia-ptxjitcompiler.so.440.33.01
I1104 16:00:48.642659 212076 nvc_info.c:171] selecting /usr/lib/libnvidia-opticalflow.so.440.33.01
I1104 16:00:48.642695 212076 nvc_info.c:171] selecting /usr/lib/libnvidia-opencl.so.440.33.01
I1104 16:00:48.642719 212076 nvc_info.c:171] selecting /usr/lib/libnvidia-ml.so.440.33.01
I1104 16:00:48.642756 212076 nvc_info.c:171] selecting /usr/lib/libnvidia-ifr.so.440.33.01
I1104 16:00:48.642793 212076 nvc_info.c:171] selecting /usr/lib/libnvidia-glvkspirv.so.440.33.01
I1104 16:00:48.642817 212076 nvc_info.c:171] selecting /usr/lib/libnvidia-glsi.so.440.33.01
I1104 16:00:48.642839 212076 nvc_info.c:171] selecting /usr/lib/libnvidia-glcore.so.440.33.01
I1104 16:00:48.642864 212076 nvc_info.c:171] selecting /usr/lib/libnvidia-fbc.so.440.33.01
I1104 16:00:48.642902 212076 nvc_info.c:171] selecting /usr/lib/libnvidia-fatbinaryloader.so.440.33.01
I1104 16:00:48.642925 212076 nvc_info.c:171] selecting /usr/lib/libnvidia-encode.so.440.33.01
I1104 16:00:48.642960 212076 nvc_info.c:171] selecting /usr/lib/libnvidia-eglcore.so.440.33.01
I1104 16:00:48.642983 212076 nvc_info.c:171] selecting /usr/lib/libnvidia-compiler.so.440.33.01
I1104 16:00:48.643008 212076 nvc_info.c:171] selecting /usr/lib/libnvidia-allocator.so.440.33.01
I1104 16:00:48.643045 212076 nvc_info.c:171] selecting /usr/lib/libnvcuvid.so.440.33.01
I1104 16:00:48.643090 212076 nvc_info.c:171] selecting /usr/lib/libcuda.so.440.33.01
I1104 16:00:48.643133 212076 nvc_info.c:171] selecting /usr/lib/libGLX_nvidia.so.440.33.01
I1104 16:00:48.643157 212076 nvc_info.c:171] selecting /usr/lib/libGLESv2_nvidia.so.440.33.01
I1104 16:00:48.643180 212076 nvc_info.c:171] selecting /usr/lib/libGLESv1_CM_nvidia.so.440.33.01
I1104 16:00:48.643206 212076 nvc_info.c:171] selecting /usr/lib/libEGL_nvidia.so.440.33.01
W1104 16:00:48.643221 212076 nvc_info.c:397] missing library libnvidia-nscq.so
W1104 16:00:48.643225 212076 nvc_info.c:397] missing library libnvidia-ngx.so
W1104 16:00:48.643229 212076 nvc_info.c:401] missing compat32 library libnvidia-cfg.so
W1104 16:00:48.643232 212076 nvc_info.c:401] missing compat32 library libnvidia-nscq.so
W1104 16:00:48.643235 212076 nvc_info.c:401] missing compat32 library libnvidia-ngx.so
W1104 16:00:48.643239 212076 nvc_info.c:401] missing compat32 library libnvidia-rtcore.so
W1104 16:00:48.643242 212076 nvc_info.c:401] missing compat32 library libnvoptix.so
W1104 16:00:48.643246 212076 nvc_info.c:401] missing compat32 library libnvidia-cbl.so
I1104 16:00:48.643423 212076 nvc_info.c:297] selecting /usr/bin/nvidia-smi
I1104 16:00:48.643440 212076 nvc_info.c:297] selecting /usr/bin/nvidia-debugdump
I1104 16:00:48.643461 212076 nvc_info.c:297] selecting /usr/bin/nvidia-persistenced
I1104 16:00:48.643486 212076 nvc_info.c:297] selecting /usr/bin/nvidia-cuda-mps-control
I1104 16:00:48.643503 212076 nvc_info.c:297] selecting /usr/bin/nvidia-cuda-mps-server
W1104 16:00:48.643568 212076 nvc_info.c:423] missing binary nv-fabricmanager
W1104 16:00:48.643584 212076 nvc_info.c:347] missing firmware path /lib/firmware/nvidia/440.33.01
I1104 16:00:48.643607 212076 nvc_info.c:520] listing device /dev/nvidiactl
I1104 16:00:48.643610 212076 nvc_info.c:520] listing device /dev/nvidia-uvm
I1104 16:00:48.643614 212076 nvc_info.c:520] listing device /dev/nvidia-uvm-tools
I1104 16:00:48.643617 212076 nvc_info.c:520] listing device /dev/nvidia-modeset
I1104 16:00:48.643639 212076 nvc_info.c:341] listing ipc path /var/run/nvidia-persistenced/socket
W1104 16:00:48.643653 212076 nvc_info.c:347] missing ipc path /var/run/nvidia-fabricmanager/socket
W1104 16:00:48.643666 212076 nvc_info.c:347] missing ipc path /tmp/nvidia-mps
I1104 16:00:48.643670 212076 nvc_info.c:814] requesting device information with ''
I1104 16:00:48.650236 212076 nvc_info.c:705] listing device /dev/nvidia0 (GPU-82bb69f3-589f-ee4b-956e-801eaf7a6d5d at 00000000:42:00.0)
I1104 16:00:48.656897 212076 nvc_info.c:705] listing device /dev/nvidia1 (GPU-4c400280-f6fe-38a2-b072-3d865f64b0e9 at 00000000:43:00.0)
I1104 16:00:48.663640 212076 nvc_info.c:705] listing device /dev/nvidia2 (GPU-66b910f4-75da-deb0-77ce-035dcff9e035 at 00000000:44:00.0)
I1104 16:00:48.670579 212076 nvc_info.c:705] listing device /dev/nvidia3 (GPU-b9300b30-f2d2-8f15-6ca1-871af93bb8aa at 00000000:45:00.0)
I1104 16:00:48.677677 212076 nvc_info.c:705] listing device /dev/nvidia4 (GPU-4e14ffd2-e125-a1f9-b6a0-936ad13bfe0e at 00000000:49:00.0)
I1104 16:00:48.684963 212076 nvc_info.c:705] listing device /dev/nvidia5 (GPU-93f0083b-0ab5-0c48-7a23-7716f2dfcf0c at 00000000:4a:00.0)
I1104 16:00:48.692426 212076 nvc_info.c:705] listing device /dev/nvidia6 (GPU-f9d4095a-458d-f5e4-1e07-bb70c9bc3a95 at 00000000:4b:00.0)
I1104 16:00:48.700245 212076 nvc_info.c:705] listing device /dev/nvidia7 (GPU-8c946ca2-c932-88c0-537a-2ce98eb6fb79 at 00000000:4c:00.0)
I1104 16:00:48.700326 212076 nvc.c:423] shutting down library context
I1104 16:00:48.702530 212117 driver.c:163] terminating driver service
I1104 16:00:48.703057 212076 driver.c:203] driver service terminated successfully
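
The log lines above follow a glog-style layout: a severity letter (I for info, W for warning) fused with the month and day, then the time, the process ID, and the source file and line. A small sketch for pulling only the non-info lines out of such a log, so that warnings and any errors stand out. The function name non_info_lines is hypothetical, and the assumption that error lines would start with E or F comes from the glog convention rather than from anything shown in this log:

```shell
# Hypothetical helper: print only warning/error/fatal lines from a glog-style log file.
# Matches lines that begin with W, E, or F followed by a four-digit MMDD date.
non_info_lines() {
    grep -E '^[WEF][0-9]{4} ' "$1"
}

# Example usage:
#   non_info_lines /var/log/nvidia-container-toolkit.log
```

grep exits with a non-zero status when nothing matches, which here simply means the log contains info lines only.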