NVIDIA / libnvidia-container

NVIDIA container runtime library
Apache License 2.0

Unable to use more than 5 GPU cards #235

Open junqiang1992 opened 5 months ago

junqiang1992 commented 5 months ago

I am building a Kata Containers environment. The host has 8 GPU cards. If a pod is created with 1 to 5 GPU cards, nvidia-container-cli runs normally, but problems occur when 6 GPUs are used. After debugging, the main cause is that the code calls ns_enter, switches into the container's rootfs, and then cannot find the corresponding directory in mount_procfs. The environment version information is as follows:

nvidia-container-toolkit version

```
root@ubuntu-dev:/# dpkg -l | grep nvidia-container-toolkit
ii  nvidia-container-toolkit       1.14.3-1  amd64  NVIDIA Container toolkit
ii  nvidia-container-toolkit-base  1.14.3-1  amd64  NVIDIA Container Toolkit Base
```

The nvidia-container-cli debug log is as follows:

-- WARNING, the following logs are for debugging purposes only --

I0111 10:30:12.940268 149 nvc.c:376] initializing library context (version=1.14.1, build=1eb5a30a6ad0415550a9df632ac8832bf7e2bbba) I0111 10:30:12.940332 149 nvc.c:350] using root / I0111 10:30:12.940334 149 nvc.c:351] using ldcache /etc/ld.so.cache I0111 10:30:12.940336 149 nvc.c:352] using unprivileged user 65534:65534 I0111 10:30:12.940347 149 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL) I0111 10:30:12.940484 149 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment I0111 10:30:12.943220 179 nvc.c:278] loading kernel module nvidia I0111 10:30:12.943338 179 nvc.c:282] running mknod for /dev/nvidiactl I0111 10:30:12.943362 179 nvc.c:286] running mknod for /dev/nvidia0 I0111 10:30:12.943374 179 nvc.c:286] running mknod for /dev/nvidia1 I0111 10:30:12.943382 179 nvc.c:286] running mknod for /dev/nvidia2 I0111 10:30:12.943390 179 nvc.c:286] running mknod for /dev/nvidia3 I0111 10:30:12.943399 179 nvc.c:286] running mknod for /dev/nvidia4 I0111 10:30:12.943407 179 nvc.c:286] running mknod for /dev/nvidia5 I0111 10:30:12.943415 179 nvc.c:290] running mknod for all nvcaps in /dev/nvidia-caps I0111 10:30:12.947452 179 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config I0111 10:30:12.947504 179 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor I0111 10:30:12.949693 179 nvc.c:296] loading kernel module nvidia_uvm I0111 10:30:12.949702 179 nvc.c:300] running mknod for /dev/nvidia-uvm I0111 10:30:12.949735 179 nvc.c:305] loading kernel module nvidia_modeset I0111 10:30:12.955489 179 nvc.c:309] running mknod for /dev/nvidia-modeset I0111 10:30:12.955701 183 rpc.c:71] starting driver rpc service I0111 10:30:19.067427 229 rpc.c:71] starting nvcgo rpc service I0111 10:30:19.072896 149 nvc_container.c:246] configuring container with 'compute utility supervised' I0111 10:30:19.077084 
149 nvc_container.c:88] selecting /run/kata-containers/ae1f8199611632c96d7e2ef8a5d5f51894d377259f062f6336911d02f67474d0/rootfs/usr/local/cuda-12.3/compat/libcuda.so.545.23.08 I0111 10:30:19.077456 149 nvc_container.c:88] selecting /run/kata-containers/ae1f8199611632c96d7e2ef8a5d5f51894d377259f062f6336911d02f67474d0/rootfs/usr/local/cuda-12.3/compat/libcudadebugger.so.545.23.08 I0111 10:30:19.077803 149 nvc_container.c:88] selecting /run/kata-containers/ae1f8199611632c96d7e2ef8a5d5f51894d377259f062f6336911d02f67474d0/rootfs/usr/local/cuda-12.3/compat/libnvidia-nvvm.so.545.23.08 I0111 10:30:19.078148 149 nvc_container.c:88] selecting /run/kata-containers/ae1f8199611632c96d7e2ef8a5d5f51894d377259f062f6336911d02f67474d0/rootfs/usr/local/cuda-12.3/compat/libnvidia-ptxjitcompiler.so.545.23.08 I0111 10:30:19.079973 149 nvc_container.c:268] setting pid to 147 I0111 10:30:19.080003 149 nvc_container.c:269] setting rootfs to /run/kata-containers/ae1f8199611632c96d7e2ef8a5d5f51894d377259f062f6336911d02f67474d0/rootfs I0111 10:30:19.080011 149 nvc_container.c:270] setting owner to 0:0 I0111 10:30:19.080018 149 nvc_container.c:271] setting bins directory to /usr/bin I0111 10:30:19.080038 149 nvc_container.c:272] setting libs directory to /usr/lib/x86_64-linux-gnu I0111 10:30:19.080045 149 nvc_container.c:273] setting libs32 directory to /usr/lib/i386-linux-gnu I0111 10:30:19.080052 149 nvc_container.c:274] setting cudart directory to /usr/local/cuda I0111 10:30:19.080058 149 nvc_container.c:275] setting ldconfig to @/sbin/ldconfig.real (host relative) I0111 10:30:19.080064 149 nvc_container.c:276] setting mount namespace to /proc/147/ns/mnt I0111 10:30:19.080070 149 nvc_container.c:278] detected cgroupv1 I0111 10:30:19.080077 149 nvc_container.c:279] setting devices cgroup to /sys/fs/cgroup/devices/663e4a00_6864_4e19_8a5f_15d850583969/ae1f8199611632c96d7e2ef8a5d5f51894d377259f062f6336911d02f67474d0 I0111 10:30:19.080134 149 nvc_info.c:798] requesting driver information with '' 
I0111 10:30:19.083496 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.535.146.02 I0111 10:30:19.083715 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.535.146.02 I0111 10:30:19.083826 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-pkcs11.so.535.146.02 I0111 10:30:19.083920 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-pkcs11-openssl3.so.535.146.02 I0111 10:30:19.084025 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.535.146.02 I0111 10:30:19.084159 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.535.146.02 I0111 10:30:19.084254 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.535.146.02 I0111 10:30:19.084412 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.535.146.02 I0111 10:30:19.084542 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.535.146.02 I0111 10:30:19.084657 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.535.146.02 I0111 10:30:19.084783 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.535.146.02 I0111 10:30:19.084883 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.535.146.02 I0111 10:30:19.084963 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.535.146.02 I0111 10:30:19.085071 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libcudadebugger.so.535.146.02 I0111 10:30:19.085101 149 nvc_info.c:176] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.535.146.02 W0111 10:30:19.085145 149 nvc_info.c:402] missing library libnvidia-nscq.so W0111 10:30:19.085150 149 nvc_info.c:402] missing library libnvidia-gpucomp.so W0111 10:30:19.085152 149 nvc_info.c:402] missing library libnvidia-fatbinaryloader.so W0111 10:30:19.085155 149 nvc_info.c:402] missing library libnvidia-compiler.so W0111 10:30:19.085158 149 
nvc_info.c:402] missing library libnvidia-ngx.so W0111 10:30:19.085160 149 nvc_info.c:402] missing library libnvidia-eglcore.so W0111 10:30:19.085163 149 nvc_info.c:402] missing library libnvidia-glcore.so W0111 10:30:19.085165 149 nvc_info.c:402] missing library libnvidia-tls.so W0111 10:30:19.085167 149 nvc_info.c:402] missing library libnvidia-glsi.so W0111 10:30:19.085170 149 nvc_info.c:402] missing library libnvidia-ifr.so W0111 10:30:19.085172 149 nvc_info.c:402] missing library libnvidia-rtcore.so W0111 10:30:19.085175 149 nvc_info.c:402] missing library libnvoptix.so W0111 10:30:19.085177 149 nvc_info.c:402] missing library libGLX_nvidia.so W0111 10:30:19.085180 149 nvc_info.c:402] missing library libEGL_nvidia.so W0111 10:30:19.085182 149 nvc_info.c:402] missing library libGLESv2_nvidia.so W0111 10:30:19.085184 149 nvc_info.c:402] missing library libGLESv1_CM_nvidia.so W0111 10:30:19.085187 149 nvc_info.c:402] missing library libnvidia-glvkspirv.so W0111 10:30:19.085189 149 nvc_info.c:402] missing library libnvidia-cbl.so W0111 10:30:19.085192 149 nvc_info.c:406] missing compat32 library libnvidia-ml.so W0111 10:30:19.085194 149 nvc_info.c:406] missing compat32 library libnvidia-cfg.so W0111 10:30:19.085197 149 nvc_info.c:406] missing compat32 library libnvidia-nscq.so W0111 10:30:19.085199 149 nvc_info.c:406] missing compat32 library libcuda.so W0111 10:30:19.085202 149 nvc_info.c:406] missing compat32 library libcudadebugger.so W0111 10:30:19.085204 149 nvc_info.c:406] missing compat32 library libnvidia-opencl.so W0111 10:30:19.085206 149 nvc_info.c:406] missing compat32 library libnvidia-gpucomp.so W0111 10:30:19.085209 149 nvc_info.c:406] missing compat32 library libnvidia-ptxjitcompiler.so W0111 10:30:19.085211 149 nvc_info.c:406] missing compat32 library libnvidia-fatbinaryloader.so W0111 10:30:19.085214 149 nvc_info.c:406] missing compat32 library libnvidia-allocator.so W0111 10:30:19.085216 149 nvc_info.c:406] missing compat32 library 
libnvidia-compiler.so W0111 10:30:19.085219 149 nvc_info.c:406] missing compat32 library libnvidia-pkcs11.so W0111 10:30:19.085221 149 nvc_info.c:406] missing compat32 library libnvidia-pkcs11-openssl3.so W0111 10:30:19.085230 149 nvc_info.c:406] missing compat32 library libnvidia-nvvm.so W0111 10:30:19.085233 149 nvc_info.c:406] missing compat32 library libnvidia-ngx.so W0111 10:30:19.085235 149 nvc_info.c:406] missing compat32 library libvdpau_nvidia.so W0111 10:30:19.085238 149 nvc_info.c:406] missing compat32 library libnvidia-encode.so W0111 10:30:19.085241 149 nvc_info.c:406] missing compat32 library libnvidia-opticalflow.so W0111 10:30:19.085243 149 nvc_info.c:406] missing compat32 library libnvcuvid.so W0111 10:30:19.085246 149 nvc_info.c:406] missing compat32 library libnvidia-eglcore.so W0111 10:30:19.085249 149 nvc_info.c:406] missing compat32 library libnvidia-glcore.so W0111 10:30:19.085251 149 nvc_info.c:406] missing compat32 library libnvidia-tls.so W0111 10:30:19.085254 149 nvc_info.c:406] missing compat32 library libnvidia-glsi.so W0111 10:30:19.085256 149 nvc_info.c:406] missing compat32 library libnvidia-fbc.so W0111 10:30:19.085259 149 nvc_info.c:406] missing compat32 library libnvidia-ifr.so W0111 10:30:19.085262 149 nvc_info.c:406] missing compat32 library libnvidia-rtcore.so W0111 10:30:19.085264 149 nvc_info.c:406] missing compat32 library libnvoptix.so W0111 10:30:19.085267 149 nvc_info.c:406] missing compat32 library libGLX_nvidia.so W0111 10:30:19.085270 149 nvc_info.c:406] missing compat32 library libEGL_nvidia.so W0111 10:30:19.085272 149 nvc_info.c:406] missing compat32 library libGLESv2_nvidia.so W0111 10:30:19.085275 149 nvc_info.c:406] missing compat32 library libGLESv1_CM_nvidia.so W0111 10:30:19.085277 149 nvc_info.c:406] missing compat32 library libnvidia-glvkspirv.so W0111 10:30:19.085280 149 nvc_info.c:406] missing compat32 library libnvidia-cbl.so I0111 10:30:19.085495 149 nvc_info.c:302] selecting /usr/bin/nvidia-smi I0111 
10:30:19.085511 149 nvc_info.c:302] selecting /usr/bin/nvidia-debugdump I0111 10:30:19.085525 149 nvc_info.c:302] selecting /usr/bin/nvidia-persistenced I0111 10:30:19.085549 149 nvc_info.c:302] selecting /usr/bin/nvidia-cuda-mps-control I0111 10:30:19.085563 149 nvc_info.c:302] selecting /usr/bin/nvidia-cuda-mps-server W0111 10:30:19.085591 149 nvc_info.c:428] missing binary nv-fabricmanager I0111 10:30:19.085667 149 nvc_info.c:488] listing firmware path /lib/firmware/nvidia/535.146.02/gsp_ga10x.bin I0111 10:30:19.085671 149 nvc_info.c:488] listing firmware path /lib/firmware/nvidia/535.146.02/gsp_tu10x.bin I0111 10:30:19.085688 149 nvc_info.c:561] listing device /dev/nvidiactl I0111 10:30:19.085691 149 nvc_info.c:561] listing device /dev/nvidia-uvm I0111 10:30:19.085693 149 nvc_info.c:561] listing device /dev/nvidia-uvm-tools I0111 10:30:19.085696 149 nvc_info.c:561] listing device /dev/nvidia-modeset W0111 10:30:19.085712 149 nvc_info.c:352] missing ipc path /var/run/nvidia-persistenced/socket W0111 10:30:19.085724 149 nvc_info.c:352] missing ipc path /var/run/nvidia-fabricmanager/socket W0111 10:30:19.085735 149 nvc_info.c:352] missing ipc path /tmp/nvidia-mps I0111 10:30:19.085739 149 nvc_info.c:854] requesting device information with '' I0111 10:30:19.093013 149 nvc_info.c:745] listing device /dev/nvidia0 (GPU-15dd6db0-ca52-31f5-3daf-2019882683b0 at 00000000:02:00.0) I0111 10:30:19.101035 149 nvc_info.c:745] listing device /dev/nvidia1 (GPU-2f4bb339-fc05-e25d-512c-05c7eefd99e1 at 00000000:04:00.0) I0111 10:30:19.109848 149 nvc_info.c:745] listing device /dev/nvidia2 (GPU-23e7b85e-0792-0721-9d22-ac2a0e9bac2b at 00000000:06:00.0) I0111 10:30:19.118514 149 nvc_info.c:745] listing device /dev/nvidia3 (GPU-d78a3ff1-47b6-80be-95a5-ffe77853335f at 00000000:08:00.0) I0111 10:30:19.126998 149 nvc_info.c:745] listing device /dev/nvidia4 (GPU-3dbafd65-2a6e-1445-9569-a70fa746902d at 00000000:0a:00.0) I0111 10:30:19.135642 149 nvc_info.c:745] listing device /dev/nvidia5 
(GPU-d9bc47e7-fd6d-f1dc-ff31-d9675bb73087 at 00000000:0c:00.0)

The relevant nvidia-container-cli code, in src/nvc_mount.c, is as follows:

```c
int
nvc_driver_mount(struct nvc_context *ctx, const struct nvc_container *cnt, const struct nvc_driver_info *info)
{
        const char **mnt, **ptr, **tmp;
        size_t nmnt;
        int rv = -1;

        if (validate_context(ctx) < 0)
                return (-1);
        if (validate_args(ctx, cnt != NULL && info != NULL) < 0)
                return (-1);

        if (ns_enter(&ctx->err, cnt->mnt_ns, CLONE_NEWNS) < 0)
                return (-1);

        nmnt = 2 + info->nbins + info->nlibs + cnt->nlibs + info->nlibs32 + info->nipcs + info->ndevs + info->nfirmwares;
        mnt = ptr = (const char **)array_new(&ctx->err, nmnt);
        if (mnt == NULL)
                goto fail;

        /* Procfs mount */
        if (ctx->dxcore.initialized)
                log_warn("skipping procfs mount on WSL");
        else if ((*ptr++ = mount_procfs(&ctx->err, ctx->cfg.root, cnt)) == NULL)
                goto fail;
```
After debugging, the failure was traced to the ns_enter call. In the normal cases during testing, even after calling ns_enter the filesystem paths inside the virtual machine were still visible, so the subsequent procfs mount at /run/kata-containers/ae1f8199611632c96d7e2ef8a5d5f51894d377259f062f6336911d02f67474d0/rootfs/proc/driver/nvidia succeeded. In the abnormal case with 6 GPU cards, however, calling ns_enter lands directly in a namespace where the container's rootfs is the root directory, so the path /run/kata-containers/ae1f8199611632c96d7e2ef8a5d5f51894d377259f062f6336911d02f67474d0/rootfs/proc/driver/nvidia cannot be found and the mount fails. I don't understand the reason for this, and I don't know how to solve the problem.

junqiang1992 commented 5 months ago

The nvidia-container-cli process never exited

```
root@localhost:/# ps -ef | grep con
root        82     2  0 10:30 ?  00:00:00 [ipv6_addrconf]
root       103     2  0 10:30 ?  00:00:00 [ext4-rsv-conver]
nobody     183   123  0 10:30 ?  00:00:05 /usr/bin/nvidia-container-cli --load-kmods --debug=/run/nvidia-container-toolkit.log configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=12.3 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526 brand=tesla,driver>=535,driver<536 brand=unknown,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=geforce,driver>=535,driver<536 brand=geforcertx,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=titan,driver>=535,driver<536 brand=titanrtx,driver>=535,driver<536 --pid=147 /run/kata-containers/ae1f8199611632c96d7e2ef8a5d5f51894d377259f062f6336911d02f67474d0/rootfs
```

junqiang1992 commented 5 months ago

The problem has been solved. It was caused by a timeout bug in kata-agent.