NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

T4 GPU #188

Open · dimanzt opened this issue 2 years ago

dimanzt commented 2 years ago

1. Issue or feature description

Running the example on a T4 GPU, I get the following error when I run `docker-compose up`:

ERROR: Service 'nbody' uses the IPC namespace of container 'mps-daemon' which does not exist.
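For reference, docker-compose raises this error when a service's `ipc: container:<name>` option points at a container that is not running. A minimal sketch of the kind of compose file that produces it (only the two names come from the error message; the image is an assumption for illustration):

```yaml
# Hypothetical shape of the relevant docker-compose.yml fragment.
services:
  nbody:
    image: nvidia/samples:nbody      # assumed image
    ipc: container:mps-daemon        # join the IPC namespace of an
                                     # already-running "mps-daemon" container
```

With `ipc: container:mps-daemon`, a container named `mps-daemon` must already exist before `docker-compose up` (e.g. started separately with `docker run -d --name mps-daemon ...`); otherwise compose aborts with the error above.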

2. Steps to reproduce the issue

3. Information to attach (optional if deemed irrelevant)

-- WARNING, the following logs are for debugging purposes only --

```
I0416 02:41:34.659853 1477879 nvc.c:376] initializing library context (version=1.10.0~rc.1, build=d999036d0ee2d5a5ec47ede32e3a9d8c4d7b3992)
I0416 02:41:34.659945 1477879 nvc.c:350] using root /
I0416 02:41:34.659952 1477879 nvc.c:351] using ldcache /etc/ld.so.cache
I0416 02:41:34.659979 1477879 nvc.c:352] using unprivileged user 65534:65534
I0416 02:41:34.660008 1477879 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0416 02:41:34.660135 1477879 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
I0416 02:41:34.667163 1477881 nvc.c:278] loading kernel module nvidia
I0416 02:41:34.667493 1477881 nvc.c:282] running mknod for /dev/nvidiactl
I0416 02:41:34.667547 1477881 nvc.c:286] running mknod for /dev/nvidia0
I0416 02:41:34.667576 1477881 nvc.c:290] running mknod for all nvcaps in /dev/nvidia-caps
I0416 02:41:34.677754 1477881 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
I0416 02:41:34.677920 1477881 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
I0416 02:41:34.680923 1477881 nvc.c:296] loading kernel module nvidia_uvm
I0416 02:41:34.681036 1477881 nvc.c:300] running mknod for /dev/nvidia-uvm
I0416 02:41:34.681134 1477881 nvc.c:305] loading kernel module nvidia_modeset
I0416 02:41:34.681239 1477881 nvc.c:309] running mknod for /dev/nvidia-modeset
I0416 02:41:34.681906 1477883 rpc.c:71] starting driver rpc service
I0416 02:41:34.690019 1477884 rpc.c:71] starting nvcgo rpc service
I0416 02:41:34.691238 1477879 nvc_info.c:765] requesting driver information with ''
I0416 02:41:34.693054 1477879 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.510.47.03
I0416 02:41:34.693132 1477879 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.510.47.03
I0416 02:41:34.693175 1477879 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.510.47.03
I0416 02:41:34.693222 1477879 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.510.47.03
I0416 02:41:34.693287 1477879 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.510.47.03
I0416 02:41:34.693346 1477879 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.510.47.03
I0416 02:41:34.693387 1477879 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.510.47.03
I0416 02:41:34.693431 1477879 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.510.47.03
I0416 02:41:34.693491 1477879 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.510.47.03
I0416 02:41:34.693533 1477879 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.510.47.03
I0416 02:41:34.693571 1477879 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.510.47.03
I0416 02:41:34.693610 1477879 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.510.47.03
I0416 02:41:34.693668 1477879 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.510.47.03
I0416 02:41:34.693723 1477879 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.510.47.03
I0416 02:41:34.693761 1477879 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.510.47.03
I0416 02:41:34.693802 1477879 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.510.47.03
I0416 02:41:34.693860 1477879 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.510.47.03
I0416 02:41:34.693913 1477879 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.510.47.03
I0416 02:41:34.694223 1477879 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.510.47.03
I0416 02:41:34.694397 1477879 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.510.47.03
I0416 02:41:34.694439 1477879 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.510.47.03
I0416 02:41:34.694477 1477879 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.510.47.03
I0416 02:41:34.694518 1477879 nvc_info.c:172] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.510.47.03
I0416 02:41:34.694591 1477879 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libnvidia-tls.so.510.47.03
I0416 02:41:34.694632 1477879 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libnvidia-ptxjitcompiler.so.510.47.03
I0416 02:41:34.694687 1477879 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libnvidia-opticalflow.so.510.47.03
I0416 02:41:34.694743 1477879 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libnvidia-opencl.so.510.47.03
I0416 02:41:34.694790 1477879 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libnvidia-ml.so.510.47.03
I0416 02:41:34.694842 1477879 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libnvidia-glvkspirv.so.510.47.03
I0416 02:41:34.694880 1477879 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libnvidia-glsi.so.510.47.03
I0416 02:41:34.694913 1477879 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libnvidia-glcore.so.510.47.03
I0416 02:41:34.694952 1477879 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libnvidia-fbc.so.510.47.03
I0416 02:41:34.695004 1477879 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libnvidia-encode.so.510.47.03
I0416 02:41:34.695057 1477879 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libnvidia-eglcore.so.510.47.03
I0416 02:41:34.695092 1477879 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libnvidia-compiler.so.510.47.03
I0416 02:41:34.695135 1477879 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libnvcuvid.so.510.47.03
I0416 02:41:34.695207 1477879 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libcuda.so.510.47.03
I0416 02:41:34.695273 1477879 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libGLX_nvidia.so.510.47.03
I0416 02:41:34.695310 1477879 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libGLESv2_nvidia.so.510.47.03
I0416 02:41:34.695351 1477879 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libGLESv1_CM_nvidia.so.510.47.03
I0416 02:41:34.695391 1477879 nvc_info.c:172] selecting /usr/lib/i386-linux-gnu/libEGL_nvidia.so.510.47.03
W0416 02:41:34.695415 1477879 nvc_info.c:398] missing library libnvidia-nscq.so
W0416 02:41:34.695423 1477879 nvc_info.c:398] missing library libnvidia-fatbinaryloader.so
W0416 02:41:34.695436 1477879 nvc_info.c:398] missing library libnvidia-pkcs11.so
W0416 02:41:34.695452 1477879 nvc_info.c:398] missing library libvdpau_nvidia.so
W0416 02:41:34.695460 1477879 nvc_info.c:398] missing library libnvidia-ifr.so
W0416 02:41:34.695467 1477879 nvc_info.c:398] missing library libnvidia-cbl.so
W0416 02:41:34.695485 1477879 nvc_info.c:402] missing compat32 library libnvidia-cfg.so
W0416 02:41:34.695494 1477879 nvc_info.c:402] missing compat32 library libnvidia-nscq.so
W0416 02:41:34.695507 1477879 nvc_info.c:402] missing compat32 library libnvidia-fatbinaryloader.so
W0416 02:41:34.695522 1477879 nvc_info.c:402] missing compat32 library libnvidia-allocator.so
W0416 02:41:34.695537 1477879 nvc_info.c:402] missing compat32 library libnvidia-pkcs11.so
W0416 02:41:34.695547 1477879 nvc_info.c:402] missing compat32 library libnvidia-ngx.so
W0416 02:41:34.695556 1477879 nvc_info.c:402] missing compat32 library libvdpau_nvidia.so
W0416 02:41:34.695570 1477879 nvc_info.c:402] missing compat32 library libnvidia-ifr.so
W0416 02:41:34.695585 1477879 nvc_info.c:402] missing compat32 library libnvidia-rtcore.so
W0416 02:41:34.695596 1477879 nvc_info.c:402] missing compat32 library libnvoptix.so
W0416 02:41:34.695608 1477879 nvc_info.c:402] missing compat32 library libnvidia-cbl.so
I0416 02:41:34.695893 1477879 nvc_info.c:298] selecting /usr/bin/nvidia-smi
I0416 02:41:34.695917 1477879 nvc_info.c:298] selecting /usr/bin/nvidia-debugdump
I0416 02:41:34.695940 1477879 nvc_info.c:298] selecting /usr/bin/nvidia-persistenced
I0416 02:41:34.695975 1477879 nvc_info.c:298] selecting /usr/bin/nvidia-cuda-mps-control
I0416 02:41:34.695996 1477879 nvc_info.c:298] selecting /usr/bin/nvidia-cuda-mps-server
W0416 02:41:34.696157 1477879 nvc_info.c:424] missing binary nv-fabricmanager
I0416 02:41:34.696196 1477879 nvc_info.c:342] listing firmware path /usr/lib/firmware/nvidia/510.47.03/gsp.bin
I0416 02:41:34.696228 1477879 nvc_info.c:528] listing device /dev/nvidiactl
I0416 02:41:34.696235 1477879 nvc_info.c:528] listing device /dev/nvidia-uvm
I0416 02:41:34.696249 1477879 nvc_info.c:528] listing device /dev/nvidia-uvm-tools
I0416 02:41:34.696264 1477879 nvc_info.c:528] listing device /dev/nvidia-modeset
I0416 02:41:34.696306 1477879 nvc_info.c:342] listing ipc path /run/nvidia-persistenced/socket
W0416 02:41:34.696331 1477879 nvc_info.c:348] missing ipc path /var/run/nvidia-fabricmanager/socket
W0416 02:41:34.696352 1477879 nvc_info.c:348] missing ipc path /tmp/nvidia-mps
I0416 02:41:34.696361 1477879 nvc_info.c:821] requesting device information with ''
I0416 02:41:34.705561 1477879 nvc_info.c:712] listing device /dev/nvidia0 (GPU-39cd9454-0976-646a-acef-01a35ad02034 at 00000000:83:00.0)
NVRM version:   510.47.03
CUDA version:   11.6

Device Index:   0
Device Minor:   0
Model:          Tesla T4
Brand:          Nvidia
GPU UUID:       GPU-39cd9454-0976-646a-acef-01a35ad02034
Bus Location:   00000000:83:00.0
Architecture:   7.5
I0416 02:41:34.705671 1477879 nvc.c:434] shutting down library context
I0416 02:41:34.705743 1477884 rpc.c:95] terminating nvcgo rpc service
I0416 02:41:34.706699 1477879 rpc.c:135] nvcgo rpc service terminated successfully
I0416 02:41:34.710957 1477883 rpc.c:95] terminating driver rpc service
I0416 02:41:34.711263 1477879 rpc.c:135] driver rpc service terminated successfully
```

```
==============NVSMI LOG==============

Timestamp : Fri Apr 15 19:43:26 2022
Driver Version : 510.47.03
CUDA Version : 11.6

Attached GPUs : 1
GPU 00000000:83:00.0
    Product Name : Tesla T4
    Product Brand : NVIDIA
    Product Architecture : Turing
    Display Mode : Enabled
    Display Active : Disabled
    Persistence Mode : Enabled
    MIG Mode
        Current : N/A
        Pending : N/A
    Accounting Mode : Disabled
    Accounting Mode Buffer Size : 4000
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : 1323321072957
    GPU UUID : GPU-39cd9454-0976-646a-acef-01a35ad02034
    Minor Number : 0
    VBIOS Version : 90.04.96.00.9F
    MultiGPU Board : No
    Board ID : 0x8300
    GPU Part Number : 900-2G183-0300-000
    Module ID : 0
    Inforom Version
        Image Version : G183.0200.00.02
        OEM Object : 1.1
        ECC Object : 5.0
        Power Management Object : N/A
    GPU Operation Mode
        Current : N/A
        Pending : N/A
    GSP Firmware Version : 510.47.03
    GPU Virtualization Mode
        Virtualization Mode : None
        Host VGPU Mode : N/A
    IBMNPU
        Relaxed Ordering Mode : N/A
    PCI
        Bus : 0x83
        Device : 0x00
        Domain : 0x0000
        Device Id : 0x1EB810DE
        Bus Id : 00000000:83:00.0
        Sub System Id : 0x12A210DE
        GPU Link Info
            PCIe Generation
                Max : 3
                Current : 1
            Link Width
                Max : 16x
                Current : 16x
        Bridge Chip
            Type : N/A
            Firmware : N/A
        Replays Since Reset : 0
        Replay Number Rollovers : 0
        Tx Throughput : 0 KB/s
        Rx Throughput : 0 KB/s
    Fan Speed : N/A
    Performance State : P8
    Clocks Throttle Reasons
        Idle : Active
        Applications Clocks Setting : Not Active
        SW Power Cap : Not Active
        HW Slowdown : Not Active
        HW Thermal Slowdown : Not Active
        HW Power Brake Slowdown : Not Active
        Sync Boost : Not Active
        SW Thermal Slowdown : Not Active
        Display Clock Setting : Not Active
    FB Memory Usage
        Total : 15360 MiB
        Reserved : 449 MiB
        Used : 14 MiB
        Free : 14895 MiB
    BAR1 Memory Usage
        Total : 256 MiB
        Used : 2 MiB
        Free : 254 MiB
    Compute Mode : Exclusive_Process
    Utilization
        Gpu : 0 %
        Memory : 0 %
        Encoder : 0 %
        Decoder : 0 %
    Encoder Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    FBC Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    Ecc Mode
        Current : Enabled
        Pending : Enabled
    ECC Errors
        Volatile
            SRAM Correctable : 0
            SRAM Uncorrectable : 0
            DRAM Correctable : 0
            DRAM Uncorrectable : 0
        Aggregate
            SRAM Correctable : 0
            SRAM Uncorrectable : 0
            DRAM Correctable : 0
            DRAM Uncorrectable : 0
    Retired Pages
        Single Bit ECC : 0
        Double Bit ECC : 0
        Pending Page Blacklist : No
    Remapped Rows : N/A
    Temperature
        GPU Current Temp : 39 C
        GPU Shutdown Temp : 96 C
        GPU Slowdown Temp : 93 C
        GPU Max Operating Temp : 85 C
        GPU Target Temperature : N/A
        Memory Current Temp : N/A
        Memory Max Operating Temp : N/A
    Power Readings
        Power Management : Supported
        Power Draw : 16.83 W
        Power Limit : 70.00 W
        Default Power Limit : 70.00 W
        Enforced Power Limit : 70.00 W
        Min Power Limit : 60.00 W
        Max Power Limit : 70.00 W
    Clocks
        Graphics : 300 MHz
        SM : 300 MHz
        Memory : 405 MHz
        Video : 540 MHz
    Applications Clocks
        Graphics : 585 MHz
        Memory : 5001 MHz
    Default Applications Clocks
        Graphics : 585 MHz
        Memory : 5001 MHz
    Max Clocks
        Graphics : 1590 MHz
        SM : 1590 MHz
        Memory : 5001 MHz
        Video : 1470 MHz
    Max Customer Boost Clocks
        Graphics : 1590 MHz
    Clock Policy
        Auto Boost : N/A
        Auto Boost Default : N/A
    Voltage
        Graphics : N/A
    Processes
        GPU instance ID : N/A
        Compute instance ID : N/A
        Process ID : 1883
            Type : G
            Name : /usr/lib/xorg/Xorg
            Used GPU Memory : 14 MiB
```

```
Server: Docker Engine - Community
 Engine:
  Version:     20.10.13
  API version: 1.41 (minimum version 1.12)
  Go version:  go1.16.15
  Git commit:  906f57f
  Built:       Thu Mar 10 14:05:44 2022
  OS/Arch:     linux/amd64
  Experimental: false
 containerd:
  Version:     1.5.10
  GitCommit:   2a1d4dbdb2a1030dc5b01e96fb110a9d9f150ecc
 nvidia:
  Version:     1.0.3
  GitCommit:   v1.0.3-0-gf46b6ba
 docker-init:
  Version:     0.19.0
  GitCommit:   de40ad0
```

ERROR: Service 'nbody' uses the IPC namespace of container 'mps-daemon' which does not exist.

klueska commented 2 years ago

Whatever container you are running apparently expects to be running with MPS, but you don't seem to have the MPS server running on your host (your logs confirm this: `missing ipc path /tmp/nvidia-mps`). If you did, nvidia-docker would inject the MPS socket into the container for you.
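To illustrate the point: the MPS control daemon is started on the host with `nvidia-cuda-mps-control`, which creates the pipe directory (`/tmp/nvidia-mps` by default) that the toolkit looks for. A minimal sketch, assuming GPU 0 and the default pipe directory:

```sh
# Start the MPS control daemon on the host. Exclusive_Process compute mode
# (as shown in the nvidia-smi output above) is the mode typically used with MPS.
export CUDA_VISIBLE_DEVICES=0        # assumed: expose only GPU 0 to MPS
nvidia-cuda-mps-control -d           # -d = run as a background daemon

# Verify it is up, then stop it when done:
echo get_server_list | nvidia-cuda-mps-control
echo quit | nvidia-cuda-mps-control
```

Once the daemon is running, `/tmp/nvidia-mps` exists and the MPS socket can be injected into containers.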

dimanzt commented 2 years ago

Thanks @klueska. Is there a way to see how many vGPUs have been assigned to the container?

klueska commented 2 years ago

What do you mean by vGPU in this context? Are you running on top of NVIDIA's vGPU driver? Or are you just running on a standard data-center GPU driver and want to see how many GPUs have been assigned to the container?

In any case, I'm not familiar with docker-compose (we don't test against it, so it may or may not work out of the box), but in general, if you have a way of exec'ing into your running container, you can always run `ls -la /dev` to see which GPU devices have been injected.
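A concrete sketch of that suggestion, assuming the compose service is named `nbody` as in this thread:

```sh
# List the NVIDIA device nodes injected into the running container;
# each /dev/nvidiaN node corresponds to one GPU assigned to it.
docker-compose exec nbody ls -la /dev | grep -i nvidia

# If the image ships nvidia-smi, this lists the visible GPUs by UUID:
docker-compose exec nbody nvidia-smi -L
```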