NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0
2.25k stars 245 forks

x.cuda() not able to access cuda and hangs inside docker #149

Open prashkmr opened 1 year ago

prashkmr commented 1 year ago

2. Steps to reproduce the issue

1) sudo docker run --gpus 0 -it --rm nvcr.io/nvidia/pytorch:19.06-py3
2) Inside the container, open the Python terminal and run this code:

    import torch
    import numpy as np
    x = torch.from_numpy(np.array([1, 2]))
    y = x.cuda()

The terminal hangs and does not proceed.
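A quick check before calling x.cuda() can show whether PyTorch inside the container sees any GPU at all. The following is a minimal sketch and not part of the original report; the helper name cuda_report is hypothetical, and it relies only on torch.cuda.is_available() and torch.cuda.device_count().

```python
def cuda_report():
    """Summarize GPU visibility without moving any tensor to the device."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch_installed": False, "cuda_available": False, "device_count": 0}

    available = torch.cuda.is_available()
    # Only ask for the device count when CUDA reports itself as usable.
    count = torch.cuda.device_count() if available else 0
    return {"torch_installed": True, "cuda_available": available, "device_count": count}


if __name__ == "__main__":
    print(cuda_report())
```

If cuda_available comes back False or device_count is 0, the container was probably started without a GPU attached, which is worth ruling out before investigating the hang itself.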

I1211 16:03:56.895535 862955 nvc.c:376] initializing library context (version=1.11.0, build=c8f267be0bac1c654d59ad4ea5df907141149977)
I1211 16:03:56.895630 862955 nvc.c:350] using root /
I1211 16:03:56.895650 862955 nvc.c:351] using ldcache /etc/ld.so.cache
I1211 16:03:56.895668 862955 nvc.c:352] using unprivileged user 1013:1014
I1211 16:03:56.895712 862955 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I1211 16:03:56.895983 862955 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W1211 16:03:56.898601 862956 nvc.c:273] failed to set inheritable capabilities
W1211 16:03:56.898686 862956 nvc.c:274] skipping kernel modules load due to failure
I1211 16:03:56.899197 862957 rpc.c:71] starting driver rpc service
I1211 16:03:56.914387 862958 rpc.c:71] starting nvcgo rpc service
I1211 16:03:56.916258 862955 nvc_info.c:766] requesting driver information with ''
I1211 16:03:56.918835 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.525.60.13
I1211 16:03:56.918925 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.525.60.13
I1211 16:03:56.918982 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.525.60.13
I1211 16:03:56.919039 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.525.60.13
I1211 16:03:56.919135 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.525.60.13
I1211 16:03:56.919260 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.525.60.13
I1211 16:03:56.919321 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.525.60.13
I1211 16:03:56.919375 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.525.60.13
I1211 16:03:56.919451 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.525.60.13
I1211 16:03:56.919502 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.525.60.13
I1211 16:03:56.919551 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.525.60.13
I1211 16:03:56.919603 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.525.60.13
I1211 16:03:56.919716 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.525.60.13
I1211 16:03:56.919816 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.525.60.13
I1211 16:03:56.919887 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.525.60.13
I1211 16:03:56.919959 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.525.60.13
I1211 16:03:56.920067 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.525.60.13
I1211 16:03:56.920176 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.525.60.13
I1211 16:03:56.920761 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcudadebugger.so.525.60.13
I1211 16:03:56.920835 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.525.60.13
I1211 16:03:56.921120 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.525.60.13
I1211 16:03:56.921206 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.525.60.13
I1211 16:03:56.921283 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.525.60.13
I1211 16:03:56.921369 862955 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.525.60.13
I1211 16:03:56.921484 862955 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ptxjitcompiler.so.525.60.13
I1211 16:03:56.921589 862955 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-opticalflow.so.525.60.13
I1211 16:03:56.921696 862955 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-opencl.so.525.60.13
I1211 16:03:56.921769 862955 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ml.so.525.60.13
I1211 16:03:56.921873 862955 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-encode.so.525.60.13
I1211 16:03:56.921978 862955 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-compiler.so.525.60.13
I1211 16:03:56.922051 862955 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvcuvid.so.525.60.13
I1211 16:03:56.922209 862955 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libcuda.so.525.60.13
W1211 16:03:56.922309 862955 nvc_info.c:399] missing library libnvidia-nscq.so
W1211 16:03:56.922323 862955 nvc_info.c:399] missing library libnvidia-fatbinaryloader.so
W1211 16:03:56.922338 862955 nvc_info.c:399] missing library libnvidia-pkcs11.so
W1211 16:03:56.922353 862955 nvc_info.c:399] missing library libvdpau_nvidia.so
W1211 16:03:56.922368 862955 nvc_info.c:399] missing library libnvidia-ifr.so
W1211 16:03:56.922382 862955 nvc_info.c:399] missing library libnvidia-cbl.so
W1211 16:03:56.922397 862955 nvc_info.c:403] missing compat32 library libnvidia-cfg.so
W1211 16:03:56.922412 862955 nvc_info.c:403] missing compat32 library libnvidia-nscq.so
W1211 16:03:56.922427 862955 nvc_info.c:403] missing compat32 library libcudadebugger.so
W1211 16:03:56.922441 862955 nvc_info.c:403] missing compat32 library libnvidia-fatbinaryloader.so
W1211 16:03:56.922454 862955 nvc_info.c:403] missing compat32 library libnvidia-allocator.so
W1211 16:03:56.922468 862955 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so
W1211 16:03:56.922483 862955 nvc_info.c:403] missing compat32 library libnvidia-ngx.so
W1211 16:03:56.922495 862955 nvc_info.c:403] missing compat32 library libvdpau_nvidia.so
W1211 16:03:56.922507 862955 nvc_info.c:403] missing compat32 library libnvidia-eglcore.so
W1211 16:03:56.922522 862955 nvc_info.c:403] missing compat32 library libnvidia-glcore.so
W1211 16:03:56.922534 862955 nvc_info.c:403] missing compat32 library libnvidia-tls.so
W1211 16:03:56.922547 862955 nvc_info.c:403] missing compat32 library libnvidia-glsi.so
W1211 16:03:56.922562 862955 nvc_info.c:403] missing compat32 library libnvidia-fbc.so
W1211 16:03:56.922576 862955 nvc_info.c:403] missing compat32 library libnvidia-ifr.so
W1211 16:03:56.922590 862955 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so
W1211 16:03:56.922605 862955 nvc_info.c:403] missing compat32 library libnvoptix.so
W1211 16:03:56.922620 862955 nvc_info.c:403] missing compat32 library libGLX_nvidia.so
W1211 16:03:56.922635 862955 nvc_info.c:403] missing compat32 library libEGL_nvidia.so
W1211 16:03:56.922649 862955 nvc_info.c:403] missing compat32 library libGLESv2_nvidia.so
W1211 16:03:56.922663 862955 nvc_info.c:403] missing compat32 library libGLESv1_CM_nvidia.so
W1211 16:03:56.922676 862955 nvc_info.c:403] missing compat32 library libnvidia-glvkspirv.so
W1211 16:03:56.922691 862955 nvc_info.c:403] missing compat32 library libnvidia-cbl.so
I1211 16:03:56.924411 862955 nvc_info.c:299] selecting /usr/bin/nvidia-smi
I1211 16:03:56.924457 862955 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump
I1211 16:03:56.924495 862955 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced
I1211 16:03:56.924559 862955 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control
I1211 16:03:56.924600 862955 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-server
W1211 16:03:56.924729 862955 nvc_info.c:425] missing binary nv-fabricmanager
W1211 16:03:56.924793 862955 nvc_info.c:349] missing firmware path /lib/firmware/nvidia/525.60.13/gsp.bin
I1211 16:03:56.924846 862955 nvc_info.c:529] listing device /dev/nvidiactl
I1211 16:03:56.924862 862955 nvc_info.c:529] listing device /dev/nvidia-uvm
I1211 16:03:56.924877 862955 nvc_info.c:529] listing device /dev/nvidia-uvm-tools
I1211 16:03:56.924889 862955 nvc_info.c:529] listing device /dev/nvidia-modeset
I1211 16:03:56.924944 862955 nvc_info.c:343] listing ipc path /run/nvidia-persistenced/socket
W1211 16:03:56.924993 862955 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket
W1211 16:03:56.925026 862955 nvc_info.c:349] missing ipc path /tmp/nvidia-mps
I1211 16:03:56.925042 862955 nvc_info.c:822] requesting device information with ''
I1211 16:03:56.931600 862955 nvc_info.c:713] listing device /dev/nvidia0 (GPU-15132fc8-6058-2032-32e5-8d742940df99 at 00000000:03:00.0)
NVRM version: 525.60.13
CUDA version: 12.0

Device Index: 0
Device Minor: 0
Model: NVIDIA GeForce RTX 3090
Brand: GeForce
GPU UUID: GPU-15132fc8-6058-2032-32e5-8d742940df99
Bus Location: 00000000:03:00.0
Architecture: 8.6
I1211 16:03:56.931670 862955 nvc.c:434] shutting down library context
I1211 16:03:56.931753 862958 rpc.c:95] terminating nvcgo rpc service
I1211 16:03:56.932462 862955 rpc.c:135] nvcgo rpc service terminated successfully
I1211 16:03:56.935845 862957 rpc.c:95] terminating driver rpc service
I1211 16:03:56.936030 862955 rpc.c:135] driver rpc service terminated successfully

Timestamp : Sun Dec 11 21:35:37 2022
Driver Version : 525.60.13
CUDA Version : 12.0

Attached GPUs : 1
GPU 00000000:03:00.0
    Product Name : NVIDIA GeForce RTX 3090
    Product Brand : GeForce
    Product Architecture : Ampere
    Display Mode : Enabled
    Display Active : Disabled
    Persistence Mode : Enabled
    MIG Mode
        Current : N/A
        Pending : N/A
    Accounting Mode : Disabled
    Accounting Mode Buffer Size : 4000
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : N/A
    GPU UUID : GPU-15132fc8-6058-2032-32e5-8d742940df99
    Minor Number : 0
    VBIOS Version : 94.02.42.40.72
    MultiGPU Board : No
    Board ID : 0x300
    Board Part Number : N/A
    GPU Part Number : 2204-300-A1
    Module ID : 0
    Inforom Version
        Image Version : G001.0000.03.03
        OEM Object : 2.0
        ECC Object : N/A
        Power Management Object : N/A
    GPU Operation Mode
        Current : N/A
        Pending : N/A
    GSP Firmware Version : N/A
    GPU Virtualization Mode
        Virtualization Mode : None
        Host VGPU Mode : N/A
    IBMNPU
        Relaxed Ordering Mode : N/A
    PCI
        Bus : 0x03
        Device : 0x00
        Domain : 0x0000
        Device Id : 0x220410DE
        Bus Id : 00000000:03:00.0
        Sub System Id : 0x145410DE
        GPU Link Info
            PCIe Generation
                Max : 3
                Current : 1
                Device Current : 1
                Device Max : 4
                Host Max : 3
            Link Width
                Max : 16x
                Current : 16x
        Bridge Chip
            Type : N/A
            Firmware : N/A
        Replays Since Reset : 0
        Replay Number Rollovers : 0
        Tx Throughput : 0 KB/s
        Rx Throughput : 0 KB/s
        Atomic Caps Inbound : N/A
        Atomic Caps Outbound : N/A
    Fan Speed : 0 %
    Performance State : P8
    Clocks Throttle Reasons
        Idle : Active
        Applications Clocks Setting : Not Active
        SW Power Cap : Not Active
        HW Slowdown : Not Active
        HW Thermal Slowdown : Not Active
        HW Power Brake Slowdown : Not Active
        Sync Boost : Not Active
        SW Thermal Slowdown : Not Active
        Display Clock Setting : Not Active
    FB Memory Usage
        Total : 24576 MiB
        Reserved : 317 MiB
        Used : 553 MiB
        Free : 23705 MiB
    BAR1 Memory Usage
        Total : 256 MiB
        Used : 7 MiB
        Free : 249 MiB
    Compute Mode : Default
    Utilization
        Gpu : 0 %
        Memory : 0 %
        Encoder : 0 %
        Decoder : 0 %
    Encoder Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    FBC Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    Ecc Mode
        Current : N/A
        Pending : N/A
    ECC Errors
        Volatile
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A
        Aggregate
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A
    Retired Pages
        Single Bit ECC : N/A
        Double Bit ECC : N/A
        Pending Page Blacklist : N/A
    Remapped Rows : N/A
    Temperature
        GPU Current Temp : 51 C
        GPU Shutdown Temp : 98 C
        GPU Slowdown Temp : 95 C
        GPU Max Operating Temp : 93 C
        GPU Target Temperature : 83 C
        Memory Current Temp : N/A
        Memory Max Operating Temp : N/A
    Power Readings
        Power Management : Supported
        Power Draw : 23.19 W
        Power Limit : 370.00 W
        Default Power Limit : 370.00 W
        Enforced Power Limit : 370.00 W
        Min Power Limit : 100.00 W
        Max Power Limit : 370.00 W
    Clocks
        Graphics : 0 MHz
        SM : 0 MHz
        Memory : 405 MHz
        Video : 555 MHz
    Applications Clocks
        Graphics : N/A
        Memory : N/A
    Default Applications Clocks
        Graphics : N/A
        Memory : N/A
    Deferred Clocks
        Memory : N/A
    Max Clocks
        Graphics : 2115 MHz
        SM : 2115 MHz
        Memory : 9751 MHz
        Video : 1950 MHz
    Max Customer Boost Clocks
        Graphics : N/A
    Clock Policy
        Auto Boost : N/A
        Auto Boost Default : N/A
    Voltage
        Graphics : 0.000 mV
    Fabric
        State : N/A
        Status : N/A
    Processes
        GPU instance ID : N/A
        Compute instance ID : N/A
        Process ID : 2052
        Type : G
        Name : /usr/lib/xorg/Xorg
        Used GPU Memory : 55 MiB
        GPU instance ID : N/A
        Compute instance ID : N/A
        Process ID : 2252
        Type : G
        Name : /usr/bin/gnome-shell
        Used GPU Memory : 12 MiB
        GPU instance ID : N/A
        Compute instance ID : N/A
        Process ID : 79713
        Type : G
        Name : /usr/lib/xorg/Xorg
        Used GPU Memory : 53 MiB
        GPU instance ID : N/A
        Compute instance ID : N/A
        Process ID : 105354
        Type : G
        Name : /usr/lib/firefox/firefox
        Used GPU Memory : 296 MiB
        GPU instance ID : N/A
        Compute instance ID : N/A
        Process ID : 150301
        Type : G
        Name : /usr/share/code/code --type=gpu-process --disable-color-correct-rendering --enable-crashpad --crashpad-handler-pid=150289 --enable-crash-reporter=220d09c7-ace4-4d8a-a019-f5551f55bb4e,no_channel --user-data-dir=/home/vishesh/.config/Code --gpu-preferences=UAAAAAAAAAAgAAAIAAAAAAAAAAAAAAAAAABgAAAAAAAwAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAQAAABgAAAAAAAAAGAAAAAAAAAAIAAAAAAAAAAgAAAAAAAAACAAAAAAAAAA= --shared-files --field-trial-handle=0,5632948566441798013,1364472480275554198,131072 --disable-features=PlzServiceWorker,SpareRendererForSitePerProcess
        Used GPU Memory : 27 MiB

Server: Docker Engine - Community
 Engine:
  Version: 20.10.21
  API version: 1.41 (minimum version 1.12)
  Go version: go1.18.7
  Git commit: 3056208
  Built: Tue Oct 25 18:00:04 2022
  OS/Arch: linux/amd64
  Experimental: false
 containerd:
  Version: 1.6.12
  GitCommit: a05d175400b1145e5e6a735a6710579d181e7fb0
 runc:
  Version: 1.1.4
  GitCommit: v1.1.4-0-g5fd4c4d
 docker-init:
  Version: 0.19.0
  GitCommit: de40ad0

elezar commented 10 months ago

Note that on the docker command line --gpus 0 indicates a count and not the device index. Could you repeat your experiment with --gpus 1 or --gpus all instead?
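To make the count-versus-index distinction concrete, here is a small Python sketch of how a --gpus value can be read. It is a simplified model written for illustration, not Docker's actual parsing code; the function name interpret_gpus_value and the returned dictionary layout are hypothetical. It assumes that a bare first field is treated as a count (with all meaning every GPU) and that specific devices are named with a device= key.

```python
import csv


def interpret_gpus_value(value):
    """Return a simplified view of what a `docker run --gpus VALUE` string requests."""
    request = {"count": None, "device_ids": []}
    # The value is a comma-separated list; quoting keeps "device=0,1" as one field.
    fields = next(csv.reader([value]))
    for position, field in enumerate(fields):
        key, sep, val = field.partition("=")
        if not sep:
            if position != 0:
                raise ValueError(f"unexpected bare field: {field!r}")
            # A bare first field is a COUNT of GPUs, not a device index.
            request["count"] = -1 if key == "all" else int(key)
        elif key == "count":
            request["count"] = -1 if val == "all" else int(val)
        elif key == "device":
            # Device indices or UUIDs are only selected through this key.
            request["device_ids"] = val.split(",")
        # Other keys (for example capabilities) are ignored in this sketch.
    return request


if __name__ == "__main__":
    for example in ["0", "1", "all", "device=0", '"device=0,1"']:
        print(f"--gpus {example}: {interpret_gpus_value(example)}")
```

Under this model, --gpus 0 asks for zero GPUs, while --gpus 1 and --gpus all ask for one GPU and every GPU respectively. Selecting the GPU at index 0 would instead be written as --gpus device=0.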