boyang9602 opened this issue 2 years ago
It's a strange bug: the GPU is available despite the error message, and it's fixed in later images (don't mind the nvidia-smi and driver versions; the behavior is the same with 510.06):
➜ docker run --gpus all --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tensorflow:22.01-tf2-py3 nvidia-smi
================
== TensorFlow ==
================
NVIDIA Release 22.01-tf2 (build 31081301)
TensorFlow Version 2.7.0
Container image Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright 2017-2022 The TensorFlow Authors. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
NOTE: MOFED driver for multi-node communication was not detected.
Multi-node communication performance may be reduced.
Fri Feb 4 11:22:39 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.39.01 Driver Version: 511.23 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 Off | N/A |
| N/A 56C P8 4W / N/A | 312MiB / 6144MiB | 15% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Compared to 20.03:
➜ docker run --gpus all --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tensorflow:20.03-tf2-py3 nvidia-smi
================
== TensorFlow ==
================
NVIDIA Release 20.03-tf2 (build 11026100)
TensorFlow Version 2.1.0
Container image Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
Copyright 2017-2019 The TensorFlow Authors. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.
WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Use 'nvidia-docker run' to start this container; see
https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker .
NOTE: MOFED driver for multi-node communication was not detected.
Multi-node communication performance may be reduced.
Fri Feb 4 11:23:46 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.39.01 Driver Version: 511.23 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... On | 00000000:01:00.0 Off | N/A |
| N/A 53C P8 3W / N/A | 316MiB / 6144MiB | 4% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
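To double-check that the GPU really is usable despite that warning, a minimal check (assuming the Python and TensorFlow bundled in the image) is:
docker run --gpus all --rm nvcr.io/nvidia/tensorflow:20.03-tf2-py3 python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
If that prints a non-empty list, TensorFlow can see the GPU even though the banner complains.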
Btw, try updating your WSL kernel; 4.19 is pretty old.
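On recent Windows builds that's done from an elevated PowerShell on the host (a sketch; older setups instead use the WSL2 kernel update package from Microsoft):
wsl --update
wsl --shutdown
Afterwards, uname -r inside the distro should report the newer kernel.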
I have the same issue with exactly the same setup, except that I'm on the newest kernel version. I tried modprobe nvidia and got:
modprobe: FATAL: Module nvidia not found in directory /lib/modules/5.10.60.1-microsoft-standard-WSL2
The GPU is detected and theoretically runs in the container, but it only reserves GPU memory. GPU utilization stays at 0%, which means my CPU can do the calculations faster. Has anyone found a solution yet?
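A rough way to check whether compute work actually lands on the GPU (a sketch to run inside the container; the matrix size is arbitrary):
python -c "import time, tensorflow as tf; a = tf.random.normal([4096, 4096]); b = tf.random.normal([4096, 4096]); t0 = time.time(); _ = tf.matmul(a, b).numpy(); print('matmul took', time.time() - t0, 's')"
Note that the nvidia-smi -a output further down in this thread reports Utilization Gpu : N/A under the WDDM driver model, so a 0% reading inside WSL 2 is not necessarily meaningful; timing a workload is more telling.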
I have exactly the same issue on WSL 2.
I've solved my issue by using the newest NVIDIA Docker container image. For some reason the GPU is fully utilized now.
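For anyone else landing here: "newest" just means pulling a recent NGC tag instead of 20.03, e.g. the 22.01 image shown earlier in this thread:
docker run --gpus all --rm nvcr.io/nvidia/tensorflow:22.01-tf2-py3 nvidia-smi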
1. Issue or feature description
I'm trying to use NVIDIA Docker on WSL 2. I installed the driver on the host and followed this guide to install nvidia-docker2.
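For context, the steps in that kind of guide boil down to roughly the following (a sketch, not the exact guide contents; the repository list line varies by distribution):
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-docker2
sudo service docker restart
The service restart is the step that tends to be forgotten; on WSL 2, systemctl is usually unavailable, hence service instead.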
When I tried
docker run --gpus all -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tensorflow:20.03-tf2-py3
the output was not as expected. When I tried
nvidia-docker run --gpus all -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/tensorflow:20.03-tf2-py3
the output was the same. I also tried
modprobe nvidia
and the output was
modprobe: FATAL: Module nvidia not found in directory /lib/modules/4.19.128-microsoft-standard
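Side note: the modprobe failure is expected on WSL 2. There is no loadable nvidia module in the WSL kernel; GPU access is paravirtualized through /dev/dxg and the Windows driver, which is why the log below goes through dxcore. A quick sanity check:
ls -l /dev/dxg
If that device node exists, the GPU plumbing is in place even though modprobe nvidia fails.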
3. Information to attach (optional if deemed irrelevant)
nvidia-container-cli -k -d /dev/tty info
-- WARNING, the following logs are for debugging purposes only --
I0203 18:12:33.732533 1637 nvc.c:376] initializing library context (version=1.8.0~rc.2, build=d48f9b0d505fca0aff7c88cee790f9c56aa1b851)
I0203 18:12:33.732591 1637 nvc.c:350] using root /
I0203 18:12:33.732597 1637 nvc.c:351] using ldcache /etc/ld.so.cache
I0203 18:12:33.732600 1637 nvc.c:352] using unprivileged user 1000:1000
I0203 18:12:33.732620 1637 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0203 18:12:33.749801 1637 dxcore.c:227] Creating a new WDDM Adapter for hAdapter:40000000 luid:356cc6
I0203 18:12:33.756366 1637 dxcore.c:210] Core Nvidia component libcuda.so.1.1 not found in /usr/lib/wsl/drivers/iigd_dch.inf_amd64_4be767c332df1d04
I0203 18:12:33.756975 1637 dxcore.c:210] Core Nvidia component libcuda_loader.so not found in /usr/lib/wsl/drivers/iigd_dch.inf_amd64_4be767c332df1d04
I0203 18:12:33.757599 1637 dxcore.c:210] Core Nvidia component libnvidia-ptxjitcompiler.so.1 not found in /usr/lib/wsl/drivers/iigd_dch.inf_amd64_4be767c332df1d04
I0203 18:12:33.758180 1637 dxcore.c:210] Core Nvidia component libnvidia-ml.so.1 not found in /usr/lib/wsl/drivers/iigd_dch.inf_amd64_4be767c332df1d04
I0203 18:12:33.758826 1637 dxcore.c:210] Core Nvidia component libnvidia-ml_loader.so not found in /usr/lib/wsl/drivers/iigd_dch.inf_amd64_4be767c332df1d04
I0203 18:12:33.759386 1637 dxcore.c:210] Core Nvidia component nvidia-smi not found in /usr/lib/wsl/drivers/iigd_dch.inf_amd64_4be767c332df1d04
I0203 18:12:33.759410 1637 dxcore.c:215] No Nvidia component found in /usr/lib/wsl/drivers/iigd_dch.inf_amd64_4be767c332df1d04
E0203 18:12:33.759431 1637 dxcore.c:261] Failed to query the core Nvidia libraries for the adapter. Skipping it.
I0203 18:12:33.759451 1637 dxcore.c:227] Creating a new WDDM Adapter for hAdapter:40000040 luid:356dcf
I0203 18:12:33.765143 1637 dxcore.c:268] Adding new adapter via dxcore hAdapter:40000040 luid:356dcf wddm version:3000
I0203 18:12:33.765181 1637 dxcore.c:326] dxcore layer initialized successfully
W0203 18:12:33.765546 1637 nvc.c:401] skipping kernel modules load on WSL
I0203 18:12:33.765686 1638 rpc.c:71] starting driver rpc service
I0203 18:12:33.812286 1639 rpc.c:71] starting nvcgo rpc service
I0203 18:12:33.817953 1637 nvc_info.c:759] requesting driver information with ''
I0203 18:12:33.904704 1637 nvc_info.c:198] selecting /usr/lib/wsl/lib/libnvidia-opticalflow.so.1
I0203 18:12:33.905591 1637 nvc_info.c:198] selecting /usr/lib/wsl/lib/libnvidia-ml.so.1
I0203 18:12:33.906351 1637 nvc_info.c:198] selecting /usr/lib/wsl/lib/libnvidia-encode.so.1
I0203 18:12:33.907139 1637 nvc_info.c:198] selecting /usr/lib/wsl/lib/libnvcuvid.so.1
I0203 18:12:33.907224 1637 nvc_info.c:198] selecting /usr/lib/wsl/lib/libdxcore.so
I0203 18:12:33.907257 1637 nvc_info.c:198] selecting /usr/lib/wsl/lib/libcuda.so.1
W0203 18:12:33.907319 1637 nvc_info.c:398] missing library libnvidia-cfg.so
W0203 18:12:33.907338 1637 nvc_info.c:398] missing library libnvidia-nscq.so
W0203 18:12:33.907341 1637 nvc_info.c:398] missing library libnvidia-opencl.so
W0203 18:12:33.907343 1637 nvc_info.c:398] missing library libnvidia-ptxjitcompiler.so
W0203 18:12:33.907345 1637 nvc_info.c:398] missing library libnvidia-fatbinaryloader.so
W0203 18:12:33.907346 1637 nvc_info.c:398] missing library libnvidia-allocator.so
W0203 18:12:33.907348 1637 nvc_info.c:398] missing library libnvidia-compiler.so
W0203 18:12:33.907349 1637 nvc_info.c:398] missing library libnvidia-pkcs11.so
W0203 18:12:33.907351 1637 nvc_info.c:398] missing library libnvidia-ngx.so
W0203 18:12:33.907352 1637 nvc_info.c:398] missing library libvdpau_nvidia.so
W0203 18:12:33.907354 1637 nvc_info.c:398] missing library libnvidia-eglcore.so
W0203 18:12:33.907355 1637 nvc_info.c:398] missing library libnvidia-glcore.so
W0203 18:12:33.907357 1637 nvc_info.c:398] missing library libnvidia-tls.so
W0203 18:12:33.907359 1637 nvc_info.c:398] missing library libnvidia-glsi.so
W0203 18:12:33.907360 1637 nvc_info.c:398] missing library libnvidia-fbc.so
W0203 18:12:33.907362 1637 nvc_info.c:398] missing library libnvidia-ifr.so
W0203 18:12:33.907363 1637 nvc_info.c:398] missing library libnvidia-rtcore.so
W0203 18:12:33.907365 1637 nvc_info.c:398] missing library libnvoptix.so
W0203 18:12:33.907366 1637 nvc_info.c:398] missing library libGLX_nvidia.so
W0203 18:12:33.907368 1637 nvc_info.c:398] missing library libEGL_nvidia.so
W0203 18:12:33.907369 1637 nvc_info.c:398] missing library libGLESv2_nvidia.so
W0203 18:12:33.907371 1637 nvc_info.c:398] missing library libGLESv1_CM_nvidia.so
W0203 18:12:33.907372 1637 nvc_info.c:398] missing library libnvidia-glvkspirv.so
W0203 18:12:33.907374 1637 nvc_info.c:398] missing library libnvidia-cbl.so
W0203 18:12:33.907375 1637 nvc_info.c:402] missing compat32 library libnvidia-ml.so
W0203 18:12:33.907390 1637 nvc_info.c:402] missing compat32 library libnvidia-cfg.so
W0203 18:12:33.907394 1637 nvc_info.c:402] missing compat32 library libnvidia-nscq.so
W0203 18:12:33.907396 1637 nvc_info.c:402] missing compat32 library libcuda.so
W0203 18:12:33.907399 1637 nvc_info.c:402] missing compat32 library libnvidia-opencl.so
W0203 18:12:33.907414 1637 nvc_info.c:402] missing compat32 library libnvidia-ptxjitcompiler.so
W0203 18:12:33.907431 1637 nvc_info.c:402] missing compat32 library libnvidia-fatbinaryloader.so
W0203 18:12:33.907434 1637 nvc_info.c:402] missing compat32 library libnvidia-allocator.so
W0203 18:12:33.907436 1637 nvc_info.c:402] missing compat32 library libnvidia-compiler.so
W0203 18:12:33.907437 1637 nvc_info.c:402] missing compat32 library libnvidia-pkcs11.so
W0203 18:12:33.907439 1637 nvc_info.c:402] missing compat32 library libnvidia-ngx.so
W0203 18:12:33.907441 1637 nvc_info.c:402] missing compat32 library libvdpau_nvidia.so
W0203 18:12:33.907442 1637 nvc_info.c:402] missing compat32 library libnvidia-encode.so
W0203 18:12:33.907444 1637 nvc_info.c:402] missing compat32 library libnvidia-opticalflow.so
W0203 18:12:33.907447 1637 nvc_info.c:402] missing compat32 library libnvcuvid.so
W0203 18:12:33.907461 1637 nvc_info.c:402] missing compat32 library libnvidia-eglcore.so
W0203 18:12:33.907479 1637 nvc_info.c:402] missing compat32 library libnvidia-glcore.so
W0203 18:12:33.907482 1637 nvc_info.c:402] missing compat32 library libnvidia-tls.so
W0203 18:12:33.907484 1637 nvc_info.c:402] missing compat32 library libnvidia-glsi.so
W0203 18:12:33.907486 1637 nvc_info.c:402] missing compat32 library libnvidia-fbc.so
W0203 18:12:33.907488 1637 nvc_info.c:402] missing compat32 library libnvidia-ifr.so
W0203 18:12:33.907489 1637 nvc_info.c:402] missing compat32 library libnvidia-rtcore.so
W0203 18:12:33.907491 1637 nvc_info.c:402] missing compat32 library libnvoptix.so
W0203 18:12:33.907492 1637 nvc_info.c:402] missing compat32 library libGLX_nvidia.so
W0203 18:12:33.907494 1637 nvc_info.c:402] missing compat32 library libEGL_nvidia.so
W0203 18:12:33.907495 1637 nvc_info.c:402] missing compat32 library libGLESv2_nvidia.so
W0203 18:12:33.907499 1637 nvc_info.c:402] missing compat32 library libGLESv1_CM_nvidia.so
W0203 18:12:33.907500 1637 nvc_info.c:402] missing compat32 library libnvidia-glvkspirv.so
W0203 18:12:33.907527 1637 nvc_info.c:402] missing compat32 library libnvidia-cbl.so
W0203 18:12:33.907531 1637 nvc_info.c:402] missing compat32 library libdxcore.so
I0203 18:12:33.908902 1637 nvc_info.c:278] selecting /usr/lib/wsl/drivers/nvlti.inf_amd64_f0a75371d3692c1a/nvidia-smi
W0203 18:12:34.217108 1637 nvc_info.c:424] missing binary nvidia-debugdump
W0203 18:12:34.217139 1637 nvc_info.c:424] missing binary nvidia-persistenced
W0203 18:12:34.217143 1637 nvc_info.c:424] missing binary nv-fabricmanager
W0203 18:12:34.217144 1637 nvc_info.c:424] missing binary nvidia-cuda-mps-control
W0203 18:12:34.217146 1637 nvc_info.c:424] missing binary nvidia-cuda-mps-server
I0203 18:12:34.217164 1637 nvc_info.c:439] skipping path lookup for dxcore
I0203 18:12:34.217179 1637 nvc_info.c:522] listing device /dev/dxg
W0203 18:12:34.217207 1637 nvc_info.c:348] missing ipc path /var/run/nvidia-persistenced/socket
W0203 18:12:34.217248 1637 nvc_info.c:348] missing ipc path /var/run/nvidia-fabricmanager/socket
W0203 18:12:34.217278 1637 nvc_info.c:348] missing ipc path /tmp/nvidia-mps
I0203 18:12:34.217299 1637 nvc_info.c:815] requesting device information with ''
I0203 18:12:34.227700 1637 nvc_info.c:687] listing dxcore adapter 0 (GPU-b5e386b4-3e71-5837-aca5-80c5914cf07f at 00000000:01:00.0)
NVRM version: 510.06
CUDA version: 11.6
Device Index: 0
Device Minor: 0
Model: NVIDIA GeForce GTX 1650 Ti with Max-Q Design
Brand: GeForce
GPU UUID: GPU-b5e386b4-3e71-5837-aca5-80c5914cf07f
Bus Location: 00000000:01:00.0
Architecture: 7.5
I0203 18:12:34.227772 1637 nvc.c:430] shutting down library context
I0203 18:12:34.227859 1639 rpc.c:95] terminating nvcgo rpc service
I0203 18:12:34.228242 1637 rpc.c:135] nvcgo rpc service terminated successfully
I0203 18:12:34.229403 1638 rpc.c:95] terminating driver rpc service
I0203 18:12:34.230364 1637 rpc.c:135] driver rpc service terminated successfully
$ uname -a
Linux LAPTOP-E1MFF41S 4.19.128-microsoft-standard #1 SMP Tue Jun 23 12:58:10 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$ nvidia-smi -a
==============NVSMI LOG==============
Timestamp                       : Thu Feb 3 13:18:48 2022
Driver Version                  : 510.06
CUDA Version                    : 11.6

Attached GPUs                   : 1
GPU 00000000:01:00.0
    Product Name                : NVIDIA GeForce GTX 1650 Ti with Max-Q Design
    Product Brand               : GeForce
    Product Architecture        : Turing
    Display Mode                : Enabled
    Display Active              : Enabled
    Persistence Mode            : Enabled
    MIG Mode
        Current                 : N/A
        Pending                 : N/A
    Accounting Mode             : Disabled
    Accounting Mode Buffer Size : 4000
    Driver Model
        Current                 : WDDM
        Pending                 : WDDM
    Serial Number               : N/A
    GPU UUID                    : GPU-b5e386b4-3e71-5837-aca5-80c5914cf07f
    Minor Number                : N/A
    VBIOS Version               : 90.17.41.00.46
    MultiGPU Board              : No
    Board ID                    : 0x100
    GPU Part Number             : N/A
    Module ID                   : 0
    Inforom Version
        Image Version           : G001.0000.02.04
        OEM Object              : 1.1
        ECC Object              : N/A
        Power Management Object : N/A
    GPU Operation Mode
        Current                 : N/A
        Pending                 : N/A
    GSP Firmware Version        : N/A
    GPU Virtualization Mode
        Virtualization Mode     : None
        Host VGPU Mode          : N/A
    IBMNPU
        Relaxed Ordering Mode   : N/A
    PCI
        Bus                     : 0x01
        Device                  : 0x00
        Domain                  : 0x0000
        Device Id               : 0x1F9510DE
        Bus Id                  : 00000000:01:00.0
        Sub System Id           : 0x22C017AA
        GPU Link Info
            PCIe Generation
                Max             : 3
                Current         : 3
            Link Width
                Max             : 16x
                Current         : 16x
        Bridge Chip
            Type                : N/A
            Firmware            : N/A
        Replays Since Reset     : 0
        Replay Number Rollovers : 0
        Tx Throughput           : 218000 KB/s
        Rx Throughput           : 1000 KB/s
    Fan Speed                   : N/A
    Performance State           : P8
    Clocks Throttle Reasons
        Idle                    : Active
        Applications Clocks Setting : Not Active
        SW Power Cap            : Not Active
        HW Slowdown             : Not Active
            HW Thermal Slowdown : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost              : Not Active
        SW Thermal Slowdown     : Not Active
        Display Clock Setting   : Not Active
    FB Memory Usage
        Total                   : 4096 MiB
        Used                    : 1337 MiB
        Free                    : 2759 MiB
    BAR1 Memory Usage
        Total                   : 256 MiB
        Used                    : 2 MiB
        Free                    : 254 MiB
    Compute Mode                : Default
    Utilization
        Gpu                     : N/A
        Memory                  : N/A
        Encoder                 : 0 %
        Decoder                 : 0 %
    Encoder Stats
        Active Sessions         : 0
        Average FPS             : 0
        Average Latency         : 0
    FBC Stats
        Active Sessions         : 0
        Average FPS             : 0
        Average Latency         : 0
    Ecc Mode
        Current                 : N/A
        Pending                 : N/A
    ECC Errors
        Volatile
            SRAM Correctable    : N/A
            SRAM Uncorrectable  : N/A
            DRAM Correctable    : N/A
            DRAM Uncorrectable  : N/A
        Aggregate
            SRAM Correctable    : N/A
            SRAM Uncorrectable  : N/A
            DRAM Correctable    : N/A
            DRAM Uncorrectable  : N/A
    Retired Pages
        Single Bit ECC          : N/A
        Double Bit ECC          : N/A
        Pending Page Blacklist  : N/A
    Remapped Rows               : N/A
    Temperature
        GPU Current Temp        : 40 C
        GPU Shutdown Temp       : 102 C
        GPU Slowdown Temp       : 97 C
        GPU Max Operating Temp  : 75 C
        GPU Target Temperature  : N/A
        Memory Current Temp     : N/A
        Memory Max Operating Temp : N/A
    Power Readings
        Power Management        : N/A
        Power Draw              : 3.99 W
        Power Limit             : N/A
        Default Power Limit     : N/A
        Enforced Power Limit    : N/A
        Min Power Limit         : N/A
        Max Power Limit         : N/A
    Clocks
        Graphics                : 77 MHz
        SM                      : 77 MHz
        Memory                  : 197 MHz
        Video                   : 540 MHz
    Applications Clocks
        Graphics                : N/A
        Memory                  : N/A
    Default Applications Clocks
        Graphics                : N/A
        Memory                  : N/A
    Max Clocks
        Graphics                : 2100 MHz
        SM                      : 2100 MHz
        Memory                  : 5001 MHz
        Video                   : 1950 MHz
    Max Customer Boost Clocks
        Graphics                : N/A
    Clock Policy
        Auto Boost              : N/A
        Auto Boost Default      : N/A
    Voltage
        Graphics                : N/A
    Processes                   : None
$ docker version
Client:
 Version:           20.10.7
 API version:       1.41
 Go version:        go1.13.8
 Git commit:        20.10.7-0ubuntu5~20.04.2
 Built:             Mon Nov 1 00:34:17 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.12
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.16.12
  Git commit:       459d0df
  Built:            Mon Dec 13 11:43:56 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.12
  GitCommit:        7b11cfaabd73bb80907dd23182b9347b4245eb5d
 runc:
  Version:          1.0.2
  GitCommit:        v1.0.2-0-g52b36a2
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
$ dpkg -l '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-================================-============-============-=====================================================
un libgldispatch0-nvidia (no description available)
ii libnvidia-container-tools 1.8.0~rc.2-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.8.0~rc.2-1 amd64 NVIDIA container runtime library
un nvidia-common (no description available)
un nvidia-container-runtime (no description available)
un nvidia-container-runtime-hook (no description available)
ii nvidia-container-toolkit 1.8.0~rc.2-1 amd64 NVIDIA container runtime hook
un nvidia-docker (no description available)
ii nvidia-docker2 2.8.0-1 all nvidia-docker CLI wrapper
un nvidia-legacy-304xx-vdpau-driver (no description available)
un nvidia-legacy-340xx-vdpau-driver (no description available)
un nvidia-libopencl1-dev (no description available)
un nvidia-prime (no description available)
un nvidia-vdpau-driver (no description available)
$ nvidia-container-cli -V
cli-version: 1.8.0~rc.2
lib-version: 1.8.0~rc.2
build date: 2022-01-28T10:54+00:00
build revision: d48f9b0d505fca0aff7c88cee790f9c56aa1b851
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections