Closed: patelrohan008 closed this issue 1 year ago
It seems as if the modulus image was built with the NVIDIA Container Runtime enabled, and as such the files injected (or created) by the NVIDIA Container CLI still exist in the image. If you remove these files, you should be able to continue.
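For reference, one quick way to check whether such files are baked into the image itself is to start it without requesting any GPUs, so that nothing is mounted on top of them at container start. The snippet below is only a sketch: it assumes the modulus:22.09 tag used elsewhere in this thread and that the default Docker runtime is not set to nvidia.

# Sketch: list NVIDIA driver libraries that ship inside the image itself.
# Anything printed here was created at build time, not injected at run time.
docker run --rm modulus:22.09 \
    sh -c 'ls -l /usr/lib/x86_64-linux-gnu/libcuda.so* /usr/lib/x86_64-linux-gnu/libnvidia-*.so* 2>/dev/null'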
Assuming that you're talking about removing the libnvidia-ml.so.1, libcuda.so.1, etc. files, I already tried the solution which I linked in the original post. Creating a new image which removes those files results in an image which isn't able to detect any CUDA-capable devices and, as a result, doesn't use the GPU. As a side note, I'm not sure the issue is with how I'm running containers: I have no issues with the PyTorch container, which I load and run in the same way.
@elezar It seems I'm not alone with regard to the proposed solution resulting in a container which isn't able to utilize the GPU. Several people on the developer forums experienced a similar issue with a different image, attempted the same solution, and ended up with a container that was unable to use the GPU.
I also faced exactly the same problem (failed to run modulus:22.09 with WSL2).
As mentioned in https://github.com/NVIDIA/nvidia-container-toolkit/issues/289, we can run successfully if we remove the related files that would otherwise be inserted by docker run --gpus=all. However, the GPU seems to be disabled with this workaround in my case (as in https://github.com/NVIDIA/nvidia-container-toolkit/issues/289).
After further struggle, I finally found a true workaround for modulus:22.09: building the image with
FROM modulus:22.09
RUN rm -rf \
/usr/lib/x86_64-linux-gnu/libcuda.so* \
/usr/lib/x86_64-linux-gnu/libnvcuvid.so* \
/usr/lib/x86_64-linux-gnu/libnvidia-*.so* \
/usr/lib/firmware \
/usr/local/cuda/compat/lib
and the GPU is enabled now :)
$ sudo docker build -t modulus:22.09.0hotfix - <./Dockerfile
...
$ sudo docker run -it --rm --gpus all modulus:22.09.0hotfix python3
=============
== PyTorch ==
=============
NVIDIA Release 22.08 (build 42105213)
PyTorch Version 1.13.0a0+d321be6
Container image Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Copyright (c) 2014-2022 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
NOTE: The SHMEM allocation limit is set to the default of 64MB. This may be
insufficient for PyTorch. NVIDIA recommends the use of the following flags:
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 ...
Python 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:10)
[GCC 10.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
It seems that /usr/local/cuda/compat/lib was preventing CUDA from running, because it contains shared objects for the specific driver version 515.65.01.
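A hedged way to confirm this (again assuming the modulus:22.09 tag) is to list that directory inside the image; the compat library file names typically encode the driver version they target, e.g. libcuda.so.515.65.01:

# Sketch: show which driver version the baked-in CUDA compat libraries target.
docker run --rm modulus:22.09 sh -c 'ls -l /usr/local/cuda/compat/lib*'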
Also, I think the layer that executes /bin/bash -cu apt-get update && apt-get install -y git-lfs graphviz libgl1 && git lfs install is the culprit, because the files above were created by that layer.
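One way to double-check which layer that is (a sketch, not part of the original report): docker history prints the command recorded for each layer along with its size, so the RUN step above should stand out.

# Sketch: locate the layer whose RUN instruction matches the apt-get command above.
docker history --no-trunc modulus:22.09 | grep 'apt-get install -y git-lfs'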
Thank you @atomicky! This fixed the problem for me, and thus I'm marking this issue as closed.
thank you !!!
1. Issue or feature description
The Docker container version of Modulus 22.09 doesn't run on WSL2 with Ubuntu 20.04; it yields the following error.
Please note that I believe this is the same issue encountered in NVIDIA/nvidia-container-toolkit#289 and NVIDIA/nvidia-container-toolkit#287; similar to each of those issues, running
docker run --gpus all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
works without any issues. I have attempted the suggested solution and created a new image which removes the problematic files. However, doing so results in the container failing to detect any CUDA-capable devices, and any executed code fails to utilize the GPU. Running the container with the --runtime nvidia --gpus all flags results in the container running without error, but yields the same issue of being unable to utilize the GPU. This issue has been previously mentioned on the Modulus developer forums, and the response seems to suggest the problem is with nvidia-docker, not the container itself.

2. Steps to reproduce the issue
docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
  --runtime nvidia -v ${PWD}/examples:/examples \
  -it --rm modulus:22.09 bash
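If the container does start (for example with the --runtime nvidia flag shown above), the missing-GPU symptom can be confirmed non-interactively; this is just a sketch reusing the torch check shown earlier in the thread:

# Sketch: prints True on a working setup and False when the container cannot see the GPU.
docker run --rm --runtime nvidia --gpus all modulus:22.09 \
    python3 -c "import torch; print(torch.cuda.is_available())"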
3. Information to attach (optional if deemed irrelevant)
[x] Some nvidia-container information:
nvidia-container-cli -k -d /dev/tty info
-- WARNING, the following logs are for debugging purposes only --
I1026 21:07:24.462934 4665 nvc.c:376] initializing library context (version=1.11.0, build=c8f267be0bac1c654d59ad4ea5df907141149977) I1026 21:07:24.462974 4665 nvc.c:350] using root / I1026 21:07:24.462990 4665 nvc.c:351] using ldcache /etc/ld.so.cache I1026 21:07:24.462993 4665 nvc.c:352] using unprivileged user 1000:1000 I1026 21:07:24.463004 4665 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL) I1026 21:07:24.481144 4665 dxcore.c:227] Creating a new WDDM Adapter for hAdapter:40000000 luid:9e89c3 I1026 21:07:24.489958 4665 dxcore.c:268] Adding new adapter via dxcore hAdapter:40000000 luid:9e89c3 wddm version:2700 I1026 21:07:24.489988 4665 dxcore.c:325] dxcore layer initialized successfully W1026 21:07:24.490253 4665 nvc.c:405] skipping kernel modules load on WSL I1026 21:07:24.490393 4666 rpc.c:71] starting driver rpc service I1026 21:07:24.524009 4667 rpc.c:71] starting nvcgo rpc service I1026 21:07:24.524588 4665 nvc_info.c:766] requesting driver information with '' I1026 21:07:24.606205 4665 nvc_info.c:199] selecting /usr/lib/wsl/lib/libnvidia-opticalflow.so.1 I1026 21:07:24.606880 4665 nvc_info.c:199] selecting /usr/lib/wsl/lib/libnvidia-ml.so.1 I1026 21:07:24.607533 4665 nvc_info.c:199] selecting /usr/lib/wsl/lib/libnvidia-encode.so.1 I1026 21:07:24.608182 4665 nvc_info.c:199] selecting /usr/lib/wsl/lib/libnvcuvid.so.1 I1026 21:07:24.608257 4665 nvc_info.c:199] selecting /usr/lib/wsl/lib/libdxcore.so I1026 21:07:24.608292 4665 nvc_info.c:199] selecting /usr/lib/wsl/lib/libcuda.so.1 W1026 21:07:24.608345 4665 nvc_info.c:399] missing library libnvidia-cfg.so W1026 21:07:24.608361 4665 nvc_info.c:399] missing library libnvidia-nscq.so W1026 21:07:24.608364 4665 nvc_info.c:399] missing library libcudadebugger.so W1026 21:07:24.608390 4665 nvc_info.c:399] missing library libnvidia-opencl.so W1026 21:07:24.608393 4665 nvc_info.c:399] missing library libnvidia-ptxjitcompiler.so W1026 21:07:24.608396 4665 nvc_info.c:399] missing library libnvidia-fatbinaryloader.so W1026 21:07:24.608397 4665 nvc_info.c:399] missing library libnvidia-allocator.so W1026 21:07:24.608399 4665 nvc_info.c:399] missing library libnvidia-compiler.so W1026 21:07:24.608401 4665 nvc_info.c:399] missing library libnvidia-pkcs11.so W1026 21:07:24.608402 4665 nvc_info.c:399] missing library libnvidia-ngx.so W1026 21:07:24.608404 4665 nvc_info.c:399] missing library libvdpau_nvidia.so W1026 21:07:24.608406 4665 nvc_info.c:399] missing library libnvidia-eglcore.so W1026 21:07:24.608420 4665 nvc_info.c:399] missing library libnvidia-glcore.so W1026 21:07:24.608424 4665 nvc_info.c:399] missing library libnvidia-tls.so W1026 21:07:24.608426 4665 nvc_info.c:399] missing library libnvidia-glsi.so W1026 21:07:24.608428 4665 nvc_info.c:399] missing library libnvidia-fbc.so W1026 21:07:24.608430 4665 nvc_info.c:399] missing library libnvidia-ifr.so W1026 21:07:24.608432 4665 nvc_info.c:399] missing library libnvidia-rtcore.so W1026 21:07:24.608433 4665 nvc_info.c:399] missing library libnvoptix.so W1026 21:07:24.608435 4665 nvc_info.c:399] missing library libGLX_nvidia.so W1026 21:07:24.608437 4665 nvc_info.c:399] missing library libEGL_nvidia.so W1026 21:07:24.608438 4665 nvc_info.c:399] missing library libGLESv2_nvidia.so W1026 21:07:24.608440 4665 nvc_info.c:399] missing library libGLESv1_CM_nvidia.so W1026 21:07:24.608441 4665 nvc_info.c:399] missing library libnvidia-glvkspirv.so W1026 21:07:24.608443 4665 nvc_info.c:399] missing library libnvidia-cbl.so W1026 
21:07:24.608445 4665 nvc_info.c:403] missing compat32 library libnvidia-ml.so W1026 21:07:24.608447 4665 nvc_info.c:403] missing compat32 library libnvidia-cfg.so W1026 21:07:24.608462 4665 nvc_info.c:403] missing compat32 library libnvidia-nscq.so W1026 21:07:24.608465 4665 nvc_info.c:403] missing compat32 library libcuda.so W1026 21:07:24.608467 4665 nvc_info.c:403] missing compat32 library libcudadebugger.so W1026 21:07:24.608469 4665 nvc_info.c:403] missing compat32 library libnvidia-opencl.so W1026 21:07:24.608471 4665 nvc_info.c:403] missing compat32 library libnvidia-ptxjitcompiler.so W1026 21:07:24.608473 4665 nvc_info.c:403] missing compat32 library libnvidia-fatbinaryloader.so W1026 21:07:24.608474 4665 nvc_info.c:403] missing compat32 library libnvidia-allocator.so W1026 21:07:24.608476 4665 nvc_info.c:403] missing compat32 library libnvidia-compiler.so W1026 21:07:24.608478 4665 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so W1026 21:07:24.608479 4665 nvc_info.c:403] missing compat32 library libnvidia-ngx.so W1026 21:07:24.608481 4665 nvc_info.c:403] missing compat32 library libvdpau_nvidia.so W1026 21:07:24.608484 4665 nvc_info.c:403] missing compat32 library libnvidia-encode.so W1026 21:07:24.608511 4665 nvc_info.c:403] missing compat32 library libnvidia-opticalflow.so W1026 21:07:24.608514 4665 nvc_info.c:403] missing compat32 library libnvcuvid.so W1026 21:07:24.608530 4665 nvc_info.c:403] missing compat32 library libnvidia-eglcore.so W1026 21:07:24.608546 4665 nvc_info.c:403] missing compat32 library libnvidia-glcore.so W1026 21:07:24.608548 4665 nvc_info.c:403] missing compat32 library libnvidia-tls.so W1026 21:07:24.608564 4665 nvc_info.c:403] missing compat32 library libnvidia-glsi.so W1026 21:07:24.608567 4665 nvc_info.c:403] missing compat32 library libnvidia-fbc.so W1026 21:07:24.608569 4665 nvc_info.c:403] missing compat32 library libnvidia-ifr.so W1026 21:07:24.608571 4665 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so W1026 21:07:24.608573 4665 nvc_info.c:403] missing compat32 library libnvoptix.so W1026 21:07:24.608575 4665 nvc_info.c:403] missing compat32 library libGLX_nvidia.so W1026 21:07:24.608589 4665 nvc_info.c:403] missing compat32 library libEGL_nvidia.so W1026 21:07:24.608591 4665 nvc_info.c:403] missing compat32 library libGLESv2_nvidia.so W1026 21:07:24.608606 4665 nvc_info.c:403] missing compat32 library libGLESv1_CM_nvidia.so W1026 21:07:24.608621 4665 nvc_info.c:403] missing compat32 library libnvidia-glvkspirv.so W1026 21:07:24.608624 4665 nvc_info.c:403] missing compat32 library libnvidia-cbl.so W1026 21:07:24.608626 4665 nvc_info.c:403] missing compat32 library libdxcore.so I1026 21:07:24.610022 4665 nvc_info.c:279] selecting /usr/lib/wsl/drivers/nv_dispui.inf_amd64_f2b06cc19dadc00f/nvidia-smi W1026 21:07:25.541539 4665 nvc_info.c:425] missing binary nvidia-debugdump W1026 21:07:25.541567 4665 nvc_info.c:425] missing binary nvidia-persistenced W1026 21:07:25.541570 4665 nvc_info.c:425] missing binary nv-fabricmanager W1026 21:07:25.541571 4665 nvc_info.c:425] missing binary nvidia-cuda-mps-control W1026 21:07:25.541573 4665 nvc_info.c:425] missing binary nvidia-cuda-mps-server I1026 21:07:25.541591 4665 nvc_info.c:441] skipping path lookup for dxcore I1026 21:07:25.541598 4665 nvc_info.c:529] listing device /dev/dxg W1026 21:07:25.541623 4665 nvc_info.c:349] missing ipc path /var/run/nvidia-persistenced/socket W1026 21:07:25.541649 4665 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket W1026 
21:07:25.541669 4665 nvc_info.c:349] missing ipc path /tmp/nvidia-mps I1026 21:07:25.541685 4665 nvc_info.c:822] requesting device information with '' I1026 21:07:25.552505 4665 nvc_info.c:694] listing dxcore adapter 0 (GPU-dbbd71f6-7bf3-4280-3674-7d2f6ce7558e at 00000000:65:00.0) NVRM version: 522.06 CUDA version: 11.8
Device Index: 0 Device Minor: 0 Model: Quadro RTX 5000 Brand: QuadroRTX GPU UUID: GPU-dbbd71f6-7bf3-4280-3674-7d2f6ce7558e Bus Location: 00000000:65:00.0 Architecture: 7.5 I1026 21:07:25.552580 4665 nvc.c:434] shutting down library context I1026 21:07:25.552637 4667 rpc.c:95] terminating nvcgo rpc service I1026 21:07:25.553008 4665 rpc.c:135] nvcgo rpc service terminated successfully I1026 21:07:25.554398 4666 rpc.c:95] terminating driver rpc service I1026 21:07:25.556557 4665 rpc.c:135] driver rpc service terminated successfully
[x] Kernel version from
uname -a
Linux dpr5820-009 5.10.102.1-microsoft-standard-WSL2
[ ] Any relevant kernel output lines from
dmesg
[x] Driver information from
nvidia-smi -a
==============NVSMI LOG==============
Timestamp : Wed Oct 26 14:09:05 2022 Driver Version : 522.06 CUDA Version : 11.8
Attached GPUs : 1 GPU 00000000:65:00.0 Product Name : Quadro RTX 5000 Product Brand : Quadro RTX Product Architecture : Turing Display Mode : Enabled Display Active : Enabled Persistence Mode : Enabled MIG Mode Current : N/A Pending : N/A Accounting Mode : Disabled Accounting Mode Buffer Size : 4000 Driver Model Current : WDDM Pending : WDDM Serial Number : 1562621002428 GPU UUID : GPU-dbbd71f6-7bf3-4280-3674-7d2f6ce7558e Minor Number : N/A VBIOS Version : 90.04.99.00.03 MultiGPU Board : No Board ID : 0x6500 GPU Part Number : 900-5G180-0100-032 Module ID : 0 Inforom Version Image Version : G180.0500.00.02 OEM Object : 1.1 ECC Object : 5.0 Power Management Object : N/A GPU Operation Mode Current : N/A Pending : N/A GSP Firmware Version : N/A GPU Virtualization Mode Virtualization Mode : None Host VGPU Mode : N/A IBMNPU Relaxed Ordering Mode : N/A PCI Bus : 0x65 Device : 0x00 Domain : 0x0000 Device Id : 0x1EB010DE Bus Id : 00000000:65:00.0 Sub System Id : 0x129F1028 GPU Link Info PCIe Generation Max : 3 Current : 3 Link Width Max : 16x Current : 16x Bridge Chip Type : N/A Firmware : N/A Replays Since Reset : 0 Replay Number Rollovers : 0 Tx Throughput : 34000 KB/s Rx Throughput : 60000 KB/s Fan Speed : 49 % Performance State : P2 Clocks Throttle Reasons Idle : Active Applications Clocks Setting : Not Active SW Power Cap : Not Active HW Slowdown : Not Active HW Thermal Slowdown : Not Active HW Power Brake Slowdown : Not Active Sync Boost : Not Active SW Thermal Slowdown : Not Active Display Clock Setting : Not Active FB Memory Usage Total : 16384 MiB Reserved : 214 MiB Used : 9679 MiB Free : 6490 MiB BAR1 Memory Usage Total : 256 MiB Used : 2 MiB Free : 254 MiB Compute Mode : Default Utilization Gpu : 91 % Memory : 70 % Encoder : 0 % Decoder : 0 % Encoder Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 FBC Stats Active Sessions : 0 Average FPS : 0 Average Latency : 0 Ecc Mode Current : Disabled Pending : Disabled ECC Errors Volatile SRAM Correctable : N/A SRAM Uncorrectable : N/A DRAM Correctable : N/A DRAM Uncorrectable : N/A Aggregate SRAM Correctable : N/A SRAM Uncorrectable : N/A DRAM Correctable : N/A DRAM Uncorrectable : N/A Retired Pages Single Bit ECC : 0 Double Bit ECC : 0 Pending Page Blacklist : No Remapped Rows : N/A Temperature GPU Current Temp : 74 C GPU Shutdown Temp : 96 C GPU Slowdown Temp : 93 C GPU Max Operating Temp : 89 C GPU Target Temperature : 83 C Memory Current Temp : N/A Memory Max Operating Temp : N/A Power Readings Power Management : Supported Power Draw : 151.91 W Power Limit : 230.00 W Default Power Limit : 230.00 W Enforced Power Limit : 230.00 W Min Power Limit : 125.00 W Max Power Limit : 230.00 W Clocks Graphics : 1844 MHz SM : 1844 MHz Memory : 6494 MHz Video : 1711 MHz Applications Clocks Graphics : 1620 MHz Memory : 7001 MHz Default Applications Clocks Graphics : 1620 MHz Memory : 7001 MHz Max Clocks Graphics : 2100 MHz SM : 2100 MHz Memory : 7001 MHz Video : 1950 MHz Max Customer Boost Clocks Graphics : N/A Clock Policy Auto Boost : N/A Auto Boost Default : N/A Voltage Graphics : N/A Processes GPU instance ID : N/A Compute instance ID : N/A Process ID : 711 Type : C Name : /python3.8 Used GPU Memory : Not available in WDDM driver model
[x] Docker version from
docker version
Client: Docker Engine - Community
 Version:           20.10.20
 API version:       1.41
 Go version:        go1.18.7
 Git commit:        9fdeb9c
 Built:             Tue Oct 18 18:20:23 2022
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.20
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.18.7
  Git commit:       03df974
  Built:            Tue Oct 18 18:18:12 2022
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.8
  GitCommit:        9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
[x] NVIDIA packages version from
dpkg -l '*nvidia*'
or rpm -qa '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                          Version      Architecture Description
+++-=============================-============-============-=====================================================
un  libgldispatch0-nvidia                                   (no description available)
ii libnvidia-container-tools 1.11.0-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.11.0-1 amd64 NVIDIA container runtime library
un nvidia-container-runtime (no description available)
un nvidia-container-runtime-hook (no description available)
ii nvidia-container-toolkit 1.11.0-1 amd64 NVIDIA Container toolkit
ii nvidia-container-toolkit-base 1.11.0-1 amd64 NVIDIA Container Toolkit Base
un nvidia-docker (no description available)
ii nvidia-docker2 2.11.0-1 all nvidia-docker CLI wrapper
[x] NVIDIA container library version from
nvidia-container-cli -V
cli-version: 1.11.0
lib-version: 1.11.0
build date: 2022-09-06T09:21+00:00
build revision: c8f267be0bac1c654d59ad4ea5df907141149977
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
[ ] NVIDIA container library logs (see troubleshooting)
[x] Docker command, image and tag used
nvcr.io/nvidia/modulus/modulus:22.09