NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

Issue running Clara on WSL2+Ubuntu 20.4+Docker: merged/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: file exists: unknown. #287

Open akemisetti opened 2 years ago

akemisetti commented 2 years ago


1. Issue or feature description

Clara v4.0 does not run on WSL2 + Ubuntu 20.04; it errors out. I got the following error:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: mount error: file creation failed: /var/lib/docker/overlay2/c78f7e8d06e54ac4efaf4d12915bbf305449899fd5a3f2a40126f16f26a8f54c/merged/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1: file exists: unknown.

2. Steps to reproduce the issue

Launch the Docker container; it errors out. I configured NVIDIA Docker on WSL following the steps in https://docs.nvidia.com/cuda/wsl-user-guide/index.html. The setup went fine.

3. Information to attach (optional if deemed irrelevant)

I1013 02:45:29.787970 606 nvc.c:372] initializing library context (version=1.5.1, build=4afad130c4c253abd3b2db563ffe9331594bda41)
I1013 02:45:29.787993 606 nvc.c:346] using root /
I1013 02:45:29.787995 606 nvc.c:347] using ldcache /etc/ld.so.cache
I1013 02:45:29.787997 606 nvc.c:348] using unprivileged user 1000:1000
I1013 02:45:29.788008 606 nvc.c:389] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I1013 02:45:29.803470 606 dxcore.c:227] Creating a new WDDM Adapter for hAdapter:40000000 luid:2d610d
I1013 02:45:29.809296 606 dxcore.c:210] Core Nvidia component libcuda.so.1.1 not found in /usr/lib/wsl/drivers/iigd_dch.inf_amd64_e6610765cda2bce8
I1013 02:45:29.810241 606 dxcore.c:210] Core Nvidia component libcuda_loader.so not found in /usr/lib/wsl/drivers/iigd_dch.inf_amd64_e6610765cda2bce8
I1013 02:45:29.811035 606 dxcore.c:210] Core Nvidia component libnvidia-ptxjitcompiler.so.1 not found in /usr/lib/wsl/drivers/iigd_dch.inf_amd64_e6610765cda2bce8
I1013 02:45:29.811743 606 dxcore.c:210] Core Nvidia component libnvidia-ml.so.1 not found in /usr/lib/wsl/drivers/iigd_dch.inf_amd64_e6610765cda2bce8
I1013 02:45:29.812464 606 dxcore.c:210] Core Nvidia component libnvidia-ml_loader.so not found in /usr/lib/wsl/drivers/iigd_dch.inf_amd64_e6610765cda2bce8
I1013 02:45:29.813196 606 dxcore.c:210] Core Nvidia component nvidia-smi not found in /usr/lib/wsl/drivers/iigd_dch.inf_amd64_e6610765cda2bce8
I1013 02:45:29.813246 606 dxcore.c:215] No Nvidia component found in /usr/lib/wsl/drivers/iigd_dch.inf_amd64_e6610765cda2bce8
E1013 02:45:29.813249 606 dxcore.c:261] Failed to query the core Nvidia libraries for the adapter. Skipping it.
I1013 02:45:29.813252 606 dxcore.c:227] Creating a new WDDM Adapter for hAdapter:40000040 luid:2e596b
I1013 02:45:29.820258 606 dxcore.c:268] Adding new adapter via dxcore hAdapter:40000040 luid:2e596b wddm version:3000
I1013 02:45:29.820294 606 dxcore.c:326] dxcore layer initialized successfully
W1013 02:45:29.820582 606 nvc.c:397] skipping kernel modules load on WSL
I1013 02:45:29.820755 607 driver.c:101] starting driver service
I1013 02:45:29.865297 606 nvc_info.c:758] requesting driver information with ''
I1013 02:45:29.958605 606 nvc_info.c:197] selecting /usr/lib/wsl/lib/libnvidia-opticalflow.so.1
I1013 02:45:29.959908 606 nvc_info.c:197] selecting /usr/lib/wsl/lib/libnvidia-ml.so.1
I1013 02:45:29.961220 606 nvc_info.c:197] selecting /usr/lib/wsl/lib/libnvidia-encode.so.1
I1013 02:45:29.962268 606 nvc_info.c:197] selecting /usr/lib/wsl/lib/libnvcuvid.so.1
I1013 02:45:29.962374 606 nvc_info.c:197] selecting /usr/lib/wsl/lib/libdxcore.so
I1013 02:45:29.962416 606 nvc_info.c:197] selecting /usr/lib/wsl/lib/libcuda.so.1
W1013 02:45:29.962482 606 nvc_info.c:397] missing library libnvidia-cfg.so
W1013 02:45:29.962502 606 nvc_info.c:397] missing library libnvidia-nscq.so
W1013 02:45:29.962506 606 nvc_info.c:397] missing library libnvidia-opencl.so
W1013 02:45:29.962508 606 nvc_info.c:397] missing library libnvidia-ptxjitcompiler.so
W1013 02:45:29.962510 606 nvc_info.c:397] missing library libnvidia-fatbinaryloader.so
W1013 02:45:29.962512 606 nvc_info.c:397] missing library libnvidia-allocator.so
W1013 02:45:29.962514 606 nvc_info.c:397] missing library libnvidia-compiler.so
W1013 02:45:29.962515 606 nvc_info.c:397] missing library libnvidia-ngx.so
W1013 02:45:29.962517 606 nvc_info.c:397] missing library libvdpau_nvidia.so
W1013 02:45:29.962519 606 nvc_info.c:397] missing library libnvidia-eglcore.so
W1013 02:45:29.962521 606 nvc_info.c:397] missing library libnvidia-glcore.so
W1013 02:45:29.962523 606 nvc_info.c:397] missing library libnvidia-tls.so
W1013 02:45:29.962525 606 nvc_info.c:397] missing library libnvidia-glsi.so
W1013 02:45:29.962527 606 nvc_info.c:397] missing library libnvidia-fbc.so
W1013 02:45:29.962528 606 nvc_info.c:397] missing library libnvidia-ifr.so
W1013 02:45:29.962530 606 nvc_info.c:397] missing library libnvidia-rtcore.so
W1013 02:45:29.962532 606 nvc_info.c:397] missing library libnvoptix.so
W1013 02:45:29.962534 606 nvc_info.c:397] missing library libGLX_nvidia.so
W1013 02:45:29.962536 606 nvc_info.c:397] missing library libEGL_nvidia.so
W1013 02:45:29.962538 606 nvc_info.c:397] missing library libGLESv2_nvidia.so
W1013 02:45:29.962553 606 nvc_info.c:397] missing library libGLESv1_CM_nvidia.so
W1013 02:45:29.962557 606 nvc_info.c:397] missing library libnvidia-glvkspirv.so
W1013 02:45:29.962559 606 nvc_info.c:397] missing library libnvidia-cbl.so
W1013 02:45:29.962578 606 nvc_info.c:401] missing compat32 library libnvidia-ml.so
W1013 02:45:29.962594 606 nvc_info.c:401] missing compat32 library libnvidia-cfg.so
W1013 02:45:29.962601 606 nvc_info.c:401] missing compat32 library libnvidia-nscq.so
W1013 02:45:29.962616 606 nvc_info.c:401] missing compat32 library libcuda.so
W1013 02:45:29.962634 606 nvc_info.c:401] missing compat32 library libnvidia-opencl.so
W1013 02:45:29.962639 606 nvc_info.c:401] missing compat32 library libnvidia-ptxjitcompiler.so
W1013 02:45:29.962642 606 nvc_info.c:401] missing compat32 library libnvidia-fatbinaryloader.so
W1013 02:45:29.962658 606 nvc_info.c:401] missing compat32 library libnvidia-allocator.so
W1013 02:45:29.962679 606 nvc_info.c:401] missing compat32 library libnvidia-compiler.so
W1013 02:45:29.962684 606 nvc_info.c:401] missing compat32 library libnvidia-ngx.so
W1013 02:45:29.962687 606 nvc_info.c:401] missing compat32 library libvdpau_nvidia.so
W1013 02:45:29.962690 606 nvc_info.c:401] missing compat32 library libnvidia-encode.so
W1013 02:45:29.962693 606 nvc_info.c:401] missing compat32 library libnvidia-opticalflow.so
W1013 02:45:29.962698 606 nvc_info.c:401] missing compat32 library libnvcuvid.so
W1013 02:45:29.962700 606 nvc_info.c:401] missing compat32 library libnvidia-eglcore.so
W1013 02:45:29.962702 606 nvc_info.c:401] missing compat32 library libnvidia-glcore.so
W1013 02:45:29.962704 606 nvc_info.c:401] missing compat32 library libnvidia-tls.so
W1013 02:45:29.962706 606 nvc_info.c:401] missing compat32 library libnvidia-glsi.so
W1013 02:45:29.962708 606 nvc_info.c:401] missing compat32 library libnvidia-fbc.so
W1013 02:45:29.962710 606 nvc_info.c:401] missing compat32 library libnvidia-ifr.so
W1013 02:45:29.962712 606 nvc_info.c:401] missing compat32 library libnvidia-rtcore.so
W1013 02:45:29.962714 606 nvc_info.c:401] missing compat32 library libnvoptix.so
W1013 02:45:29.962729 606 nvc_info.c:401] missing compat32 library libGLX_nvidia.so
W1013 02:45:29.962733 606 nvc_info.c:401] missing compat32 library libEGL_nvidia.so
W1013 02:45:29.962735 606 nvc_info.c:401] missing compat32 library libGLESv2_nvidia.so
W1013 02:45:29.962737 606 nvc_info.c:401] missing compat32 library libGLESv1_CM_nvidia.so
W1013 02:45:29.962739 606 nvc_info.c:401] missing compat32 library libnvidia-glvkspirv.so
W1013 02:45:29.962741 606 nvc_info.c:401] missing compat32 library libnvidia-cbl.so
W1013 02:45:29.962744 606 nvc_info.c:401] missing compat32 library libdxcore.so
I1013 02:45:29.964192 606 nvc_info.c:277] selecting /usr/lib/wsl/drivers/nvdmwi.inf_amd64_53b6a0a2497c9235/nvidia-smi
W1013 02:45:30.230696 606 nvc_info.c:423] missing binary nvidia-debugdump
W1013 02:45:30.230729 606 nvc_info.c:423] missing binary nvidia-persistenced
W1013 02:45:30.230733 606 nvc_info.c:423] missing binary nv-fabricmanager
W1013 02:45:30.230735 606 nvc_info.c:423] missing binary nvidia-cuda-mps-control
W1013 02:45:30.230737 606 nvc_info.c:423] missing binary nvidia-cuda-mps-server
I1013 02:45:30.230755 606 nvc_info.c:437] skipping path lookup for dxcore
I1013 02:45:30.230764 606 nvc_info.c:520] listing device /dev/dxg
W1013 02:45:30.230793 606 nvc_info.c:347] missing ipc path /var/run/nvidia-persistenced/socket
W1013 02:45:30.230804 606 nvc_info.c:347] missing ipc path /var/run/nvidia-fabricmanager/socket
W1013 02:45:30.230832 606 nvc_info.c:347] missing ipc path /tmp/nvidia-mps
I1013 02:45:30.230851 606 nvc_info.c:814] requesting device information with ''
I1013 02:45:30.243786 606 nvc_info.c:686] listing dxcore adapter 0 (GPU-db227b23-d556-36af-b5b5-1de7cb718915 at 00000000:01:00.0)
NVRM version: 510.06
CUDA version: 11.6

Device Index: 0
Device Minor: 0
Model: Quadro RTX 3000
Brand: Unknown
GPU UUID: GPU-db227b23-d556-36af-b5b5-1de7cb718915
Bus Location: 00000000:01:00.0
Architecture: 7.5
I1013 02:45:30.243858 606 nvc.c:423] shutting down library context
I1013 02:45:30.245250 607 driver.c:163] terminating driver service
I1013 02:45:30.247501 606 driver.c:203] driver service terminated successfully

==============NVSMI LOG==============

Timestamp : Tue Oct 12 19:40:40 2021
Driver Version : 510.06
CUDA Version : 11.6

Attached GPUs : 1
GPU 00000000:01:00.0
    Product Name : Quadro RTX 3000
    Product Brand : Quadro RTX
    Product Architecture : Turing
    Display Mode : Enabled
    Display Active : Enabled
    Persistence Mode : Enabled
    MIG Mode
        Current : N/A
        Pending : N/A
    Accounting Mode : Disabled
    Accounting Mode Buffer Size : 4000
    Driver Model
        Current : WDDM
        Pending : WDDM
    Serial Number : N/A
    GPU UUID : GPU-db227b23-d556-36af-b5b5-1de7cb718915
    Minor Number : N/A
    VBIOS Version : 90.06.39.00.6f
    MultiGPU Board : No
    Board ID : 0x100
    GPU Part Number : N/A
    Module ID : 0
    Inforom Version
        Image Version : G001.0000.02.04
        OEM Object : 1.1
        ECC Object : N/A
        Power Management Object : N/A
    GPU Operation Mode
        Current : N/A
        Pending : N/A
    GSP Firmware Version : N/A
    GPU Virtualization Mode
        Virtualization Mode : None
        Host VGPU Mode : N/A
    IBMNPU
        Relaxed Ordering Mode : N/A
    PCI
        Bus : 0x01
        Device : 0x00
        Domain : 0x0000
        Device Id : 0x1F3610DE
        Bus Id : 00000000:01:00.0
        Sub System Id : 0x09261028
        GPU Link Info
            PCIe Generation
                Max : 3
                Current : 3
            Link Width
                Max : 16x
                Current : 16x
        Bridge Chip
            Type : N/A
            Firmware : N/A
        Replays Since Reset : 0
        Replay Number Rollovers : 0
        Tx Throughput : 27000 KB/s
        Rx Throughput : 0 KB/s
    Fan Speed : N/A
    Performance State : P8
    Clocks Throttle Reasons
        Idle : Active
        Applications Clocks Setting : Not Active
        SW Power Cap : Not Active
        HW Slowdown : Not Active
            HW Thermal Slowdown : Not Active
            HW Power Brake Slowdown : Not Active
        Sync Boost : Not Active
        SW Thermal Slowdown : Not Active
        Display Clock Setting : Not Active
    FB Memory Usage
        Total : 6144 MiB
        Used : 790 MiB
        Free : 5354 MiB
    BAR1 Memory Usage
        Total : 256 MiB
        Used : 2 MiB
        Free : 254 MiB
    Compute Mode : Default
    Utilization
        Gpu : N/A
        Memory : N/A
        Encoder : 0 %
        Decoder : 0 %
    Encoder Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    FBC Stats
        Active Sessions : 0
        Average FPS : 0
        Average Latency : 0
    Ecc Mode
        Current : N/A
        Pending : N/A
    ECC Errors
        Volatile
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A
        Aggregate
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
    Max Power Limit : N/A
    Clocks
        Graphics : 48 MHz
        SM : 48 MHz
        Memory : 130 MHz
        Video : 540 MHz
    Applications Clocks
        Graphics : N/A
        Memory : N/A
    Default Applications Clocks
        Graphics : N/A
        Memory : N/A
    Max Clocks
        Graphics : 2100 MHz
        SM : 2100 MHz
        Memory : 7001 MHz
        Video : 1950 MHz
    Max Customer Boost Clocks
        Graphics : Unknown Error
    Clock Policy
        Auto Boost : N/A
        Auto Boost Default : N/A
    Voltage
        Graphics : N/A
    Processes : None

===================================

Client:
 Version: 20.10.7
 API version: 1.41
 Go version: go1.13.8
 Git commit: 20.10.7-0ubuntu1~20.04.2
 Built: Fri Oct 1 14:07:06 2021
 OS/Arch: linux/amd64
 Context: default
 Experimental: true

Server: Docker Engine - Community
 Engine:
  Version: 20.10.8
  API version: 1.41 (minimum version 1.12)
  Go version: go1.16.6
  Git commit: 75249d8
  Built: Fri Jul 30 19:52:31 2021
  OS/Arch: linux/amd64
  Experimental: false
 containerd:
  Version: 1.4.9
  GitCommit: e25210fe30a0a703442421b0f60afac609f950a3
 runc:
  Version: 1.0.1
  GitCommit: v1.0.1-0-g4144b63
 docker-init:
  Version: 0.19.0
  GitCommit: de40ad0

Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                          Version      Architecture Description
+++-=============================-============-============-=====================================================
un  libgldispatch0-nvidia                                   (no description available)
ii  libnvidia-container-tools     1.5.1-1      amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64    1.5.1-1      amd64        NVIDIA container runtime library
ii  nvidia-container-runtime      3.5.0-1      amd64        NVIDIA container runtime
un  nvidia-container-runtime-hook                           (no description available)
ii  nvidia-container-toolkit      1.5.1-1      amd64        NVIDIA container runtime hook
un  nvidia-docker                                           (no description available)
ii  nvidia-docker2                2.6.0-1      all          nvidia-docker CLI wrapper

elezar commented 2 years ago

This seems to be a duplicate of NVIDIA/nvidia-container-toolkit#289. I will check the nvcr.io/nvidia/clara-train-sdk:v4.0 image to see whether it already contains the files as discussed there.
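
A rough sketch of how an image could be checked for driver files that are already baked in. This is only an illustration: the function name list_baked_in_driver_libs is hypothetical, and it assumes the image's root filesystem has first been extracted to a local directory (for example with docker export) rather than inspecting the image directly.

```shell
#!/bin/sh
# Hypothetical helper (not part of any NVIDIA tool): list driver libraries
# that already exist in an extracted image root filesystem.
# $1 is a directory holding the extracted filesystem.
list_baked_in_driver_libs() {
    root="$1"
    libdir="$root/usr/lib/x86_64-linux-gnu"
    # Nothing to report if the library directory is absent.
    [ -d "$libdir" ] || return 0
    # Same name patterns that are discussed in this thread: libnvidia* and libcuda*.
    find "$libdir" -maxdepth 1 \( -name 'libnvidia*' -o -name 'libcuda*' \) -print | sort
}

# Example usage (assumed workflow; /tmp/rootfs is an arbitrary scratch path):
#   cid=$(docker create nvcr.io/nvidia/clara-train-sdk:v4.0)
#   mkdir -p /tmp/rootfs && docker export "$cid" | tar -x -C /tmp/rootfs
#   docker rm "$cid"
#   list_baked_in_driver_libs /tmp/rootfs
```

If the listing includes libnvidia-ml.so.1, that would be consistent with the "file exists" mount error reported above.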

akemisetti commented 2 years ago

@elezar Thanks for pointing to the existing issue.

The solution suggested in NVIDIA/nvidia-container-toolkit#289 worked for me. Copying it here.

docker run --privileged the image, then execute unmount (umount) and rm inside the container to get rid of the libnvidia and libcuda files, then docker commit to save a new image. When I run this new image with the --gpus all --runtime=nvidia options, it no longer gives the error.
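
The workaround quoted above might look roughly like the following. This is a sketch under assumptions, not the exact commands from the other issue: the function name remove_driver_libs is hypothetical, the library directory is taken from the error message, and the names in angle brackets are placeholders.

```shell
#!/bin/sh
# Sketch of the "unmount & rm" step, intended to run inside a container that
# was started with `docker run --privileged`. Hypothetical helper name.
# $1 is the library directory (defaults to the path from the error message).
remove_driver_libs() {
    libdir="${1:-/usr/lib/x86_64-linux-gnu}"
    for f in "$libdir"/libnvidia* "$libdir"/libcuda*; do
        # Skip glob patterns that matched nothing.
        [ -e "$f" ] || [ -L "$f" ] || continue
        # A file injected by the runtime may be a bind mount; for an ordinary
        # file umount simply fails, so its error is ignored.
        umount "$f" 2>/dev/null || true
        rm -rf "$f"
    done
}

# Surrounding steps on the host (placeholders in <>):
#   docker run --privileged --name <tmp-name> -it <image> /bin/sh
#       # inside the container: run remove_driver_libs, then exit
#   docker commit <tmp-name> <new-image:tag>
#   docker run --gpus all --runtime=nvidia <new-image:tag>
```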

Opdoop commented 2 years ago

To build a new image, more specifically:

FROM <the image you care about>

RUN rm -rf /usr/lib/x86_64-linux-gnu/libnvidia* /usr/lib/x86_64-linux-gnu/libcuda*

Then run docker run -it --gpus all [the new image:tag]; the container then uses the GPU successfully.
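
For anyone who wants to script this, here is a minimal sketch that writes the Dockerfile shown above for a given base image. The helper name write_stripped_dockerfile, the ./stripped directory, and the example tag are all hypothetical choices for illustration.

```shell
#!/bin/sh
# Hypothetical helper: write the two-instruction Dockerfile from this comment.
# $1 is the base image, $2 is the build context directory to create.
write_stripped_dockerfile() {
    base_image="$1"
    dir="$2"
    mkdir -p "$dir"
    # Unquoted heredoc: $base_image is expanded, the * characters are written literally.
    cat > "$dir/Dockerfile" <<EOF
FROM $base_image

RUN rm -rf /usr/lib/x86_64-linux-gnu/libnvidia* /usr/lib/x86_64-linux-gnu/libcuda*
EOF
}

# Example usage (the tag clara-train-sdk:v4.0-stripped is an arbitrary name):
#   write_stripped_dockerfile nvcr.io/nvidia/clara-train-sdk:v4.0 ./stripped
#   docker build -t clara-train-sdk:v4.0-stripped ./stripped
#   docker run -it --gpus all clara-train-sdk:v4.0-stripped
```

Building from a Dockerfile keeps the change reproducible, whereas the docker commit route depends on manual steps inside a running container.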