NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

Reinstalling Nvidia-Docker: Not able to run Nvidia Toolkit Containers (Jetson Nano - Jetpack 4.5.1) #294

Closed kaisark closed 8 months ago

kaisark commented 3 years ago

dpkg-nvidia.log

1. Issue or feature description

After reinstalling Nvidia-Docker, I am not able to run NVIDIA Container Toolkit containers (Jetson Nano - JetPack 4.5.1).

2. Steps to reproduce the issue

  1. Remove Docker (sudo apt remove docker)
  2. Remove nvidia-docker2 (sudo apt remove nvidia-docker2)
  3. Reinstall Docker (https://docs.docker.com/engine/install/ubuntu/)
  4. Reinstall nvidia-docker2 (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker)
  5. Run Nvidia Container (Arm64/aarch64) docker run --gpus all -it --rm --network host --volume ~/nvdli-data:/nvdli-nano/data --device /dev/video0 nvcr.io/nvidia/dli/dli-nano-ai:v2.0.1-r32.5.0

3. Information to attach (optional if deemed irrelevant)

-- WARNING, the following logs are for debugging purposes only --

I0412 19:56:22.542498 6496 nvc.c:372] initializing library context (version=1.3.3, build=bd9fc3f2b642345301cb2e23de07ec5386232317)
I0412 19:56:22.542617 6496 nvc.c:346] using root /
I0412 19:56:22.542633 6496 nvc.c:347] using ldcache /etc/ld.so.cache
I0412 19:56:22.542647 6496 nvc.c:348] using unprivileged user 1000:1000
I0412 19:56:22.542726 6496 nvc.c:389] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0412 19:56:22.543074 6496 nvc.c:391] dxcore initialization failed, continuing assuming a non-WSL environment
W0412 19:56:22.543425 6496 nvc.c:254] failed to detect NVIDIA devices
W0412 19:56:22.543857 6497 nvc.c:269] failed to set inheritable capabilities
W0412 19:56:22.543961 6497 nvc.c:270] skipping kernel modules load due to failure
I0412 19:56:22.544487 6498 driver.c:101] starting driver service
E0412 19:56:22.545001 6498 driver.c:161] could not start driver service: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory
I0412 19:56:22.545295 6496 driver.c:196] driver service terminated successfully
nvidia-container-cli: initialization error: driver error: failed to process request

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X1" CUDA Driver Version / Runtime Version 10.2 / 10.2 CUDA Capability Major/Minor version number: 5.3 Total amount of global memory: 3964 MBytes (4156694528 bytes) ( 1) Multiprocessors, (128) CUDA Cores/MP: 128 CUDA Cores GPU Max Clock rate: 922 MHz (0.92 GHz) Memory Clock rate: 13 Mhz Memory Bus Width: 64-bit L2 Cache Size: 262144 bytes Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096) Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 49152 bytes Total number of registers available per block: 32768 Warp size: 32 Maximum number of threads per multiprocessor: 2048 Maximum number of threads per block: 1024 Max dimension size of a thread block (x,y,z): (1024, 1024, 64) Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) Maximum memory pitch: 2147483647 bytes Texture alignment: 512 bytes Concurrent copy and kernel execution: Yes with 1 copy engine(s) Run time limit on kernels: Yes Integrated GPU sharing Host Memory: Yes Support host page-locked memory mapping: Yes Alignment requirement for Surfaces: Yes Device has ECC support: Disabled Device supports Unified Addressing (UVA): Yes Device supports Compute Preemption: No Supports Cooperative Kernel Launch: No Supports MultiDevice Co-op Kernel Launch: No Device PCI Domain ID / Bus ID / location ID: 0 / 0 / 0 Compute Mode: < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.2, NumDevs = 1 Result = PASS

Description: I started to remove iptables (sudo apt remove iptables), and the package manager removed Docker and Nvidia-Docker in the process. All of the NVIDIA packages (sudo dpkg-query -l | grep nvidia) seem to be intact, however.

Other than reflashing the Jetson Nano with the SD card image for JetPack 4.5.1, is there a way to simply reinstall the correct versions of Docker and nvidia-docker2 that were used in the original JetPack image (https://developer.nvidia.com/jetson-nano-sd-card-image.zip)?

Also, can Docker be reinstalled with sudo or does Docker have to be installed as root (sudo su)?

I tried to reinstall Docker by following the instructions at the following links:

[Installation Guide — NVIDIA Cloud Native Technologies documentation](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker)
[Install Docker Engine on Ubuntu | Docker Documentation](https://docs.docker.com/engine/install/ubuntu/) 

Docker version 19.03 is working on the Nano, verified by running "sudo docker run hello-world". Nvidia-Docker (NVIDIA Container Toolkit) was also installed successfully.

However, when I verified the install using the following command, I ran into a runtime error (a driver issue?):

Command: (Cuda 10) docker run --gpus all -it --rm --network host --volume ~/nvdli-data:/nvdli-nano/data --device /dev/video0 nvcr.io/nvidia/dli/dli-nano-ai:v2.0.1-r32.5.0

Error: docker: Error response from daemon: OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request: unknown.

Nvidia's Response: "The nvidia-docker2_2.2.0-1_all.deb is included in the JetPack. Please use SDKmanager to download it (click reflash just for downloading the package)."

Can the debian package be downloaded/installed directly from the Nano using Apt package manager?

user@nano:~$ cat /etc/apt/sources.list.d/nvidia-container-runtime.list*

deb https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/$(ARCH) /

deb https://nvidia.github.io/libnvidia-container/experimental/ubuntu18.04/$(ARCH) /

deb https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/$(ARCH) /

deb https://nvidia.github.io/nvidia-container-runtime/experimental/ubuntu18.04/$(ARCH) /

user@nano:~$ cat /etc/apt/sources.list.d/nvidia-docker.list

deb https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/$(ARCH) /

deb https://nvidia.github.io/libnvidia-container/experimental/ubuntu18.04/$(ARCH) /

deb https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/$(ARCH) /

deb https://nvidia.github.io/nvidia-container-runtime/experimental/ubuntu18.04/$(ARCH) /

deb https://nvidia.github.io/nvidia-docker/ubuntu18.04/$(ARCH) /

kaisark commented 3 years ago

SOLVED:

There were some package version conflicts (an apt package repository issue), so I uninstalled and reinstalled nvidia-docker (and related packages) and cleaned up the apt repository issues (disabled some of the auto updates/upgrades, etc.). I essentially restored the JetPack versions of the packages downloaded with NVIDIA's sdkmanager tool.
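
One way to keep a later apt update && apt upgrade from replacing these packages again (a sketch, not necessarily what was done here) is to hold them at their JetPack versions:

# keep apt upgrade from pulling these forward past the JetPack versions
sudo apt-mark hold nvidia-docker2 nvidia-container-toolkit nvidia-container-runtime libnvidia-container-tools libnvidia-container0
# verify the holds
sudo apt-mark showhold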

nvidia-docker and nvidia-container-cli are working again and I can run NVIDIA containers from NVIDIA's container catalog (https://ngc.nvidia.com).

Steps:

  1. uninstall nvidia-docker2 (sudo apt remove nvidia-docker2) - nvidia-container-toolkit and nvidia-container-runtime are also removed in the process
  2. sudo apt remove libnvidia-container1
  3. sudo dpkg -i libnvidia-container-tools_0.9.0_beta.1_arm64.deb
  4. sudo dpkg -i libnvidia-container-tools_0.9.0_beta.1_arm64.deb
  5. sudo dpkg -i nvidia-container-toolkit_1.0.1-1_arm64.deb
  6. sudo dpkg -i nvidia-container-runtime_3.1.0-1_arm64.deb
  7. sudo dpkg -i nvidia-docker2_2.2.0-1_all.deb
  8. restart docker (sudo systemctl restart docker)

nvidia-container-toolkit.log

I'll try to add more details when I get a chance, but note that the uninstall/install order of the packages does matter.
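
As a quick sanity check after step 8, something like the following (a sketch, using only commands and package names that already appear in this thread) confirms the installed versions and that the nvidia runtime is registered with Docker, before re-running the docker run command from the original report:

# installed versions of the container packages
dpkg -l | grep -E 'nvidia-docker2|nvidia-container|libnvidia-container'
# the Runtimes line should list "nvidia"
sudo docker info | grep -i runtime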

pouss06 commented 3 years ago

I'm having the same issue upgrading docker from 19 to 20.

@kaisark How do you get libnvidia-container-tools_0.9.0_beta.1_arm64.deb ?

NVIDIA's sdkmanager does not allow you to choose a package version.

I added https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/$(ARCH) to /etc/apt/sources.list.d/nvidia-container-runtime.list, but after apt update the oldest version of libnvidia-container-tools is the buggy 0.10.0. With apt-cache policy I can see all the newer versions of the package up to 1.5.1-1. I tried upgrading to it, but I still have the same issue. I also tried upgrading nvidia-docker2 up to the latest 2.6.0-1, and I still have the issue.

By the way, I did restart Docker after each upgrade.

elezar commented 3 years ago

The "standard" repositories should not currently be used when installing the NVIDIA container stack (expecially libnvidia-container-tools) on Jetson devices. Please ensure that the nvidia.github.io repositories for the components are removed from your package lists and only those defined in the Jetpack SDK are available.

We are working on improving the experience going forward.

pouss06 commented 3 years ago

Thank you @elezar, but the packages defined in the Jetpack SDK are buggy.

Everything works fine with Docker 19, but when I do an apt update && apt upgrade, Docker is upgraded from version 19 to 20 and the --runtime=nvidia and --gpus all options are no longer supported. The docker run crashes with the error pasted by @kaisark:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495:

After my updates I still have the same error, except the line numbers are now different (e.g. container_linux.go:380).

I need Docker 20, so how can I upgrade Docker through the packages provided by JetPack without hitting this issue? I'm on a Jetson Nano; I had the same issue with JetPack 4.4.1 and 4.6.

elezar commented 3 years ago

Did the apt update && apt upgrade also change the versions of libnvidia-container-tools, nvidia-container-runtime, nvidia-container-toolkit, and nvidia-docker2? Which versions are currently installed?

Here, the most important version is the libnvidia-container-tools version which must be 0.9.0_beta.1 at present. There is a 0.10.0 version that is being prepped for release with some minor fixes, but I don't have a timeline for when these will be out.
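
A quick way to check what is currently installed (a sketch, using the same commands shown elsewhere in this thread):

# currently installed container packages and versions
apt list --installed | grep nvidia
# all versions of libnvidia-container-tools visible to apt
apt list -a libnvidia-container-tools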

pouss06 commented 3 years ago

> Did the apt update && apt upgrade also change the versions of libnvidia-container-tools, nvidia-container-runtime, nvidia-container-toolkit, and nvidia-docker2? Which versions are currently installed?

I don't know.

I first tried with the packages provided by the Jetpack repo (repo.download.nvidia.com)

libnvidia-container-tools   0.10.0
nvidia-container-runtime 3.1.0-1
nvidia-container-toolkit 1.0.1-1
nvidia-docker2  2.2.0-1

I then tried the latest packages provided by nvidia.github.io, as mentioned in my previous post:

libnvidia-container-tools 1.5.1-1
nvidia-container-runtime 3.5.0-1
nvidia-container-toolkit 1.5.1-1
nvidia-docker2  2.6.0-1

> Here, the most important version is the libnvidia-container-tools version which must be 0.9.0_beta.1 at present. There is a 0.10.0 version that is being prepped for release with some minor fixes, but I don't have a timeline for when these will be out.

As I asked earlier, where can I find libnvidia-container-tools 0.9.0_beta.1? The repository https://repo.download.nvidia.com/jetson/common only gives me 0.10.0. Is there a URL like https://repo.download.nvidia.com/jetson/experimental?

klueska commented 3 years ago

As @elezar mentioned, for Jetson you need to be using the packages from https://repo.download.nvidia.com/jetson/common. The packages from https://nvidia.github.io/libnvidia-container will not work.

Since the v0.10.0 release of libnvidia-container is out, the only way to get v0.9.0 is to explicitly list it on the command line when you install those packages.
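
The syntax for that would be something like the following — a sketch, assuming the repository still publishes the 0.9.0~beta.1 version string shown later in this thread:

# check which versions the configured repositories offer
apt list -a libnvidia-container-tools
# pin both the library and the tools package to the 0.9.0 release
sudo apt-get install libnvidia-container0=0.9.0~beta.1 libnvidia-container-tools=0.9.0~beta.1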

However, can you explain what is buggy about v0.10.0 of the libnvidia-container1 and libnvidia-container-tools packages? They were meant to be backwards compatible with v0.9.0, so it would be good to understand what issue you are facing.

elezar commented 3 years ago

It may be that the specific package version is being selected based on your Jetpack version. On my local Nano I have:

$ cat /etc/apt/sources.list.d/nvidia-l4t-apt-source.list
deb https://repo.download.nvidia.com/jetson/common r32.5 main
deb https://repo.download.nvidia.com/jetson/t210 r32.5 main

I haven't updated my Jetpack distribution in a while.

$ apt list -a  libnvidia-container-tools
Listing... Done
libnvidia-container-tools/stable,now 0.9.0~beta.1 arm64 [installed]

pouss06 commented 3 years ago

> As @elezar mentioned, for Jetson you need to be using the packages from https://repo.download.nvidia.com/jetson/common. The packages from https://nvidia.github.io/libnvidia-container will not work.

Don't worry, I fully understood; I was just listing everything I tried before posting here. I reinstalled all the packages from repo.download.nvidia.com:

$ apt list --installed | grep nvidia
libnvidia-container-tools/stable,now 0.10.0+jetpack arm64 [installed]
libnvidia-container0/stable,now 0.10.0+jetpack arm64 [installed]
nvidia-container-csv-cuda/stable,now 10.2.460-1 arm64 [installed]
nvidia-container-csv-cudnn/stable,now 8.2.1.32-1+cuda10.2 arm64 [installed]
nvidia-container-csv-tensorrt/stable,now 8.0.1.6-1+cuda10.2 arm64 [installed]
nvidia-container-csv-visionworks/stable,now 1.6.0.501 arm64 [installed]
nvidia-container-runtime/stable,now 3.1.0-1 arm64 [installed]
nvidia-container-toolkit/stable,now 1.0.1-1 arm64 [installed,automatic]
nvidia-docker2/stable,now 2.2.0-1 all [installed]
[...]

> Since the v0.10.0 release of libnvidia-container is out, the only way to get v0.9.0 is to explicitly list it on the command line when you install those packages.

How can I explicitly list 0.9.0 when it does not seem to be present in https://repo.download.nvidia.com/jetson/common?

$ apt list -a  libnvidia-container-tools
Listing... Done
libnvidia-container-tools/stable 0.10.0+jetpack arm64
$ cat /etc/apt/sources.list.d/nvidia-l4t-apt-source.list
deb https://repo.download.nvidia.com/jetson/common r32.6 main
deb https://repo.download.nvidia.com/jetson/t210 r32.6 main

> However, can you explain what is buggy about v0.10.0 of the libnvidia-container1 and libnvidia-container-tools packages? They were meant to be backwards compatible with v0.9.0, so it would be good to understand what issue you are facing.

After rolling back to the packages from repo.download.nvidia.com/jetson/common, I no longer have the issue. I really do not understand why; I had the issue on 4.4 and 4.6 without touching the apt repositories.

What I had was a crash from docker run just after launching it, and it happened only with --runtime nvidia. The error is mentioned above.

Anyway, I need to roll back to JetPack 4.5 since I need DeepStream and it's not supported on 4.6. After reinstalling JetPack 4.5 I'll let you know if I still experience the problem, and what the result of apt list -a libnvidia-container-tools is.

Thank you for the support

aniongithub commented 3 years ago

This procedure is also useful for people who don't use JetPack to manually install their images but instead create a custom image from scratch, like this one: https://github.com/aniongithub/jetson-nano-image/releases

This is useful for smaller image sizes and non-interactive custom builds that require no manual configuration.

michael-sbarra commented 2 years ago

I am facing a similar issue. I'm using a Jetson Nano B02 with JetPack 4.6.1 / L4T 32.6.1. I am able to get the Docker container to run (with either --runtime nvidia or --gpus all), however torchvision cannot be imported. I've tried the images nvcr.io/nvidia/dli/dli-nano-ai:v2.0.1-r32.6.1 and nvcr.io/nvidia/l4t-pytorch:r32.6.1-pth1.9-py3, both resulting in the same error.

The installed versions of the deps:

$ apt list --installed | grep nvidia
libnvidia-container-tools/stable,now 0.10.0+jetpack arm64 [installed]
libnvidia-container0/stable,now 0.10.0+jetpack arm64 [installed,automatic]
nvidia-container-runtime/stable,now 3.1.0-1 arm64 [installed]
nvidia-container-toolkit/stable,now 1.0.1-1 arm64 [installed]
nvidia-docker2/stable,now 2.2.0-1 all [installed]

From a python terminal from within either container:

>>> import torchvision.transforms as transforms
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/torchvision-0.10.0a0+300a8a4-py3.6-linux-aarch64.egg/torchvision/__init__.py", line 6, in <module>
    from torchvision import models
  File "/usr/local/lib/python3.6/dist-packages/torchvision-0.10.0a0+300a8a4-py3.6-linux-aarch64.egg/torchvision/models/__init__.py", line 1, in <module>
    from .alexnet import *
  File "/usr/local/lib/python3.6/dist-packages/torchvision-0.10.0a0+300a8a4-py3.6-linux-aarch64.egg/torchvision/models/alexnet.py", line 1, in <module>
    import torch
  File "/usr/local/lib/python3.6/dist-packages/torch/__init__.py", line 196, in <module>
    _load_global_deps()
  File "/usr/local/lib/python3.6/dist-packages/torch/__init__.py", line 149, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/usr/lib/python3.6/ctypes/__init__.py", line 348, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcurand.so.10: cannot open shared object file: No such file or directory

Looking at the libs in the cuda dir within either container:

$ ls -lA /usr/local/cuda-10.2/lib64/
total 1536
-rw-r--r-- 1 root root 679636 Jul 23 18:11 libcudadevrt.a
-rw-r--r-- 1 root root 888074 Jul 23 18:11 libcudart_static.a
drwxr-xr-x 2 root root   4096 Jul 23 18:19 stubs

Looking at the libs in the cuda dir from host:

$ ls -lA /usr/local/cuda-10.2/lib64/
total 2259940
lrwxrwxrwx 1 root root        17 Mar  1  2021 libcublasLt.so -> libcublasLt.so.10
lrwxrwxrwx 1 root root        25 Mar  1  2021 libcublasLt.so.10 -> libcublasLt.so.10.2.3.300
-rw-r--r-- 1 root root  33562824 Mar  1  2021 libcublasLt.so.10.2.3.300
-rw-r--r-- 1 root root  36011742 Mar  1  2021 libcublasLt_static.a
lrwxrwxrwx 1 root root        15 Mar  1  2021 libcublas.so -> libcublas.so.10
lrwxrwxrwx 1 root root        23 Mar  1  2021 libcublas.so.10 -> libcublas.so.10.2.3.300
-rw-r--r-- 1 root root  81096256 Mar  1  2021 libcublas.so.10.2.3.300
-rw-r--r-- 1 root root  96903266 Mar  1  2021 libcublas_static.a
-rw-r--r-- 1 root root    679636 Mar  1  2021 libcudadevrt.a
lrwxrwxrwx 1 root root        17 Mar  1  2021 libcudart.so -> libcudart.so.10.2
lrwxrwxrwx 1 root root        21 Mar  1  2021 libcudart.so.10.2 -> libcudart.so.10.2.300
-rw-r--r-- 1 root root    490664 Mar  1  2021 libcudart.so.10.2.300
-rw-r--r-- 1 root root    888074 Mar  1  2021 libcudart_static.a
lrwxrwxrwx 1 root root        14 Mar  1  2021 libcufft.so -> libcufft.so.10
lrwxrwxrwx 1 root root        22 Mar  1  2021 libcufft.so.10 -> libcufft.so.10.1.2.300
-rw-r--r-- 1 root root 201494704 Mar  1  2021 libcufft.so.10.1.2.300
-rw-r--r-- 1 root root 192531512 Mar  1  2021 libcufft_static.a
-rw-r--r-- 1 root root 210524874 Mar  1  2021 libcufft_static_nocallback.a
lrwxrwxrwx 1 root root        15 Mar  1  2021 libcufftw.so -> libcufftw.so.10
lrwxrwxrwx 1 root root        23 Mar  1  2021 libcufftw.so.10 -> libcufftw.so.10.1.2.300
-rw-r--r-- 1 root root    503192 Mar  1  2021 libcufftw.so.10.1.2.300
-rw-r--r-- 1 root root     31970 Mar  1  2021 libcufftw_static.a
lrwxrwxrwx 1 root root        18 Mar  1  2021 libcuinj64.so -> libcuinj64.so.10.2
lrwxrwxrwx 1 root root        22 Mar  1  2021 libcuinj64.so.10.2 -> libcuinj64.so.10.2.300
-rw-r--r-- 1 root root   1535464 Mar  1  2021 libcuinj64.so.10.2.300
-rw-r--r-- 1 root root     33242 Mar  1  2021 libculibos.a
lrwxrwxrwx 1 root root        16 Mar  1  2021 libcupti.so -> libcupti.so.10.2
lrwxrwxrwx 1 root root        20 Mar  1  2021 libcupti.so.10.2 -> libcupti.so.10.2.175
-rw-r--r-- 1 root root   4526616 Mar  1  2021 libcupti.so.10.2.175
lrwxrwxrwx 1 root root        15 Mar  1  2021 libcurand.so -> libcurand.so.10
lrwxrwxrwx 1 root root        23 Mar  1  2021 libcurand.so.10 -> libcurand.so.10.1.2.300
-rw-r--r-- 1 root root  62698584 Mar  1  2021 libcurand.so.10.1.2.300
-rw-r--r-- 1 root root  62767380 Mar  1  2021 libcurand_static.a
lrwxrwxrwx 1 root root        17 Mar  1  2021 libcusolver.so -> libcusolver.so.10
lrwxrwxrwx 1 root root        25 Mar  1  2021 libcusolver.so.10 -> libcusolver.so.10.3.0.300
-rw-r--r-- 1 root root 218927328 Mar  1  2021 libcusolver.so.10.3.0.300
-rw-r--r-- 1 root root 123895098 Mar  1  2021 libcusolver_static.a
lrwxrwxrwx 1 root root        17 Mar  1  2021 libcusparse.so -> libcusparse.so.10
lrwxrwxrwx 1 root root        25 Mar  1  2021 libcusparse.so.10 -> libcusparse.so.10.3.1.300
-rw-r--r-- 1 root root 141252584 Mar  1  2021 libcusparse.so.10.3.1.300
-rw-r--r-- 1 root root 149512102 Mar  1  2021 libcusparse_static.a
-rw-r--r-- 1 root root   8319056 Mar  1  2021 liblapack_static.a
-rw-r--r-- 1 root root    909274 Mar  1  2021 libmetis_static.a
lrwxrwxrwx 1 root root        13 Mar  1  2021 libnppc.so -> libnppc.so.10
lrwxrwxrwx 1 root root        21 Mar  1  2021 libnppc.so.10 -> libnppc.so.10.2.1.300
-rw-r--r-- 1 root root    503184 Mar  1  2021 libnppc.so.10.2.1.300
-rw-r--r-- 1 root root     26846 Mar  1  2021 libnppc_static.a
lrwxrwxrwx 1 root root        15 Mar  1  2021 libnppial.so -> libnppial.so.10
lrwxrwxrwx 1 root root        23 Mar  1  2021 libnppial.so.10 -> libnppial.so.10.2.1.300
-rw-r--r-- 1 root root  11509472 Mar  1  2021 libnppial.so.10.2.1.300
-rw-r--r-- 1 root root  14410930 Mar  1  2021 libnppial_static.a
lrwxrwxrwx 1 root root        15 Mar  1  2021 libnppicc.so -> libnppicc.so.10
lrwxrwxrwx 1 root root        23 Mar  1  2021 libnppicc.so.10 -> libnppicc.so.10.2.1.300
-rw-r--r-- 1 root root   4914920 Mar  1  2021 libnppicc.so.10.2.1.300
-rw-r--r-- 1 root root   5722536 Mar  1  2021 libnppicc_static.a
lrwxrwxrwx 1 root root        16 Mar  1  2021 libnppicom.so -> libnppicom.so.10
lrwxrwxrwx 1 root root        24 Mar  1  2021 libnppicom.so.10 -> libnppicom.so.10.2.1.300
-rw-r--r-- 1 root root   1453728 Mar  1  2021 libnppicom.so.10.2.1.300
-rw-r--r-- 1 root root   1093680 Mar  1  2021 libnppicom_static.a
lrwxrwxrwx 1 root root        16 Mar  1  2021 libnppidei.so -> libnppidei.so.10
lrwxrwxrwx 1 root root        24 Mar  1  2021 libnppidei.so.10 -> libnppidei.so.10.2.1.300
-rw-r--r-- 1 root root   8175688 Mar  1  2021 libnppidei.so.10.2.1.300
-rw-r--r-- 1 root root  10762478 Mar  1  2021 libnppidei_static.a
lrwxrwxrwx 1 root root        14 Mar  1  2021 libnppif.so -> libnppif.so.10
lrwxrwxrwx 1 root root        22 Mar  1  2021 libnppif.so.10 -> libnppif.so.10.2.1.300
-rw-r--r-- 1 root root  54362944 Mar  1  2021 libnppif.so.10.2.1.300
-rw-r--r-- 1 root root  58471042 Mar  1  2021 libnppif_static.a
lrwxrwxrwx 1 root root        14 Mar  1  2021 libnppig.so -> libnppig.so.10
lrwxrwxrwx 1 root root        22 Mar  1  2021 libnppig.so.10 -> libnppig.so.10.2.1.300
-rw-r--r-- 1 root root  28761920 Mar  1  2021 libnppig.so.10.2.1.300
-rw-r--r-- 1 root root  31432462 Mar  1  2021 libnppig_static.a
lrwxrwxrwx 1 root root        14 Mar  1  2021 libnppim.so -> libnppim.so.10
lrwxrwxrwx 1 root root        22 Mar  1  2021 libnppim.so.10 -> libnppim.so.10.2.1.300
-rw-r--r-- 1 root root   7163640 Mar  1  2021 libnppim.so.10.2.1.300
-rw-r--r-- 1 root root   7396476 Mar  1  2021 libnppim_static.a
lrwxrwxrwx 1 root root        15 Mar  1  2021 libnppist.so -> libnppist.so.10
lrwxrwxrwx 1 root root        23 Mar  1  2021 libnppist.so.10 -> libnppist.so.10.2.1.300
-rw-r--r-- 1 root root  20877336 Mar  1  2021 libnppist.so.10.2.1.300
-rw-r--r-- 1 root root  23399160 Mar  1  2021 libnppist_static.a
lrwxrwxrwx 1 root root        15 Mar  1  2021 libnppisu.so -> libnppisu.so.10
lrwxrwxrwx 1 root root        23 Mar  1  2021 libnppisu.so.10 -> libnppisu.so.10.2.1.300
-rw-r--r-- 1 root root    486576 Mar  1  2021 libnppisu.so.10.2.1.300
-rw-r--r-- 1 root root     11458 Mar  1  2021 libnppisu_static.a
lrwxrwxrwx 1 root root        15 Mar  1  2021 libnppitc.so -> libnppitc.so.10
lrwxrwxrwx 1 root root        23 Mar  1  2021 libnppitc.so.10 -> libnppitc.so.10.2.1.300
-rw-r--r-- 1 root root   3112480 Mar  1  2021 libnppitc.so.10.2.1.300
-rw-r--r-- 1 root root   3205362 Mar  1  2021 libnppitc_static.a
lrwxrwxrwx 1 root root        13 Mar  1  2021 libnpps.so -> libnpps.so.10
lrwxrwxrwx 1 root root        21 Mar  1  2021 libnpps.so.10 -> libnpps.so.10.2.1.300
-rw-r--r-- 1 root root   9539760 Mar  1  2021 libnpps.so.10.2.1.300
-rw-r--r-- 1 root root  10690508 Mar  1  2021 libnpps_static.a
lrwxrwxrwx 1 root root        15 Mar  1  2021 libnvblas.so -> libnvblas.so.10
lrwxrwxrwx 1 root root        23 Mar  1  2021 libnvblas.so.10 -> libnvblas.so.10.2.3.300
-rw-r--r-- 1 root root    540232 Mar  1  2021 libnvblas.so.10.2.3.300
lrwxrwxrwx 1 root root        16 Mar  1  2021 libnvgraph.so -> libnvgraph.so.10
lrwxrwxrwx 1 root root        22 Mar  1  2021 libnvgraph.so.10 -> libnvgraph.so.10.2.300
-rw-r--r-- 1 root root 165012616 Mar  1  2021 libnvgraph.so.10.2.300
-rw-r--r-- 1 root root 168141386 Mar  1  2021 libnvgraph_static.a
-rw-r--r-- 1 root root   7430712 Mar  1  2021 libnvperf_host.so
-rw-r--r-- 1 root root   1096016 Mar  1  2021 libnvperf_target.so
lrwxrwxrwx 1 root root        25 Mar  1  2021 libnvrtc-builtins.so -> libnvrtc-builtins.so.10.2
lrwxrwxrwx 1 root root        29 Mar  1  2021 libnvrtc-builtins.so.10.2 -> libnvrtc-builtins.so.10.2.300
-rw-r--r-- 1 root root   4794168 Mar  1  2021 libnvrtc-builtins.so.10.2.300
lrwxrwxrwx 1 root root        16 Mar  1  2021 libnvrtc.so -> libnvrtc.so.10.2
lrwxrwxrwx 1 root root        20 Mar  1  2021 libnvrtc.so.10.2 -> libnvrtc.so.10.2.300
-rw-r--r-- 1 root root  20432800 Mar  1  2021 libnvrtc.so.10.2.300
lrwxrwxrwx 1 root root        18 Mar  1  2021 libnvToolsExt.so -> libnvToolsExt.so.1
lrwxrwxrwx 1 root root        22 Mar  1  2021 libnvToolsExt.so.1 -> libnvToolsExt.so.1.0.0
-rw-r--r-- 1 root root     44088 Mar  1  2021 libnvToolsExt.so.1.0.0
drwxr-xr-x 2 root root      4096 Oct 10 14:35 stubs

nvcc from host:

$ /usr/local/cuda-10.2/bin/nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_28_22:34:44_PST_2021
Cuda compilation tools, release 10.2, V10.2.300
Build cuda_10.2_r440.TC440_70.29663091_0

nvcc from within either container:

$ /usr/local/cuda-10.2/bin/nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_28_22:34:44_PST_2021
Cuda compilation tools, release 10.2, V10.2.300
Build cuda_10.2_r440.TC440_70.29663091_0
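
For reference, on JetPack the CUDA libraries are mounted into containers from the host based on CSV files provided by the nvidia-container-csv-* packages. A quick check — a sketch, assuming the usual JetPack location of /etc/nvidia-container-runtime/host-files-for-container.d/ — would be:

# are the CSV packages installed at all?
apt list --installed 2>/dev/null | grep nvidia-container-csv
# is libcurand listed for mounting into containers?
grep -r libcurand /etc/nvidia-container-runtime/host-files-for-container.d/

If nvidia-container-csv-cuda is missing, or libcurand.so.10 is not listed there, the runtime has nothing to mount, which could explain the nearly empty /usr/local/cuda-10.2/lib64 inside the container.
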
hvalev commented 2 years ago

I am having the exact same issue with Jetpack 4.6.0 on the following system:

 - NVIDIA Jetson AGX Xavier [16GB]
   * Jetpack 4.6 [L4T 32.6.1]
   * NV Power Mode: MODE_30W_ALL - Type: 3
   * jetson_stats.service: active
 - Libraries:
   * CUDA: 10.2.300
   * cuDNN: 8.2.1.32
   * TensorRT: 8.0.1.6
   * Visionworks: 1.6.0.501
   * OpenCV: 4.1.1 compiled CUDA: NO
   * VPI: ii libnvvpi1 1.1.15 arm64 NVIDIA Vision Programming Interface library
   * Vulkan: 1.2.70

docker version:

Client:
 Version:           20.10.7
 API version:       1.41
 Go version:        go1.13.8
 Git commit:        20.10.7-0ubuntu5~18.04.3
 Built:             Mon Nov  1 01:04:31 2021
 OS/Arch:           linux/arm64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.7
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.8
  Git commit:       20.10.7-0ubuntu5~18.04.3
  Built:            Fri Oct 22 00:57:37 2021
  OS/Arch:          linux/arm64
  Experimental:     false
 containerd:
  Version:          1.5.5-0ubuntu3~18.04.2
  GitCommit:        
 runc:
  Version:          1.0.1-0ubuntu2~18.04.1
  GitCommit:        
 docker-init:
  Version:          0.19.0
  GitCommit: 

and the following package setup:

libnvidia-container-tools/stable,now 0.10.0+jetpack arm64 [installed]
libnvidia-container0/stable,now 0.10.0+jetpack arm64 [installed,automatic]
nvidia-container-csv-cuda/stable,now 10.2.460-1 arm64 [installed]
nvidia-container-csv-cudnn/stable,now 8.2.1.32-1+cuda10.2 arm64 [installed]
nvidia-container-csv-tensorrt/stable,now 8.0.1.6-1+cuda10.2 arm64 [installed]
nvidia-container-csv-visionworks/stable,now 1.6.0.501 arm64 [installed]
nvidia-container-runtime/stable,now 3.1.0-1 arm64 [installed,automatic]
nvidia-container-toolkit/stable,now 1.0.1-1 arm64 [installed,automatic]
nvidia-docker2/stable,now 2.2.0-1 all [installed]
nvidia-l4t-3d-core/stable,now 32.6.1-20210916210945 arm64 [installed]
nvidia-l4t-apt-source/stable,now 32.6.1-20210726122859 arm64 [installed]
nvidia-l4t-bootloader/stable,now 32.6.1-20210726122859 arm64 [installed]
nvidia-l4t-camera/stable,now 32.6.1-20210916210945 arm64 [installed]
nvidia-l4t-configs/stable,now 32.6.1-20210726122859 arm64 [installed]
nvidia-l4t-core/stable,now 32.6.1-20210726122859 arm64 [installed]
nvidia-l4t-cuda/stable,now 32.6.1-20210916210945 arm64 [installed]
nvidia-l4t-firmware/stable,now 32.6.1-20210916210945 arm64 [installed]
nvidia-l4t-gputools/stable,now 32.6.1-20210726122859 arm64 [installed]
nvidia-l4t-graphics-demos/stable,now 32.6.1-20210916210945 arm64 [installed]
nvidia-l4t-gstreamer/stable,now 32.6.1-20210916210945 arm64 [installed]
nvidia-l4t-init/stable,now 32.6.1-20210916210945 arm64 [installed]
nvidia-l4t-initrd/stable,now 32.6.1-20210726122859 arm64 [installed]
nvidia-l4t-jetson-io/stable,now 32.6.1-20210726122859 arm64 [installed]
nvidia-l4t-jetson-multimedia-api/stable,now 32.6.1-20210916210945 arm64 [installed]
nvidia-l4t-kernel/stable,now 4.9.253-tegra-32.6.1-20210726122859 arm64 [installed]
nvidia-l4t-kernel-dtbs/stable,now 4.9.253-tegra-32.6.1-20210726122859 arm64 [installed]
nvidia-l4t-kernel-headers/stable,now 4.9.253-tegra-32.6.1-20210726122859 arm64 [installed]
nvidia-l4t-libvulkan/stable,now 32.6.1-20210916210945 arm64 [installed]
nvidia-l4t-multimedia/stable,now 32.6.1-20210916210945 arm64 [installed]
nvidia-l4t-multimedia-utils/stable,now 32.6.1-20210916210945 arm64 [installed]
nvidia-l4t-oem-config/stable,now 32.6.1-20210726122859 arm64 [installed]
nvidia-l4t-tools/stable,now 32.6.1-20210726122859 arm64 [installed]
nvidia-l4t-wayland/stable,now 32.6.1-20210916210945 arm64 [installed]
nvidia-l4t-weston/stable,now 32.6.1-20210916210945 arm64 [installed]
nvidia-l4t-x11/stable,now 32.6.1-20210916210945 arm64 [installed]
nvidia-l4t-xusb-firmware/stable,now 32.6.1-20210726122859 arm64 [installed]

The problem happens when trying to build a custom Docker image using nvcr.io/nvidia/l4t-ml:r32.6.1-py3 as the base and installing the mmcv-full library. Looking within the container, similarly to @michael-sbarra, I can also see that libcurand.so.10 exists and is properly linked. LD_LIBRARY_PATH as well as PATH contain the path to it as well.

klueska commented 2 years ago

@hvalev Are you sure you are seeing an error of:

could not start driver service: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory

With the packages you list, that should not be possible.

hvalev commented 2 years ago

No, the error that I am seeing is:

>>> import torchvision.transforms as transforms
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/torchvision-0.10.0a0+300a8a4-py3.6-linux-aarch64.egg/torchvision/__init__.py", line 6, in <module>
    from torchvision import models
  File "/usr/local/lib/python3.6/dist-packages/torchvision-0.10.0a0+300a8a4-py3.6-linux-aarch64.egg/torchvision/models/__init__.py", line 1, in <module>
    from .alexnet import *
  File "/usr/local/lib/python3.6/dist-packages/torchvision-0.10.0a0+300a8a4-py3.6-linux-aarch64.egg/torchvision/models/alexnet.py", line 1, in <module>
    import torch
  File "/usr/local/lib/python3.6/dist-packages/torch/__init__.py", line 196, in <module>
    _load_global_deps()
  File "/usr/local/lib/python3.6/dist-packages/torch/__init__.py", line 149, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/usr/lib/python3.6/ctypes/__init__.py", line 348, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcurand.so.10: cannot open shared object file: No such file or directory

when I'm building the mmcv-full package with pip3

klueska commented 2 years ago

How does this relate to the original issue? In any case, it does not seem to be related to nvidia-docker itself, but rather one of the images running on top of it. You would have better luck posting in: https://forums.developer.nvidia.com/c/accelerated-computing/nvidia-gpu-cloud-ngc-users/docker-and-nvidia-docker/33

hvalev commented 2 years ago

Because it is the exact same issue @michael-sbarra is having. In any case, the command pip3 install mmcv-full executes from within the container (if I build the image up to just before that command, then exec into it and run it manually), but it does not run from within the docker build context (docker build .), where it fails with the aforementioned error message. Anyhow, I'm open to recommendations on where to take this issue.

elezar commented 2 years ago

@hvalev as a matter of interest, which nvidia-container-csv-* packages are available for installation from the Jetpack repositories?
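
Something like this should list them (a sketch):

# packages available from the configured repositories
apt-cache search nvidia-container-csv
# and the ones currently installed
apt list --installed 2>/dev/null | grep nvidia-container-csv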

hvalev commented 2 years ago

Hi, sorry for the delay. I actually need to correct myself slightly. Indeed, the problem I was experiencing was due to the fact that I had not set the NVIDIA runtime as the default runtime in /etc/docker/daemon.json. This discrepancy caused the difference between the two contexts: 1) building a Docker image from a Dockerfile, and 2) running an already built image as a container, where I was explicitly specifying the runtime in the docker run command. For reference, I'm pasting the daemon.json config below.

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
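
After editing /etc/docker/daemon.json, Docker needs to be restarted for the change to take effect, and docker info should then report nvidia as the default runtime (a sketch):

sudo systemctl restart docker
sudo docker info | grep -i 'default runtime'

With nvidia as the default runtime, docker build uses it as well, which matches the behaviour difference between docker build and docker run described above.
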
elezar commented 8 months ago

I am closing this as it seems to have been caused by an incorrectly configured Docker daemon. If there are still problems, please reopen this or create a new issue.