Closed: kaisark closed this issue 8 months ago
SOLVED:
There were some package version conflicts (an apt package repository issue), so I uninstalled and reinstalled nvidia-docker (and related packages) and cleaned up the apt repository issues (disabled some of the auto updates/upgrades, etc.). I essentially restored the Jetpack versions of the packages downloaded with NVIDIA's sdkmanager tool.
nvidia-docker and nvidia-container-cli are working again and I can run NVIDIA containers from NVIDIA's container catalog (https://ngc.nvidia.com).
Steps:
I'll try to add more details when I get a chance, but note that the uninstall/install order of the packages does matter.
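Since the detailed steps were never posted, here is a rough, hypothetical sketch of what "order matters" usually means for this stack: purge top-level packages first, then reinstall base libraries first. The package set is my assumption, and the script only prints the commands for review rather than running them.

```shell
# Hypothetical sketch only: print a purge/reinstall plan for the container
# stack in dependency order. Review each command before running it.
remove_order="nvidia-docker2 nvidia-container-toolkit nvidia-container-runtime libnvidia-container-tools libnvidia-container0"
for p in $remove_order; do
  echo "sudo apt-get purge -y $p"
done
# Reinstall in the reverse (dependency-first) order from the Jetpack repo.
install_order="libnvidia-container0 libnvidia-container-tools nvidia-container-runtime nvidia-container-toolkit nvidia-docker2"
for p in $install_order; do
  echo "sudo apt-get install -y $p"
done
```

Printing instead of executing makes it easy to diff the plan against what `apt list --installed | grep nvidia` shows on your device.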
I'm having the same issue upgrading Docker from 19 to 20.
@kaisark How did you get libnvidia-container-tools_0.9.0_beta.1_arm64.deb? NVIDIA's sdkmanager does not let you choose a package version.
I added https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/$(ARCH) to /etc/apt/sources.list.d/nvidia-container-runtime.list, but after apt update the oldest version of libnvidia-container-tools is the buggy 0.10.0. With apt-cache policy I can see all the newer versions of the package up to 1.5.1-1. I tried to upgrade it, but I still have the same issue.
I also tried to upgrade nvidia-docker2 up to the latest 2.6.0-1, and I still have the issue.
BTW: I did restart Docker after each upgrade.
The "standard" repositories should not currently be used when installing the NVIDIA container stack (especially libnvidia-container-tools) on Jetson devices. Please ensure that the nvidia.github.io repositories for the components are removed from your package lists and that only those defined in the Jetpack SDK are available.
We are working on improving the experience going forward.
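Concretely, disabling the nvidia.github.io list could look like the sketch below. The filename is the conventional one from the libnvidia-container install instructions and may differ on your system, so the script only prints the commands for review.

```shell
# Sketch: print commands to disable the nvidia.github.io apt list so that
# only the Jetpack (repo.download.nvidia.com) repositories remain.
# The filename is an assumption: check /etc/apt/sources.list.d/ on your device.
list=/etc/apt/sources.list.d/nvidia-container-runtime.list
echo "sudo mv $list $list.disabled"
echo "sudo apt-get update"
```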
Thank you @elezar, but the packages defined in the Jetpack SDK are buggy.
Everything works fine with Docker 19, but when I do an apt update && apt upgrade, Docker is upgraded from version 19 to 20, and the options --runtime=nvidia and --gpus all are no longer supported. docker run crashes with the error pasted by @kaisark:
docker: Error response from daemon: OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495:
After my updates I still have the same error, except the line numbers are now different (e.g. container_linux.go:380).
I need Docker 20, so how can I upgrade Docker through the packages provided by the Jetpack without hitting this issue? I'm on a Jetson Nano; I had the same issue with Jetpack 4.4.1 and 4.6.
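For anyone who wants to stay on Docker 19 until the container stack catches up, one workaround (my suggestion, not an official fix) is to hold the Docker packages so a blanket apt upgrade cannot move 19 to 20. The package names are assumptions; check `apt list --installed | grep -E 'docker|containerd'` first. The sketch prints the commands rather than running them.

```shell
# Sketch: print apt-mark commands to pin Docker-related packages at their
# current version, so `apt upgrade` leaves them alone.
hold_pkgs="docker.io containerd nvidia-docker2"
for p in $hold_pkgs; do
  echo "sudo apt-mark hold $p"
done
```

`sudo apt-mark unhold <pkg>` reverses this once a compatible stack is available.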
Did the apt update && apt upgrade also change the versions of libnvidia-container-tools, nvidia-container-runtime, nvidia-container-toolkit, and nvidia-docker2? Which versions are currently installed?
Here, the most important version is the libnvidia-container-tools version, which must be 0.9.0_beta.1 at present. There is a 0.10.0 version that is being prepped for release with some minor fixes, but I don't have a timeline for when these will be out.
> Did the apt update && apt upgrade also change the versions of libnvidia-container-tools, nvidia-container-runtime, nvidia-container-toolkit, and nvidia-docker2? Which versions are currently installed?
I don't know.
I first tried the packages provided by the Jetpack repo (repo.download.nvidia.com):
libnvidia-container-tools 0.10.0
nvidia-container-runtime 3.1.0-1
nvidia-container-toolkit 1.0.1-1
nvidia-docker2 2.2.0-1
I then tried the latest packages provided by nvidia.github.io, as mentioned in my previous post:
libnvidia-container-tools 1.5.1-1
nvidia-container-runtime 3.5.0-1
nvidia-container-toolkit 1.5.1-1
nvidia-docker2 2.6.0-1
> Here, the most important version is the libnvidia-container-tools version, which must be 0.9.0_beta.1 at present. There is a 0.10.0 version that is being prepped for release with some minor fixes, but I don't have a timeline for when these will be out.
As I asked earlier, where can I find libnvidia-container-tools 0.9.0_beta.1? The repository https://repo.download.nvidia.com/jetson/common only gives me 0.10.0. Is there a URL like https://repo.download.nvidia.com/jetson/experimental?
As @elezar mentioned, for Jetson you need to be using the packages from https://repo.download.nvidia.com/jetson/common. The packages from https://nvidia.github.io/libnvidia-container will not work.
Since the v0.10.0 release of libnvidia-container is out, the only way to get v0.9.0 is to explicitly list it on the command line when you install those packages.
However, can you explain what is buggy about v0.10.0 of the libnvidia-container1 and libnvidia-container-tools packages? They were meant to be backwards compatible with v0.9.0, so it would be good to understand what issue you are facing.
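"Explicitly list it on the command line" refers to apt's `pkg=version` syntax. A hypothetical sketch (the version string is the beta quoted above and must actually exist in one of your configured repositories; the script only prints the command):

```shell
# Sketch: pin a specific package version on the install command line.
# apt accepts pkg=version; verify availability first with
#   apt-cache policy libnvidia-container-tools
PKG=libnvidia-container-tools
VER='0.9.0~beta.1'
echo "sudo apt-get install ${PKG}=${VER}"
```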
It may be that the specific package version is being selected based on your Jetpack version. On my local Nano I have:
$ cat /etc/apt/sources.list.d/nvidia-l4t-apt-source.list
deb https://repo.download.nvidia.com/jetson/common r32.5 main
deb https://repo.download.nvidia.com/jetson/t210 r32.5 main
I haven't updated my Jetpack distribution in a while.
$ apt list -a libnvidia-container-tools
Listing... Done
libnvidia-container-tools/stable,now 0.9.0~beta.1 arm64 [installed]
> As @elezar mentioned, for Jetson you need to be using the packages from https://repo.download.nvidia.com/jetson/common. The packages from https://nvidia.github.io/libnvidia-container will not work.
Don't worry, I fully understood; I was just listing everything I had tried before posting here. I reinstalled all packages from repo.download.nvidia.com:
$ apt list --installed | grep nvidia
libnvidia-container-tools/stable,now 0.10.0+jetpack arm64 [installed]
libnvidia-container0/stable,now 0.10.0+jetpack arm64 [installed]
nvidia-container-csv-cuda/stable,now 10.2.460-1 arm64 [installed]
nvidia-container-csv-cudnn/stable,now 8.2.1.32-1+cuda10.2 arm64 [installed]
nvidia-container-csv-tensorrt/stable,now 8.0.1.6-1+cuda10.2 arm64 [installed]
nvidia-container-csv-visionworks/stable,now 1.6.0.501 arm64 [installed]
nvidia-container-runtime/stable,now 3.1.0-1 arm64 [installed]
nvidia-container-toolkit/stable,now 1.0.1-1 arm64 [installed,automatic]
nvidia-docker2/stable,now 2.2.0-1 all [installed]
[...]
> Since the v0.10.0 release of libnvidia-container is out, the only way to get v0.9.0 is to explicitly list it on the command line when you install those packages.
How can I explicitly list 0.9.0 when it does not seem to be present in https://repo.download.nvidia.com/jetson/common?
$ apt list -a libnvidia-container-tools
Listing... Done
libnvidia-container-tools/stable 0.10.0+jetpack arm64
$ cat /etc/apt/sources.list.d/nvidia-l4t-apt-source.list
deb https://repo.download.nvidia.com/jetson/common r32.6 main
deb https://repo.download.nvidia.com/jetson/t210 r32.6 main
> However, can you explain what is buggy about v0.10.0 of the libnvidia-container1 and libnvidia-container-tools packages? They were meant to be backwards compatible with v0.9.0, so it would be good to understand what issue you are facing.
After rolling back to the packages from repo.download.nvidia.com/jetson/common, I no longer have the issue. I really do not understand why; I had the issue on 4.4 and 4.6 without touching the apt repositories.
What I had was a crash from docker run just after launching it, and it happened only with --runtime nvidia. The error is mentioned above.
Anyway, I need to roll back to Jetpack 4.5 because I need DeepStream and it's not supported on 4.6. After reinstalling Jetpack 4.5 I'll let you know if I still experience the problem, and what the result of apt list -a libnvidia-container-tools is.
Thank you for the support.
This procedure is also useful for people who don't use Jetpack to manually install their images but instead create a custom image from scratch, like this one: https://github.com/aniongithub/jetson-nano-image/releases
This is useful for smaller image sizes and for non-interactive custom builds that require no manual configuration.
I am facing a similar issue. I'm using a Jetson Nano B02 with Jetpack 4.6.1 / L4T 32.6.1. I am able to get the Docker container to run (with either --runtime nvidia or --gpus all), however torchvision cannot be imported. I've tried the images nvcr.io/nvidia/dli/dli-nano-ai:v2.0.1-r32.6.1 and nvcr.io/nvidia/l4t-pytorch:r32.6.1-pth1.9-py3, both resulting in the same error.
The installed versions of the dependencies:
$ apt list --installed | grep nvidia
libnvidia-container-tools/stable,now 0.10.0+jetpack arm64 [installed]
libnvidia-container0/stable,now 0.10.0+jetpack arm64 [installed,automatic]
nvidia-container-runtime/stable,now 3.1.0-1 arm64 [installed]
nvidia-container-toolkit/stable,now 1.0.1-1 arm64 [installed]
nvidia-docker2/stable,now 2.2.0-1 all [installed]
From a Python terminal within either container:
>>> import torchvision.transforms as transforms
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.6/dist-packages/torchvision-0.10.0a0+300a8a4-py3.6-linux-aarch64.egg/torchvision/__init__.py", line 6, in <module>
from torchvision import models
File "/usr/local/lib/python3.6/dist-packages/torchvision-0.10.0a0+300a8a4-py3.6-linux-aarch64.egg/torchvision/models/__init__.py", line 1, in <module>
from .alexnet import *
File "/usr/local/lib/python3.6/dist-packages/torchvision-0.10.0a0+300a8a4-py3.6-linux-aarch64.egg/torchvision/models/alexnet.py", line 1, in <module>
import torch
File "/usr/local/lib/python3.6/dist-packages/torch/__init__.py", line 196, in <module>
_load_global_deps()
File "/usr/local/lib/python3.6/dist-packages/torch/__init__.py", line 149, in _load_global_deps
ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
File "/usr/lib/python3.6/ctypes/__init__.py", line 348, in __init__
self._handle = _dlopen(self._name, mode)
OSError: libcurand.so.10: cannot open shared object file: No such file or directory
Looking at the libs in the cuda dir within either container:
$ ls -lA /usr/local/cuda-10.2/lib64/
total 1536
-rw-r--r-- 1 root root 679636 Jul 23 18:11 libcudadevrt.a
-rw-r--r-- 1 root root 888074 Jul 23 18:11 libcudart_static.a
drwxr-xr-x 2 root root 4096 Jul 23 18:19 stubs
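A quick sketch for checking, from inside the container, whether the library that torch fails to dlopen is actually present. The path is the CUDA 10.2 default from the listings in this thread; a miss points at the nvidia runtime / CSV mounts rather than at torchvision itself.

```shell
# Sketch: run inside the container. Does the expected CUDA 10.2 libcurand
# exist at the path the torch import ultimately dlopens?
LIB=/usr/local/cuda-10.2/lib64/libcurand.so.10
if [ -e "$LIB" ]; then
  echo "libcurand present: $LIB"
else
  echo "libcurand missing: check the nvidia runtime / CSV mounts"
fi
```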
Looking at the libs in the cuda dir from host:
$ ls -lA /usr/local/cuda-10.2/lib64/
total 2259940
lrwxrwxrwx 1 root root 17 Mar 1 2021 libcublasLt.so -> libcublasLt.so.10
lrwxrwxrwx 1 root root 25 Mar 1 2021 libcublasLt.so.10 -> libcublasLt.so.10.2.3.300
-rw-r--r-- 1 root root 33562824 Mar 1 2021 libcublasLt.so.10.2.3.300
-rw-r--r-- 1 root root 36011742 Mar 1 2021 libcublasLt_static.a
lrwxrwxrwx 1 root root 15 Mar 1 2021 libcublas.so -> libcublas.so.10
lrwxrwxrwx 1 root root 23 Mar 1 2021 libcublas.so.10 -> libcublas.so.10.2.3.300
-rw-r--r-- 1 root root 81096256 Mar 1 2021 libcublas.so.10.2.3.300
-rw-r--r-- 1 root root 96903266 Mar 1 2021 libcublas_static.a
-rw-r--r-- 1 root root 679636 Mar 1 2021 libcudadevrt.a
lrwxrwxrwx 1 root root 17 Mar 1 2021 libcudart.so -> libcudart.so.10.2
lrwxrwxrwx 1 root root 21 Mar 1 2021 libcudart.so.10.2 -> libcudart.so.10.2.300
-rw-r--r-- 1 root root 490664 Mar 1 2021 libcudart.so.10.2.300
-rw-r--r-- 1 root root 888074 Mar 1 2021 libcudart_static.a
lrwxrwxrwx 1 root root 14 Mar 1 2021 libcufft.so -> libcufft.so.10
lrwxrwxrwx 1 root root 22 Mar 1 2021 libcufft.so.10 -> libcufft.so.10.1.2.300
-rw-r--r-- 1 root root 201494704 Mar 1 2021 libcufft.so.10.1.2.300
-rw-r--r-- 1 root root 192531512 Mar 1 2021 libcufft_static.a
-rw-r--r-- 1 root root 210524874 Mar 1 2021 libcufft_static_nocallback.a
lrwxrwxrwx 1 root root 15 Mar 1 2021 libcufftw.so -> libcufftw.so.10
lrwxrwxrwx 1 root root 23 Mar 1 2021 libcufftw.so.10 -> libcufftw.so.10.1.2.300
-rw-r--r-- 1 root root 503192 Mar 1 2021 libcufftw.so.10.1.2.300
-rw-r--r-- 1 root root 31970 Mar 1 2021 libcufftw_static.a
lrwxrwxrwx 1 root root 18 Mar 1 2021 libcuinj64.so -> libcuinj64.so.10.2
lrwxrwxrwx 1 root root 22 Mar 1 2021 libcuinj64.so.10.2 -> libcuinj64.so.10.2.300
-rw-r--r-- 1 root root 1535464 Mar 1 2021 libcuinj64.so.10.2.300
-rw-r--r-- 1 root root 33242 Mar 1 2021 libculibos.a
lrwxrwxrwx 1 root root 16 Mar 1 2021 libcupti.so -> libcupti.so.10.2
lrwxrwxrwx 1 root root 20 Mar 1 2021 libcupti.so.10.2 -> libcupti.so.10.2.175
-rw-r--r-- 1 root root 4526616 Mar 1 2021 libcupti.so.10.2.175
lrwxrwxrwx 1 root root 15 Mar 1 2021 libcurand.so -> libcurand.so.10
lrwxrwxrwx 1 root root 23 Mar 1 2021 libcurand.so.10 -> libcurand.so.10.1.2.300
-rw-r--r-- 1 root root 62698584 Mar 1 2021 libcurand.so.10.1.2.300
-rw-r--r-- 1 root root 62767380 Mar 1 2021 libcurand_static.a
lrwxrwxrwx 1 root root 17 Mar 1 2021 libcusolver.so -> libcusolver.so.10
lrwxrwxrwx 1 root root 25 Mar 1 2021 libcusolver.so.10 -> libcusolver.so.10.3.0.300
-rw-r--r-- 1 root root 218927328 Mar 1 2021 libcusolver.so.10.3.0.300
-rw-r--r-- 1 root root 123895098 Mar 1 2021 libcusolver_static.a
lrwxrwxrwx 1 root root 17 Mar 1 2021 libcusparse.so -> libcusparse.so.10
lrwxrwxrwx 1 root root 25 Mar 1 2021 libcusparse.so.10 -> libcusparse.so.10.3.1.300
-rw-r--r-- 1 root root 141252584 Mar 1 2021 libcusparse.so.10.3.1.300
-rw-r--r-- 1 root root 149512102 Mar 1 2021 libcusparse_static.a
-rw-r--r-- 1 root root 8319056 Mar 1 2021 liblapack_static.a
-rw-r--r-- 1 root root 909274 Mar 1 2021 libmetis_static.a
lrwxrwxrwx 1 root root 13 Mar 1 2021 libnppc.so -> libnppc.so.10
lrwxrwxrwx 1 root root 21 Mar 1 2021 libnppc.so.10 -> libnppc.so.10.2.1.300
-rw-r--r-- 1 root root 503184 Mar 1 2021 libnppc.so.10.2.1.300
-rw-r--r-- 1 root root 26846 Mar 1 2021 libnppc_static.a
lrwxrwxrwx 1 root root 15 Mar 1 2021 libnppial.so -> libnppial.so.10
lrwxrwxrwx 1 root root 23 Mar 1 2021 libnppial.so.10 -> libnppial.so.10.2.1.300
-rw-r--r-- 1 root root 11509472 Mar 1 2021 libnppial.so.10.2.1.300
-rw-r--r-- 1 root root 14410930 Mar 1 2021 libnppial_static.a
lrwxrwxrwx 1 root root 15 Mar 1 2021 libnppicc.so -> libnppicc.so.10
lrwxrwxrwx 1 root root 23 Mar 1 2021 libnppicc.so.10 -> libnppicc.so.10.2.1.300
-rw-r--r-- 1 root root 4914920 Mar 1 2021 libnppicc.so.10.2.1.300
-rw-r--r-- 1 root root 5722536 Mar 1 2021 libnppicc_static.a
lrwxrwxrwx 1 root root 16 Mar 1 2021 libnppicom.so -> libnppicom.so.10
lrwxrwxrwx 1 root root 24 Mar 1 2021 libnppicom.so.10 -> libnppicom.so.10.2.1.300
-rw-r--r-- 1 root root 1453728 Mar 1 2021 libnppicom.so.10.2.1.300
-rw-r--r-- 1 root root 1093680 Mar 1 2021 libnppicom_static.a
lrwxrwxrwx 1 root root 16 Mar 1 2021 libnppidei.so -> libnppidei.so.10
lrwxrwxrwx 1 root root 24 Mar 1 2021 libnppidei.so.10 -> libnppidei.so.10.2.1.300
-rw-r--r-- 1 root root 8175688 Mar 1 2021 libnppidei.so.10.2.1.300
-rw-r--r-- 1 root root 10762478 Mar 1 2021 libnppidei_static.a
lrwxrwxrwx 1 root root 14 Mar 1 2021 libnppif.so -> libnppif.so.10
lrwxrwxrwx 1 root root 22 Mar 1 2021 libnppif.so.10 -> libnppif.so.10.2.1.300
-rw-r--r-- 1 root root 54362944 Mar 1 2021 libnppif.so.10.2.1.300
-rw-r--r-- 1 root root 58471042 Mar 1 2021 libnppif_static.a
lrwxrwxrwx 1 root root 14 Mar 1 2021 libnppig.so -> libnppig.so.10
lrwxrwxrwx 1 root root 22 Mar 1 2021 libnppig.so.10 -> libnppig.so.10.2.1.300
-rw-r--r-- 1 root root 28761920 Mar 1 2021 libnppig.so.10.2.1.300
-rw-r--r-- 1 root root 31432462 Mar 1 2021 libnppig_static.a
lrwxrwxrwx 1 root root 14 Mar 1 2021 libnppim.so -> libnppim.so.10
lrwxrwxrwx 1 root root 22 Mar 1 2021 libnppim.so.10 -> libnppim.so.10.2.1.300
-rw-r--r-- 1 root root 7163640 Mar 1 2021 libnppim.so.10.2.1.300
-rw-r--r-- 1 root root 7396476 Mar 1 2021 libnppim_static.a
lrwxrwxrwx 1 root root 15 Mar 1 2021 libnppist.so -> libnppist.so.10
lrwxrwxrwx 1 root root 23 Mar 1 2021 libnppist.so.10 -> libnppist.so.10.2.1.300
-rw-r--r-- 1 root root 20877336 Mar 1 2021 libnppist.so.10.2.1.300
-rw-r--r-- 1 root root 23399160 Mar 1 2021 libnppist_static.a
lrwxrwxrwx 1 root root 15 Mar 1 2021 libnppisu.so -> libnppisu.so.10
lrwxrwxrwx 1 root root 23 Mar 1 2021 libnppisu.so.10 -> libnppisu.so.10.2.1.300
-rw-r--r-- 1 root root 486576 Mar 1 2021 libnppisu.so.10.2.1.300
-rw-r--r-- 1 root root 11458 Mar 1 2021 libnppisu_static.a
lrwxrwxrwx 1 root root 15 Mar 1 2021 libnppitc.so -> libnppitc.so.10
lrwxrwxrwx 1 root root 23 Mar 1 2021 libnppitc.so.10 -> libnppitc.so.10.2.1.300
-rw-r--r-- 1 root root 3112480 Mar 1 2021 libnppitc.so.10.2.1.300
-rw-r--r-- 1 root root 3205362 Mar 1 2021 libnppitc_static.a
lrwxrwxrwx 1 root root 13 Mar 1 2021 libnpps.so -> libnpps.so.10
lrwxrwxrwx 1 root root 21 Mar 1 2021 libnpps.so.10 -> libnpps.so.10.2.1.300
-rw-r--r-- 1 root root 9539760 Mar 1 2021 libnpps.so.10.2.1.300
-rw-r--r-- 1 root root 10690508 Mar 1 2021 libnpps_static.a
lrwxrwxrwx 1 root root 15 Mar 1 2021 libnvblas.so -> libnvblas.so.10
lrwxrwxrwx 1 root root 23 Mar 1 2021 libnvblas.so.10 -> libnvblas.so.10.2.3.300
-rw-r--r-- 1 root root 540232 Mar 1 2021 libnvblas.so.10.2.3.300
lrwxrwxrwx 1 root root 16 Mar 1 2021 libnvgraph.so -> libnvgraph.so.10
lrwxrwxrwx 1 root root 22 Mar 1 2021 libnvgraph.so.10 -> libnvgraph.so.10.2.300
-rw-r--r-- 1 root root 165012616 Mar 1 2021 libnvgraph.so.10.2.300
-rw-r--r-- 1 root root 168141386 Mar 1 2021 libnvgraph_static.a
-rw-r--r-- 1 root root 7430712 Mar 1 2021 libnvperf_host.so
-rw-r--r-- 1 root root 1096016 Mar 1 2021 libnvperf_target.so
lrwxrwxrwx 1 root root 25 Mar 1 2021 libnvrtc-builtins.so -> libnvrtc-builtins.so.10.2
lrwxrwxrwx 1 root root 29 Mar 1 2021 libnvrtc-builtins.so.10.2 -> libnvrtc-builtins.so.10.2.300
-rw-r--r-- 1 root root 4794168 Mar 1 2021 libnvrtc-builtins.so.10.2.300
lrwxrwxrwx 1 root root 16 Mar 1 2021 libnvrtc.so -> libnvrtc.so.10.2
lrwxrwxrwx 1 root root 20 Mar 1 2021 libnvrtc.so.10.2 -> libnvrtc.so.10.2.300
-rw-r--r-- 1 root root 20432800 Mar 1 2021 libnvrtc.so.10.2.300
lrwxrwxrwx 1 root root 18 Mar 1 2021 libnvToolsExt.so -> libnvToolsExt.so.1
lrwxrwxrwx 1 root root 22 Mar 1 2021 libnvToolsExt.so.1 -> libnvToolsExt.so.1.0.0
-rw-r--r-- 1 root root 44088 Mar 1 2021 libnvToolsExt.so.1.0.0
drwxr-xr-x 2 root root 4096 Oct 10 14:35 stubs
nvcc from host:
$ /usr/local/cuda-10.2/bin/nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_28_22:34:44_PST_2021
Cuda compilation tools, release 10.2, V10.2.300
Build cuda_10.2_r440.TC440_70.29663091_0
nvcc from within either container:
$ /usr/local/cuda-10.2/bin/nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_28_22:34:44_PST_2021
Cuda compilation tools, release 10.2, V10.2.300
Build cuda_10.2_r440.TC440_70.29663091_0
I am having the exact same issue with Jetpack 4.6.0 on the following system:
- NVIDIA Jetson AGX Xavier [16GB]
* Jetpack 4.6 [L4T 32.6.1]
* NV Power Mode: MODE_30W_ALL - Type: 3
* jetson_stats.service: active
- Libraries:
* CUDA: 10.2.300
* cuDNN: 8.2.1.32
* TensorRT: 8.0.1.6
* Visionworks: 1.6.0.501
* OpenCV: 4.1.1 compiled CUDA: NO
* VPI: ii libnvvpi1 1.1.15 arm64 NVIDIA Vision Programming Interface library
* Vulkan: 1.2.70
docker version:
Client:
Version: 20.10.7
API version: 1.41
Go version: go1.13.8
Git commit: 20.10.7-0ubuntu5~18.04.3
Built: Mon Nov 1 01:04:31 2021
OS/Arch: linux/arm64
Context: default
Experimental: true
Server:
Engine:
Version: 20.10.7
API version: 1.41 (minimum version 1.12)
Go version: go1.13.8
Git commit: 20.10.7-0ubuntu5~18.04.3
Built: Fri Oct 22 00:57:37 2021
OS/Arch: linux/arm64
Experimental: false
containerd:
Version: 1.5.5-0ubuntu3~18.04.2
GitCommit:
runc:
Version: 1.0.1-0ubuntu2~18.04.1
GitCommit:
docker-init:
Version: 0.19.0
GitCommit:
and the following packages installed:
libnvidia-container-tools/stable,now 0.10.0+jetpack arm64 [installed]
libnvidia-container0/stable,now 0.10.0+jetpack arm64 [installed,automatic]
nvidia-container-csv-cuda/stable,now 10.2.460-1 arm64 [installed]
nvidia-container-csv-cudnn/stable,now 8.2.1.32-1+cuda10.2 arm64 [installed]
nvidia-container-csv-tensorrt/stable,now 8.0.1.6-1+cuda10.2 arm64 [installed]
nvidia-container-csv-visionworks/stable,now 1.6.0.501 arm64 [installed]
nvidia-container-runtime/stable,now 3.1.0-1 arm64 [installed,automatic]
nvidia-container-toolkit/stable,now 1.0.1-1 arm64 [installed,automatic]
nvidia-docker2/stable,now 2.2.0-1 all [installed]
nvidia-l4t-3d-core/stable,now 32.6.1-20210916210945 arm64 [installed]
nvidia-l4t-apt-source/stable,now 32.6.1-20210726122859 arm64 [installed]
nvidia-l4t-bootloader/stable,now 32.6.1-20210726122859 arm64 [installed]
nvidia-l4t-camera/stable,now 32.6.1-20210916210945 arm64 [installed]
nvidia-l4t-configs/stable,now 32.6.1-20210726122859 arm64 [installed]
nvidia-l4t-core/stable,now 32.6.1-20210726122859 arm64 [installed]
nvidia-l4t-cuda/stable,now 32.6.1-20210916210945 arm64 [installed]
nvidia-l4t-firmware/stable,now 32.6.1-20210916210945 arm64 [installed]
nvidia-l4t-gputools/stable,now 32.6.1-20210726122859 arm64 [installed]
nvidia-l4t-graphics-demos/stable,now 32.6.1-20210916210945 arm64 [installed]
nvidia-l4t-gstreamer/stable,now 32.6.1-20210916210945 arm64 [installed]
nvidia-l4t-init/stable,now 32.6.1-20210916210945 arm64 [installed]
nvidia-l4t-initrd/stable,now 32.6.1-20210726122859 arm64 [installed]
nvidia-l4t-jetson-io/stable,now 32.6.1-20210726122859 arm64 [installed]
nvidia-l4t-jetson-multimedia-api/stable,now 32.6.1-20210916210945 arm64 [installed]
nvidia-l4t-kernel/stable,now 4.9.253-tegra-32.6.1-20210726122859 arm64 [installed]
nvidia-l4t-kernel-dtbs/stable,now 4.9.253-tegra-32.6.1-20210726122859 arm64 [installed]
nvidia-l4t-kernel-headers/stable,now 4.9.253-tegra-32.6.1-20210726122859 arm64 [installed]
nvidia-l4t-libvulkan/stable,now 32.6.1-20210916210945 arm64 [installed]
nvidia-l4t-multimedia/stable,now 32.6.1-20210916210945 arm64 [installed]
nvidia-l4t-multimedia-utils/stable,now 32.6.1-20210916210945 arm64 [installed]
nvidia-l4t-oem-config/stable,now 32.6.1-20210726122859 arm64 [installed]
nvidia-l4t-tools/stable,now 32.6.1-20210726122859 arm64 [installed]
nvidia-l4t-wayland/stable,now 32.6.1-20210916210945 arm64 [installed]
nvidia-l4t-weston/stable,now 32.6.1-20210916210945 arm64 [installed]
nvidia-l4t-x11/stable,now 32.6.1-20210916210945 arm64 [installed]
nvidia-l4t-xusb-firmware/stable,now 32.6.1-20210726122859 arm64 [installed]
The problem happens when trying to build a custom Docker image using nvcr.io/nvidia/l4t-ml:r32.6.1-py3 as the base and installing the mmcv-full library. Looking within the container, similar to @michael-sbarra, I can also see that libcurand.so.10 exists and is properly linked. Both LD_LIBRARY_PATH and PATH contain the relevant paths as well.
@hvalev Are you sure you are seeing an error of:
could not start driver service: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory
With the packages you list, that should not be possible.
No, the error that I am seeing is:
>>> import torchvision.transforms as transforms
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.6/dist-packages/torchvision-0.10.0a0+300a8a4-py3.6-linux-aarch64.egg/torchvision/__init__.py", line 6, in <module>
from torchvision import models
File "/usr/local/lib/python3.6/dist-packages/torchvision-0.10.0a0+300a8a4-py3.6-linux-aarch64.egg/torchvision/models/__init__.py", line 1, in <module>
from .alexnet import *
File "/usr/local/lib/python3.6/dist-packages/torchvision-0.10.0a0+300a8a4-py3.6-linux-aarch64.egg/torchvision/models/alexnet.py", line 1, in <module>
import torch
File "/usr/local/lib/python3.6/dist-packages/torch/__init__.py", line 196, in <module>
_load_global_deps()
File "/usr/local/lib/python3.6/dist-packages/torch/__init__.py", line 149, in _load_global_deps
ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
File "/usr/lib/python3.6/ctypes/__init__.py", line 348, in __init__
self._handle = _dlopen(self._name, mode)
OSError: libcurand.so.10: cannot open shared object file: No such file or directory
when I'm building the mmcv-full package with pip3.
How does this relate to the original issue? In any case, it does not seem to be related to nvidia-docker itself, but rather one of the images running on top of it. You would have better luck posting in: https://forums.developer.nvidia.com/c/accelerated-computing/nvidia-gpu-cloud-ngc-users/docker-and-nvidia-docker/33
Because it is the exact same issue @michael-sbarra is having. That said, the command pip3 install mmcv-full executes from within the container (if I build the image up to just before that command, then exec into it and run it manually), but it does not run from within the docker build context (docker build .), where it fails with the aforementioned error message. Anyhow, I'm open to recommendations on where I can take this issue.
@hvalev As a matter of interest, which nvidia-container-csv-* packages are available for installation from the Jetpack repositories?
Hi, sorry for the delay. I actually need to correct myself slightly. Indeed, the problem I was experiencing was due to the fact that I had not set the nvidia runtime as the default runtime in /etc/docker/daemon.json. This discrepancy caused the difference between the two contexts: 1) building a Docker image from a Dockerfile, and 2) actually running an already-built image as a container, where I was explicitly specifying the runtime in the docker run command. For reference, I'm pasting the daemon.json config below.
{
"runtimes": {
"nvidia": {
"path": "nvidia-container-runtime",
"runtimeArgs": []
}
},
"default-runtime": "nvidia"
}
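After editing daemon.json it is worth validating the file before restarting Docker, since a malformed config prevents the daemon from starting. A minimal sketch (it checks a local copy here; on the device, point it at /etc/docker/daemon.json and then run `sudo systemctl restart docker`; python3 availability is assumed):

```shell
# Sketch: write the sample config to a temp file and confirm it is valid
# JSON before it ever reaches /etc/docker/daemon.json.
cat > /tmp/daemon.json <<'EOF'
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia"
}
EOF
python3 -m json.tool /tmp/daemon.json >/dev/null && echo "daemon.json OK"
```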
I am closing this as it seems to have been caused by an incorrectly configured Docker daemon. If there are still problems, please reopen this or create a new issue.
dpkg-nvidia.log
1. Issue or feature description
After reinstalling nvidia-docker, unable to run NVIDIA Toolkit containers (Jetson Nano, Jetpack 4.5.1)
2. Steps to reproduce the issue
3. Information to attach (optional if deemed irrelevant)
nvidia-container-cli -k -d /dev/tty info
user@nano:~$ nvidia-container-cli -k -d /dev/tty info
-- WARNING, the following logs are for debugging purposes only --
I0412 19:56:22.542498 6496 nvc.c:372] initializing library context (version=1.3.3, build=bd9fc3f2b642345301cb2e23de07ec5386232317)
I0412 19:56:22.542617 6496 nvc.c:346] using root /
I0412 19:56:22.542633 6496 nvc.c:347] using ldcache /etc/ld.so.cache
I0412 19:56:22.542647 6496 nvc.c:348] using unprivileged user 1000:1000
I0412 19:56:22.542726 6496 nvc.c:389] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0412 19:56:22.543074 6496 nvc.c:391] dxcore initialization failed, continuing assuming a non-WSL environment
W0412 19:56:22.543425 6496 nvc.c:254] failed to detect NVIDIA devices
W0412 19:56:22.543857 6497 nvc.c:269] failed to set inheritable capabilities
W0412 19:56:22.543961 6497 nvc.c:270] skipping kernel modules load due to failure
I0412 19:56:22.544487 6498 driver.c:101] starting driver service
E0412 19:56:22.545001 6498 driver.c:161] could not start driver service: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory
I0412 19:56:22.545295 6496 driver.c:196] driver service terminated successfully
nvidia-container-cli: initialization error: driver error: failed to process request
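The key line is the failure to load libnvidia-ml.so.1. On Jetson this library comes from the L4T driver packages rather than a discrete GPU driver, so a quick host-side check is whether the dynamic loader can find it at all; a miss points at a broken nvidia-l4t-* install. A sketch:

```shell
# Sketch: ask the dynamic loader cache whether it knows a given library.
have_lib() {
  ldconfig -p 2>/dev/null | grep -q "$1"
}
if have_lib 'libnvidia-ml\.so\.1'; then
  echo "libnvidia-ml.so.1 found by the loader"
else
  echo "libnvidia-ml.so.1 missing: consider reinstalling the nvidia-l4t-* packages"
fi
```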
[ ] Kernel version from uname -a:
Linux nano 4.9.201-tegra #1 SMP PREEMPT Fri Feb 19 08:40:32 PST 2021 aarch64 aarch64 aarch64 GNU/Linux
[ ] Any relevant kernel output lines from dmesg
[ ] Driver information from nvidia-smi -a
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X1"
  CUDA Driver Version / Runtime Version          10.2 / 10.2
  CUDA Capability Major/Minor version number:    5.3
  Total amount of global memory:                 3964 MBytes (4156694528 bytes)
  ( 1) Multiprocessors, (128) CUDA Cores/MP:     128 CUDA Cores
  GPU Max Clock rate:                            922 MHz (0.92 GHz)
  Memory Clock rate:                             13 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
  Max dimension size of a grid size (x,y,z):     (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.2, NumDevs = 1 Result = PASS
[ ] Docker version from docker version:
Docker version 19.03.15, build 99e3ed8
[ ] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*':
ii libnvidia-container-tools 1.3.3-1 arm64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container0:arm64 0.9.0~beta.1 arm64 NVIDIA container runtime library
ii libnvidia-container1:arm64 1.3.3-1 arm64 NVIDIA container runtime library
un nvidia-304 (no description available)
un nvidia-340 (no description available)
un nvidia-384 (no description available)
un nvidia-common (no description available)
ii nvidia-container-csv-cuda 10.2.89-1 arm64 Jetpack CUDA CSV file
ii nvidia-container-csv-cudnn 8.0.0.180-1+cuda10. arm64 Jetpack CUDNN CSV file
ii nvidia-container-csv-tensorr 7.1.3.0-1+cuda10.2 arm64 Jetpack TensorRT CSV file
ii nvidia-container-csv-visionw 1.6.0.501 arm64 Jetpack VisionWorks CSV file
ii nvidia-container-runtime 3.4.2-1 arm64 NVIDIA container runtime
un nvidia-container-runtime-hoo (no description available)
ii nvidia-container-toolkit 1.4.2-1 arm64 NVIDIA container runtime hook
un nvidia-cuda-dev (no description available)
ii nvidia-l4t-3d-core 32.5.1-202102190845 arm64 NVIDIA GL EGL Package
ii nvidia-l4t-apt-source 32.5.1-202102190845 arm64 NVIDIA L4T apt source list debian package
ii nvidia-l4t-bootloader 32.5.1-202102190845 arm64 NVIDIA Bootloader Package
ii nvidia-l4t-camera 32.5.1-202102190845 arm64 NVIDIA Camera Package
un nvidia-l4t-ccp-t210ref (no description available)
ii nvidia-l4t-configs 32.5.1-202102190845 arm64 NVIDIA configs debian package
ii nvidia-l4t-core 32.5.1-202102190845 arm64 NVIDIA Core Package
ii nvidia-l4t-cuda 32.5.1-202102190845 arm64 NVIDIA CUDA Package
ii nvidia-l4t-firmware 32.5.1-202102190845 arm64 NVIDIA Firmware Package
ii nvidia-l4t-graphics-demos 32.5.1-202102190845 arm64 NVIDIA graphics demo applications
ii nvidia-l4t-gstreamer 32.5.1-202102190845 arm64 NVIDIA GST Application files
ii nvidia-l4t-init 32.5.1-202102190845 arm64 NVIDIA Init debian package
ii nvidia-l4t-initrd 32.5.1-202102190845 arm64 NVIDIA initrd debian package
ii nvidia-l4t-jetson-io 32.5.1-202102190845 arm64 NVIDIA Jetson.IO debian package
ii nvidia-l4t-jetson-multimedia 32.5.1-202102190845 arm64 NVIDIA Jetson Multimedia API is a collection of lower-level AP
ii nvidia-l4t-kernel 4.9.201-tegra-32.5. arm64 NVIDIA Kernel Package
ii nvidia-l4t-kernel-dtbs 4.9.201-tegra-32.5. arm64 NVIDIA Kernel DTB Package
ii nvidia-l4t-kernel-headers 4.9.201-tegra-32.5. arm64 NVIDIA Linux Tegra Kernel Headers Package
ii nvidia-l4t-multimedia 32.5.1-202102190845 arm64 NVIDIA Multimedia Package
ii nvidia-l4t-multimedia-utils 32.5.1-202102190845 arm64 NVIDIA Multimedia Package
ii nvidia-l4t-oem-config 32.5.1-202102190845 arm64 NVIDIA OEM-Config Package
ii nvidia-l4t-tools 32.5.1-202102190845 arm64 NVIDIA Public Test Tools Package
ii nvidia-l4t-wayland 32.5.1-202102190845 arm64 NVIDIA Wayland Package
ii nvidia-l4t-weston 32.5.1-202102190845 arm64 NVIDIA Weston Package
ii nvidia-l4t-x11 32.5.1-202102190845 arm64 NVIDIA X11 Package
ii nvidia-l4t-xusb-firmware 32.5.1-202102190845 arm64 NVIDIA USB Firmware Package
un nvidia-libopencl1-dev (no description available)
un nvidia-prime (no description available)
[x] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'

||/ Name                         Version             Architecture        Description
+++-============================-===================-===================-==============================================================
un  libgldispatch0-nvidia        (no description available)
(remaining package rows shown above)

[x] NVIDIA container library version from nvidia-container-cli -V

version: 1.3.3
build date: 2021-02-05T13:33+00:00
build revision: bd9fc3f2b642345301cb2e23de07ec5386232317
build compiler: aarch64-linux-gnu-gcc-7 7.5.0
build platform: aarch64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

[ ] NVIDIA container library logs (see troubleshooting)
[ ] Docker command, image and tag used
Description: I started to remove iptables (sudo apt remove iptables), and the package manager removed Docker and nvidia-docker in the process. All of the NVIDIA packages (sudo dpkg-query -l | grep nvidia) appear to be intact, however.
Other than reflashing the Jetson Nano with the SD Card image for Jetpack 4.5.1, is there a way to simply reinstall the correct version of Docker and Nvidia-Docker2 that was used in the original Jetpack image (https://developer.nvidia.com/jetson-nano-sd-card-image.zip)?
Also, can Docker be reinstalled with sudo or does Docker have to be installed as root (sudo su)?
I tried to reinstall Docker by following the instructions at the following links:
Docker version 19.03 is working on the Nano (verified by running "sudo docker run hello-world"). Nvidia-Docker (NVIDIA Container Toolkit) was also installed successfully.
However, when I verified the install with the following command, I ran into a runtime error (a driver issue?):
Command: (Cuda 10) docker run --gpus all -it --rm --network host --volume ~/nvdli-data:/nvdli-nano/data --device /dev/video0 nvcr.io/nvidia/dli/dli-nano-ai:v2.0.1-r32.5.0
Error: docker: Error response from daemon: OCI runtime create failed: container_linux.go:367: starting container process caused: process_linux.go:495: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: driver error: failed to process request: unknown.
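Before reinstalling anything, it can be worth checking that Docker still has the nvidia runtime registered, since removing the packages can also remove the daemon configuration. On Jetson, the Jetpack image normally ships an /etc/docker/daemon.json along these lines (a hedged example of the typical contents, not output captured from this machine):

```json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```

If this file is missing or no longer lists the nvidia runtime, docker run --gpus all / --runtime=nvidia will fail until it is restored and the daemon is restarted (sudo systemctl restart docker).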
Nvidia's Response: "The nvidia-docker2_2.2.0-1_all.deb is included in the JetPack. Please use SDKmanager to download it (click reflash just for downloading the package)."
Can the debian package be downloaded/installed directly from the Nano using Apt package manager?
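As a hedged sketch of an answer (not an official NVIDIA procedure): on Jetpack 4.5.x the L4T apt repository configured by the nvidia-l4t-apt-source package also serves the container packages, so in principle they can be reinstalled with apt rather than by reflashing. One would first check what the repo offers with apt-cache policy and then pin the versions explicitly; the versions and the pinned_install_cmd helper below are illustrative assumptions, not guaranteed to match any particular repo.

```shell
# Read-only check of which versions the configured repos can serve:
#
#   apt-cache policy nvidia-docker2 nvidia-container-toolkit
#
# Then install with explicit version pins, e.g. (example versions):
#
#   sudo apt-get install nvidia-docker2=2.2.0-1 nvidia-container-toolkit=1.4.2-1
#
# Small helper that assembles such a pinned install command from
# name=version arguments, so the pins can be reviewed before running:
pinned_install_cmd() {
    printf 'sudo apt-get install'
    for pkg in "$@"; do
        printf ' %s' "$pkg"
    done
    printf '\n'
}

pinned_install_cmd nvidia-docker2=2.2.0-1 nvidia-container-toolkit=1.4.2-1
```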
user@nano:~$ cat /etc/apt/sources.list.d/nvidia-container-runtime.list*
deb https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/$(ARCH) /
deb https://nvidia.github.io/libnvidia-container/experimental/ubuntu18.04/$(ARCH) /
deb https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/$(ARCH) /
deb https://nvidia.github.io/nvidia-container-runtime/experimental/ubuntu18.04/$(ARCH) /
user@nano:~$ cat /etc/apt/sources.list.d/nvidia-docker.list
deb https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/$(ARCH) /
deb https://nvidia.github.io/libnvidia-container/experimental/ubuntu18.04/$(ARCH) /
deb https://nvidia.github.io/nvidia-container-runtime/stable/ubuntu18.04/$(ARCH) /
deb https://nvidia.github.io/nvidia-container-runtime/experimental/ubuntu18.04/$(ARCH) /
deb https://nvidia.github.io/nvidia-docker/ubuntu18.04/$(ARCH) /
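Given the advice in this thread that the nvidia.github.io repositories should not be used for the container stack on Jetson devices, a minimal sketch of taking those list files out of play; disable_nvidia_lists is a hypothetical helper, the file names are the ones shown above, and the real run needs root:

```shell
# Rename the nvidia.github.io apt list files in the given directory so
# apt ignores them, leaving only the Jetpack/L4T sources active. The
# directory is a parameter so the commands can be rehearsed against a
# scratch copy before touching /etc.
disable_nvidia_lists() {
    dir=$1
    for f in "$dir"/nvidia-container-runtime.list "$dir"/nvidia-docker.list; do
        if [ -e "$f" ]; then
            mv "$f" "$f.disabled"
        fi
    done
}

# On the Nano itself (as root):
#   disable_nvidia_lists /etc/apt/sources.list.d
#   apt-get update
```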