dusty-nv / jetson-containers

Machine Learning Containers for NVIDIA Jetson and JetPack-L4T

"ImportError: libnvrm.so" in OpenCV container on fresh Jetson Nano install #456

Closed: JTylerBoylan closed this issue 3 months ago

JTylerBoylan commented 3 months ago

I am trying to run the OpenCV jetson-containers image on my brand-new Jetson Nano Developer Kit.

The exact steps I took from flashing the Jetson:

  1. Flashed the Jetson Nano with the image from the official Developer Kit Guide.
  2. Updated and upgraded packages from the terminal.
  3. Added $USER to the docker group.
  4. Rebooted.
  5. Cloned the jetson-containers repo and installed its dependencies.
  6. Set the docker default runtime to nvidia (see the sanity check sketched after this list).
  7. Ran ./run.sh $(./autotag opencv).
  8. In the container, ran python3 -c "import cv2".
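
A minimal sanity check for the two settings everything below depends on, the host L4T release and the docker default runtime, might look like this (only stock L4T files and docker itself are assumed):

# Host L4T / JetPack release, as read by autotag
cat /etc/nv_tegra_release
# Should report "nvidia" after setting the default runtime in step 6
docker info 2>/dev/null | grep -i 'default runtime'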

Output:

nvidia@jetson-nano:~/jetson-containers$ ./run.sh $(./autotag opencv)
Namespace(disable=[''], output='/tmp/autotag', packages=['opencv'], prefer=['local', 'registry', 'build'], quiet=False, user='dustynv', verbose=False)
-- L4T_VERSION=32.7.4  JETPACK_VERSION=4.6.4  CUDA_VERSION=10.2.300
-- Finding compatible container image for ['opencv']

Found compatible container dustynv/opencv:4.8.1-r36.2.0 (2023-12-07, 5.1GB) - would you like to pull it? [Y/n] Y
dustynv/opencv:4.8.1-r36.2.0
localuser:root being added to access control list
xauth:  file /tmp/.docker.xauth does not exist
+ docker run --runtime nvidia -it --rm --network host --volume /tmp/argus_socket:/tmp/argus_socket --volume /etc/enctune.conf:/etc/enctune.conf --volume /etc/nv_tegra_release:/etc/nv_tegra_release --volume /tmp/nv_jetson_model:/tmp/nv_jetson_model --volume /var/run/dbus:/var/run/dbus --volume /var/run/avahi-daemon/socket:/var/run/avahi-daemon/socket --volume /var/run/docker.sock:/var/run/docker.sock --volume /home/nvidia/jetson-containers/data:/data --device /dev/snd --device /dev/bus/usb -e DISPLAY=:0 -v /tmp/.X11-unix/:/tmp/.X11-unix -v /tmp/.docker.xauth:/tmp/.docker.xauth -e XAUTHORITY=/tmp/.docker.xauth dustynv/opencv:4.8.1-r36.2.0
Unable to find image 'dustynv/opencv:4.8.1-r36.2.0' locally
4.8.1-r36.2.0: Pulling from dustynv/opencv
bfbe77e41a78: Pull complete 
ef8924b3a5a5: Pull complete 
a1644e8aa54d: Pull complete 
61787b9cb382: Pull complete 
9eed1314ef49: Pull complete 
b20423b6e2ce: Pull complete 
a20029456556: Pull complete 
c2b4fe089356: Pull complete 
9f7f75397d9a: Pull complete 
95fe6423d877: Pull complete 
998d899205bf: Pull complete 
2c17d83b3ed9: Pull complete 
d828ec4b9037: Pull complete 
c7412c95f700: Pull complete 
4fbcbf5d2a7b: Pull complete 
75b774db09e6: Pull complete 
4528152c3da5: Pull complete 
994996c438c1: Pull complete 
Digest: sha256:03b8026b4d2791deddd31aabdbbb8ccdf6f4422e5941b86be737e247455845e7
Status: Downloaded newer image for dustynv/opencv:4.8.1-r36.2.0
root@jetson-nano:/# python3 -c "import cv2"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/cv2/__init__.py", line 181, in <module>
    bootstrap()
  File "/usr/local/lib/python3.10/dist-packages/cv2/__init__.py", line 153, in bootstrap
    native_module = importlib.import_module("cv2")
  File "/usr/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
ImportError: libnvrm.so: cannot open shared object file: No such file or directory

The library does exist in the container:

root@jetson-nano:/# ls /usr/lib/aarch64-linux-gnu/tegra | grep libnvrm
libnvrm.so
libnvrm_gpu.so
libnvrm_graphics.so
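
Worth noting: the tag pulled above is r36.2.0, while the host reported L4T r32.7.4, so the tegra libraries inside the container may not be the ones this release's loader setup expects. One way to tell "library present but not on the loader path" apart from "wrong container for this L4T release" is to inspect the loader configuration; a sketch (the conf filename is an assumption and may differ per release):

# Is libnvrm.so in the dynamic loader's cache at all?
ldconfig -p | grep libnvrm || echo "libnvrm not in ldconfig cache"
# Directories searched ahead of the cache
echo "$LD_LIBRARY_PATH"
# On L4T the tegra directory is normally registered via a conf file
cat /etc/ld.so.conf.d/nvidia-tegra.conf 2>/dev/null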

How do I fix this?

Thank you!

dusty-nv commented 3 months ago

Hi @JTylerBoylan, to help debug this, can you check that you are able to run other containers successfully and use CUDA in them, like l4t-jetpack (try running some CUDA samples like deviceQuery/bandwidthTest/vectorAdd) and l4t-pytorch?
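
A minimal version of the l4t-pytorch check might look like this (same run.sh/autotag workflow as above; whether torch.cuda reports True is the interesting part):

# Launch the PyTorch container the same way as the opencv one
./run.sh $(./autotag l4t-pytorch)
# Then, inside the container: does PyTorch see the GPU?
python3 -c "import torch; print(torch.cuda.is_available())"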

JTylerBoylan commented 3 months ago

I tried the deviceQuery sample and it failed:

root@jetson-nano:/# cp -r /usr/local/cuda/samples /tmp
root@jetson-nano:/# cd /tmp/samples/1_Utilities/deviceQuery
root@jetson-nano:/tmp/samples/1_Utilities/deviceQuery# make
/usr/local/cuda-11.4/bin/nvcc -ccbin g++ -I../../common/inc  -m64    --threads 0 --std=c++11 -gencode arch=compute_53,code=sm_53 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_62,code=sm_62 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_87,code=sm_87 -gencode arch=compute_87,code=compute_87 -o deviceQuery.o -c deviceQuery.cpp
/usr/local/cuda-11.4/bin/nvcc -ccbin g++   -m64      -gencode arch=compute_53,code=sm_53 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_62,code=sm_62 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_72,code=sm_72 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_87,code=sm_87 -gencode arch=compute_87,code=compute_87 -o deviceQuery deviceQuery.o 
mkdir -p ../../bin/aarch64/linux/release
cp deviceQuery ../../bin/aarch64/linux/release
root@jetson-nano:/tmp/samples/1_Utilities/deviceQuery# ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

cudaGetDeviceCount returned 3
-> initialization error
Result = FAIL
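
For reference, return code 3 from cudaGetDeviceCount maps to cudaErrorInitializationError in the CUDA runtime headers, which on Jetson typically means the container never reached the GPU driver (for example, the nvidia runtime was not used, or the host driver stack is broken). Assuming a standard CUDA toolkit layout inside the container, the mapping can be confirmed with:

# cudaError enum values live in the runtime's driver_types.h
grep -n "cudaErrorInitializationError" /usr/local/cuda/include/driver_types.h
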
JTylerBoylan commented 3 months ago

Another note that may be relevant: when upgrading (sudo apt upgrade -y), I was asked about overriding some files with the package maintainer's versions, and I entered 'Y' to override. Later in the upgrade I got warnings and errors that looked like this:

WARNING: missing /lib/modules/4.9.253-tegra Ensure all necessary drivers are built into the linux image! 
depmod: ERROR: could not open directory /lib/modules/4.9.253-tegra: No such file or directory

Not sure if that could be part of the issue.
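
A quick way to see whether that upgrade left the module tree out of sync with the running kernel (standard commands, nothing jetson-containers-specific):

# Running kernel version vs. module directories actually on disk
uname -r
ls /lib/modules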

dusty-nv commented 3 months ago

OK yeah, if you did an apt upgrade, it probably broke your docker daemon due to this recent upstream issue: https://github.com/dusty-nv/jetson-inference/issues/1795
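
To check whether the daemon is in that broken state, standard systemd commands (not specific to the linked issue) should show it:

# Is the docker daemon running, and what do its recent logs say?
sudo systemctl status docker --no-pager
sudo journalctl -u docker --no-pager -n 50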

JTylerBoylan commented 3 months ago

OK, I will try re-flashing and not upgrading.

Another note: the deviceQuery test passed when I used the base image nvcr.io/nvidia/l4t-base:r32.7.1.

Host release:

nvidia@jetson-nano:~$ cat /etc/nv_tegra_release 
# R32 (release), REVISION: 7.4, GCID: 33514132, BOARD: t210ref, EABI: aarch64, DATE: Fri Jun  9 04:25:08 UTC 2023

In container:

root@jetson-nano:/samples/1_Utilities/deviceQuery# ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA Tegra X1"
  CUDA Driver Version / Runtime Version          10.2 / 10.2
  CUDA Capability Major/Minor version number:    5.3
  Total amount of global memory:                 3956 MBytes (4148043776 bytes)
  ( 1) Multiprocessors, (128) CUDA Cores/MP:     128 CUDA Cores
  GPU Max Clock rate:                            922 MHz (0.92 GHz)
  Memory Clock rate:                             13 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 262144 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            Yes
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            No
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 0 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.2, NumDevs = 1
Result = PASS

dusty-nv commented 3 months ago

That's interesting about l4t-base; I hadn't put it together that you were on JetPack 4. All the CUDA stuff is mounted from the host into l4t-base on JP4. I'm not sure why that works and the others don't, though.

JTylerBoylan commented 3 months ago

The re-flashing didn't work, but I found the solution: the container works when I specify the image directly, without autotag:

./run.sh dustynv/opencv:r32.7.1

I guess autotag didn't pick up the correct version.

Thanks for the help!

dusty-nv commented 3 months ago

ooo, okay thanks - you're right, it tried to run dustynv/opencv:4.8.1-r36.2.0 for some reason. Here, autotag still works correctly on JP4... will look into it