dusty-nv / jetson-containers

Machine Learning Containers for NVIDIA Jetson and JetPack-L4T
MIT License

Cannot build image due to `ImportError: libcublas.so.10: cannot open shared object file: No such file or directory` #222

Open ggoretkin-bdai opened 1 year ago

ggoretkin-bdai commented 1 year ago

(Same error message as https://github.com/dusty-nv/jetson-containers/issues/159 , but occurring during image build time)

I am running into the error message

ImportError: libcublas.so.10: cannot open shared object file: No such file or directory

while building an image, since the Dockerfile runs `import cv2`.

I have not yet checked whether the runtime Docker environment provides libcublas.so.10. I am trying to build the image on an x86 host that doesn't have NVIDIA hardware. Is this possible?
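For reference, running these aarch64 images on an x86 host at all requires QEMU user-mode emulation to be registered via binfmt_misc. A minimal sketch, assuming the multiarch/qemu-user-static image is acceptable on your host:

$ sudo apt-get install -y qemu-user-static binfmt-support
$ docker run --rm --privileged multiarch/qemu-user-static --reset -p yes
$ docker run --platform linux/aarch64 --rm -it nvcr.io/nvidia/l4t-base:r32.7.1 uname -m   # should print aarch64

Note that this only emulates the CPU; it does not provide the CUDA libraries that --runtime nvidia mounts in on a Jetson.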

ggoretkin-bdai commented 1 year ago

It seems like the l4t images are missing these files:

$ docker run --platform linux/aarch64 --rm -it nvcr.io/nvidia/l4t-base:r32.5.0 find / -name "libcublas*" 2>&1 | grep -v "Permission denied"

$ docker run --platform linux/aarch64 --rm -it nvcr.io/nvidia/l4t-base:r32.7.1 find / -name "libcublas*" 2>&1 | grep -v "Permission denied"
/usr/local/cuda-10.2/targets/aarch64-linux/lib/stubs/libcublasLt.so
/usr/local/cuda-10.2/targets/aarch64-linux/lib/stubs/libcublas.so

whereas if I run it directly on a native r32.7.2 installation:

$ find / -name "libcublas*" 2>&1 | grep -v "Permission denied"
/usr/local/cuda-10.2/targets/aarch64-linux/lib/libcublas.so
/usr/local/cuda-10.2/targets/aarch64-linux/lib/libcublas.so.10
/usr/local/cuda-10.2/targets/aarch64-linux/lib/libcublas.so.10.2.3.300
/usr/local/cuda-10.2/targets/aarch64-linux/lib/libcublasLt.so
/usr/local/cuda-10.2/targets/aarch64-linux/lib/libcublasLt.so.10
/usr/local/cuda-10.2/targets/aarch64-linux/lib/libcublasLt.so.10.2.3.300
/usr/local/cuda-10.2/targets/aarch64-linux/lib/stubs/libcublas.so
/usr/local/cuda-10.2/targets/aarch64-linux/lib/stubs/libcublasLt.so
/usr/share/doc/libcublas-dev
/usr/share/doc/libcublas10
/var/lib/dpkg/info/libcublas-dev.list
/var/lib/dpkg/info/libcublas-dev.md5sums
/var/lib/dpkg/info/libcublas10.list
/var/lib/dpkg/info/libcublas10.md5sums

r32.5.0 is the base image used here: https://github.com/dusty-nv/jetson-containers/blob/53a882c1d373301d2998f3026a1028ee36611853/Dockerfile.ros.humble#L5

ggoretkin-bdai commented 1 year ago

Ah, sorry, the last message is not quite relevant. This is the valid test (it needs the NVIDIA runtime):

$ sudo docker run --runtime=nvidia --platform linux/aarch64 --rm -it nvcr.io/nvidia/l4t-base:r32.7.1 find / -name "libcublas*" 2>&1 | grep -v "Permission denied"
/usr/local/cuda-10.2/targets/aarch64-linux/lib/libcublas.so
/usr/local/cuda-10.2/targets/aarch64-linux/lib/libcublas.so.10
/usr/local/cuda-10.2/targets/aarch64-linux/lib/libcublas.so.10.2.3.300
/usr/local/cuda-10.2/targets/aarch64-linux/lib/libcublasLt.so
/usr/local/cuda-10.2/targets/aarch64-linux/lib/libcublasLt.so.10
/usr/local/cuda-10.2/targets/aarch64-linux/lib/libcublasLt.so.10.2.3.300
/usr/local/cuda-10.2/targets/aarch64-linux/lib/stubs/libcublas.so
/usr/local/cuda-10.2/targets/aarch64-linux/lib/stubs/libcublasLt.so

But then it is still unclear to me how to build the docker image.

ggoretkin-bdai commented 1 year ago

From https://github.com/dusty-nv/jetson-containers/issues/205#issuecomment-1310393417 it seems that the intention is not that this image can be built on x86. Building it on the Jetson means that the default docker runtime is nvidia, and this error does not show up.
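For reference, setting the default runtime is roughly what the setup doc linked elsewhere in this repo describes (verify against your JetPack version): add "default-runtime": "nvidia" to /etc/docker/daemon.json and restart Docker.

{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}

$ sudo systemctl restart docker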

dusty-nv commented 1 year ago

That's correct, I've only built these containers natively on Jetson (not x86). Sorry for the confusion.


ByerRA commented 1 year ago

I've run into this issue today when trying to build the "l4t-ml" image on my Jetson Nano. When torch is installed and the build tries to test the installation, I get the same error message.

dusty-nv commented 1 year ago

@ByerRA is your default docker runtime set to NVIDIA? https://github.com/dusty-nv/jetson-containers/blob/master/docs/setup.md#docker-default-runtime

ByerRA commented 1 year ago

@ByerRA is your default docker runtime set to NVIDIA? https://github.com/dusty-nv/jetson-containers/blob/master/docs/setup.md#docker-default-runtime

Yes, NVIDIA is set as the default runtime.

$ sudo docker info | grep -i "Default Runtime"
 Default Runtime: nvidia

I can confirm that the nvcr.io/nvidia/l4t-base:r32.7.1 image, which is the base image selected by the build script, does not contain the libcublas.so.10 library.

dusty-nv commented 1 year ago

@ByerRA on JetPack 4, the CUDA/cuDNN/TensorRT libraries get mounted into container by --runtime nvidia from the host device - do you have libcublas.so.10 under /usr/local/cuda/lib64/ ?

If you start l4t-base like this, it should be there - otherwise there's an issue with your NVIDIA Container Runtime:

sudo docker run -it --rm --net=host --runtime nvidia -e DISPLAY=$DISPLAY -v /tmp/.X11-unix/:/tmp/.X11-unix nvcr.io/nvidia/l4t-base:r32.7.1
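A quick non-interactive variant of that check (a sketch, assuming JetPack 4 / L4T r32.x, where the CUDA libraries are mounted in from the host):

sudo docker run --rm --runtime nvidia nvcr.io/nvidia/l4t-base:r32.7.1 ls -l /usr/local/cuda/lib64/libcublas.so.10
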
ByerRA commented 1 year ago

@ByerRA on JetPack 4, the CUDA/cuDNN/TensorRT libraries get mounted into container by --runtime nvidia from the host device - do you have libcublas.so.10 under /usr/local/cuda/lib64/ ?

If you start l4t-base like this, it should be there - otherwise there's an issue with your NVIDIA Container Runtime:

sudo docker run -it --rm --net=host --runtime nvidia -e DISPLAY=$DISPLAY -v /tmp/.X11-unix/:/tmp/.X11-unix nvcr.io/nvidia/l4t-base:r32.7.1

From my Jetson Nano, if I do the following...

$ sudo docker run -it --rm --net=host --runtime nvidia -e DISPLAY=$DISPLAY -v /tmp/.X11-unix/:/tmp/.X11-unix nvcr.io/nvidia/l4t-base:r32.7.1

And I check the "/usr/local/cuda/lib64/" directory in the "nvcr.io/nvidia/l4t-base:r32.7.1" container, this is what I get...

$ ls -afl /usr/local/cuda/lib64/
total 1544
drwxr-xr-x 1 root root 4096 Dec 15 2021 .
drwxr-xr-x 1 root root 4096 Dec 15 2021 ..
drwxr-xr-x 2 root root 4096 Dec 15 2021 stubs
-rw-r--r-- 1 root root 888074 Dec 15 2021 libcudart_static.a
-rw-r--r-- 1 root root 679636 Dec 15 2021 libcudadevrt.a
$

If I follow the directions and do the following...

$ sudo apt-get update && sudo apt-get install git python3-pip
$ git clone https://github.com/dusty-nv/jetson-containers
$ cd jetson-containers
$ pip3 install -r requirements.txt
$ ./build.sh --name=jetson-l4t-ml l4t-ml

The build executes and halts with the following error...

Step 8/11 : RUN python3 -c 'import torch; print(f"PyTorch version: {torch.__version__}"); print(f"CUDA available: {torch.cuda.is_available()}"); print(f"cuDNN version: {torch.backends.cudnn.version()}"); print(torch.__config__.show());'
 ---> Running in 24c4defcacf1
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/torch/__init__.py", line 196, in <module>
    _load_global_deps()
  File "/usr/local/lib/python3.6/dist-packages/torch/__init__.py", line 149, in _load_global_deps
    ctypes.CDLL(lib_path, mode=ctypes.RTLD_GLOBAL)
  File "/usr/lib/python3.6/ctypes/__init__.py", line 348, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libcurand.so.10: cannot open shared object file: No such file or directory
The command '/bin/sh -c python3 -c 'import torch; print(f"PyTorch version: {torch.__version__}"); print(f"CUDA available: {torch.cuda.is_available()}"); print(f"cuDNN version: {torch.backends.cudnn.version()}"); print(torch.__config__.show());'' returned a non-zero code: 1

dusty-nv commented 1 year ago

It's not building l4t-ml because the underlying mechanism that mounts your CUDA runtime libraries into the containers isn't working.

What does ls -afl /usr/local/cuda/lib64/ show outside of the container?

Had you made any other changes or upgrades to the OS, CUDA Toolkit on your device or the nvidia-container-runtime?
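One thing that may help narrow it down (an assumption here: that your install uses the CSV-based mounts that JetPack 4's nvidia-container-runtime normally reads from /etc/nvidia-container-runtime/host-files-for-container.d/) is to check that those lists exist and actually mention the missing libraries:

$ ls /etc/nvidia-container-runtime/host-files-for-container.d/
$ grep -r libcurand /etc/nvidia-container-runtime/host-files-for-container.d/
$ dpkg -l | grep nvidia-container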

ByerRA commented 1 year ago

It's not building l4t-ml because the underlying mechanism that mounts your CUDA runtime libraries into the containers isn't working.

What does ls -afl /usr/local/cuda/lib64/ show outside of the container?

Had you made any other changes or upgrades to the OS, CUDA Toolkit on your device or the nvidia-container-runtime?

On my Jetson Nano with JetPack 4.6.4 [L4T 32.7.4] that I'm trying to build on, I get the following...

$ ls -afl /usr/local/cuda/lib64/ total 2259952 -rw-r--r-- 1 root root 5722536 Mar 1 2021 libnppicc_static.a lrwxrwxrwx 1 root root 15 Mar 1 2021 libcublas.so -> libcublas.so.10 lrwxrwxrwx 1 root root 18 Mar 1 2021 libnvToolsExt.so -> libnvToolsExt.so.1 lrwxrwxrwx 1 root root 16 Mar 1 2021 libnppidei.so -> libnppidei.so.10 -rw-r--r-- 1 root root 4794168 Mar 1 2021 libnvrtc-builtins.so.10.2.300 lrwxrwxrwx 1 root root 23 Mar 1 2021 libnppitc.so.10 -> libnppitc.so.10.2.1.300 drwxr-xr-x 3 root root 4096 Aug 23 15:49 ./ lrwxrwxrwx 1 root root 17 Mar 1 2021 libcudart.so -> libcudart.so.10.2 lrwxrwxrwx 1 root root 25 Mar 1 2021 libnvrtc-builtins.so -> libnvrtc-builtins.so.10.2 drwxr-xr-x 4 root root 4096 Aug 23 15:34 ../ lrwxrwxrwx 1 root root 15 Mar 1 2021 libnppisu.so -> libnppisu.so.10 lrwxrwxrwx 1 root root 16 Mar 1 2021 libnvrtc.so -> libnvrtc.so.10.2 lrwxrwxrwx 1 root root 22 Mar 1 2021 libnppif.so.10 -> libnppif.so.10.2.1.300 -rw-r--r-- 1 root root 1535464 Mar 1 2021 libcuinj64.so.10.2.300 -rw-r--r-- 1 root root 210524874 Mar 1 2021 libcufft_static_nocallback.a lrwxrwxrwx 1 root root 17 Mar 1 2021 libcublasLt.so -> libcublasLt.so.10 lrwxrwxrwx 1 root root 16 Mar 1 2021 libnvgraph.so -> libnvgraph.so.10 lrwxrwxrwx 1 root root 23 Mar 1 2021 libnppicc.so.10 -> libnppicc.so.10.2.1.300 lrwxrwxrwx 1 root root 15 Mar 1 2021 libnppist.so -> libnppist.so.10 lrwxrwxrwx 1 root root 23 Mar 1 2021 libnvblas.so.10 -> libnvblas.so.10.2.3.300 lrwxrwxrwx 1 root root 17 Mar 1 2021 libcusolver.so -> libcusolver.so.10 -rw-r--r-- 1 root root 490664 Mar 1 2021 libcudart.so.10.2.300 lrwxrwxrwx 1 root root 23 Mar 1 2021 libnppisu.so.10 -> libnppisu.so.10.2.1.300 -rw-r--r-- 1 root root 165012616 Mar 1 2021 libnvgraph.so.10.2.300 -rw-r--r-- 1 root root 141252584 Mar 1 2021 libcusparse.so.10.3.1.300 lrwxrwxrwx 1 root root 22 Mar 1 2021 libnppig.so.10 -> libnppig.so.10.2.1.300 -rw-r--r-- 1 root root 33242 Mar 1 2021 libculibos.a -rw-r--r-- 1 root root 44088 Mar 1 2021 libnvToolsExt.so.1.0.0 lrwxrwxrwx 1 root root 20 Mar 1 2021 libcupti.so.10.2 -> libcupti.so.10.2.175 -rw-r--r-- 1 root root 909274 Mar 1 2021 libmetis_static.a lrwxrwxrwx 1 root root 20 Mar 1 2021 libnvrtc.so.10.2 -> libnvrtc.so.10.2.300 -rw-r--r-- 1 root root 3112480 Mar 1 2021 libnppitc.so.10.2.1.300 -rw-r--r-- 1 root root 10690508 Mar 1 2021 libnpps_static.a lrwxrwxrwx 1 root root 14 Mar 1 2021 libnppig.so -> libnppig.so.10 lrwxrwxrwx 1 root root 25 Mar 1 2021 libcusparse.so.10 -> libcusparse.so.10.3.1.300 -rw-r--r-- 1 root root 62767380 Mar 1 2021 libcurand_static.a -rw-r--r-- 1 root root 149512102 Mar 1 2021 libcusparse_static.a -rw-r--r-- 1 root root 81096256 Mar 1 2021 libcublas.so.10.2.3.300 lrwxrwxrwx 1 root root 29 Mar 1 2021 libnvrtc-builtins.so.10.2 -> libnvrtc-builtins.so.10.2.300 -rw-r--r-- 1 root root 192531512 Mar 1 2021 libcufft_static.a -rw-r--r-- 1 root root 11458 Mar 1 2021 libnppisu_static.a -rw-r--r-- 1 root root 1093680 Mar 1 2021 libnppicom_static.a -rw-r--r-- 1 root root 10762478 Mar 1 2021 libnppidei_static.a -rw-r--r-- 1 root root 62698584 Mar 1 2021 libcurand.so.10.1.2.300 -rw-r--r-- 1 root root 8175688 Mar 1 2021 libnppidei.so.10.2.1.300 -rw-r--r-- 1 root root 31970 Mar 1 2021 libcufftw_static.a -rw-r--r-- 1 root root 33562824 Mar 1 2021 libcublasLt.so.10.2.3.300 lrwxrwxrwx 1 root root 23 Mar 1 2021 libnppist.so.10 -> libnppist.so.10.2.1.300 -rw-r--r-- 1 root root 486576 Mar 1 2021 libnppisu.so.10.2.1.300 -rw-r--r-- 1 root root 4914920 Mar 1 2021 libnppicc.so.10.2.1.300 -rw-r--r-- 1 root root 11509472 Mar 1 2021 
libnppial.so.10.2.1.300 lrwxrwxrwx 1 root root 13 Mar 1 2021 libnpps.so -> libnpps.so.10 -rw-r--r-- 1 root root 31432462 Mar 1 2021 libnppig_static.a -rw-r--r-- 1 root root 14410930 Mar 1 2021 libnppial_static.a lrwxrwxrwx 1 root root 14 Mar 1 2021 libcufft.so -> libcufft.so.10 lrwxrwxrwx 1 root root 22 Mar 1 2021 libnppim.so.10 -> libnppim.so.10.2.1.300 lrwxrwxrwx 1 root root 18 Mar 1 2021 libcuinj64.so -> libcuinj64.so.10.2 -rw-r--r-- 1 root root 7396476 Mar 1 2021 libnppim_static.a -rw-r--r-- 1 root root 503192 Mar 1 2021 libcufftw.so.10.1.2.300 -rw-r--r-- 1 root root 58471042 Mar 1 2021 libnppif_static.a lrwxrwxrwx 1 root root 17 Mar 1 2021 libcusparse.so -> libcusparse.so.10 -rw-r--r-- 1 root root 540232 Mar 1 2021 libnvblas.so.10.2.3.300 -rw-r--r-- 1 root root 26846 Mar 1 2021 libnppc_static.a lrwxrwxrwx 1 root root 15 Mar 1 2021 libcurand.so -> libcurand.so.10 -rw-r--r-- 1 root root 1453728 Mar 1 2021 libnppicom.so.10.2.1.300 lrwxrwxrwx 1 root root 23 Mar 1 2021 libcublas.so.10 -> libcublas.so.10.2.3.300 -rw-r--r-- 1 root root 20877336 Mar 1 2021 libnppist.so.10.2.1.300 -rw-r--r-- 1 root root 28761920 Mar 1 2021 libnppig.so.10.2.1.300 -rw-r--r-- 1 root root 3205362 Mar 1 2021 libnppitc_static.a lrwxrwxrwx 1 root root 24 Mar 1 2021 libnppicom.so.10 -> libnppicom.so.10.2.1.300 -rw-r--r-- 1 root root 9539760 Mar 1 2021 libnpps.so.10.2.1.300 lrwxrwxrwx 1 root root 22 Mar 1 2021 libcuinj64.so.10.2 -> libcuinj64.so.10.2.300 lrwxrwxrwx 1 root root 25 Mar 1 2021 libcublasLt.so.10 -> libcublasLt.so.10.2.3.300 lrwxrwxrwx 1 root root 23 Mar 1 2021 libcurand.so.10 -> libcurand.so.10.1.2.300 -rw-r--r-- 1 root root 123895098 Mar 1 2021 libcusolver_static.a lrwxrwxrwx 1 root root 22 Mar 1 2021 libnvToolsExt.so.1 -> libnvToolsExt.so.1.0.0 lrwxrwxrwx 1 root root 24 Mar 1 2021 libnppidei.so.10 -> libnppidei.so.10.2.1.300 -rw-r--r-- 1 root root 54362944 Mar 1 2021 libnppif.so.10.2.1.300 -rw-r--r-- 1 root root 20432800 Mar 1 2021 libnvrtc.so.10.2.300 -rw-r--r-- 1 root root 23399160 Mar 1 2021 libnppist_static.a lrwxrwxrwx 1 root root 21 Mar 1 2021 libcudart.so.10.2 -> libcudart.so.10.2.300 -rw-r--r-- 1 root root 96903266 Mar 1 2021 libcublas_static.a lrwxrwxrwx 1 root root 23 Mar 1 2021 libnppial.so.10 -> libnppial.so.10.2.1.300 lrwxrwxrwx 1 root root 15 Mar 1 2021 libnppitc.so -> libnppitc.so.10 lrwxrwxrwx 1 root root 21 Mar 1 2021 libnpps.so.10 -> libnpps.so.10.2.1.300 -rw-r--r-- 1 root root 7163640 Mar 1 2021 libnppim.so.10.2.1.300 lrwxrwxrwx 1 root root 25 Mar 1 2021 libcusolver.so.10 -> libcusolver.so.10.3.0.300 lrwxrwxrwx 1 root root 23 Mar 1 2021 libcufftw.so.10 -> libcufftw.so.10.1.2.300 lrwxrwxrwx 1 root root 15 Mar 1 2021 libcufftw.so -> libcufftw.so.10 lrwxrwxrwx 1 root root 22 Mar 1 2021 libcufft.so.10 -> libcufft.so.10.1.2.300 lrwxrwxrwx 1 root root 22 Mar 1 2021 libnvgraph.so.10 -> libnvgraph.so.10.2.300 lrwxrwxrwx 1 root root 16 Mar 1 2021 libcupti.so -> libcupti.so.10.2 -rw-r--r-- 1 root root 4526616 Mar 1 2021 libcupti.so.10.2.175 -rw-r--r-- 1 root root 8319056 Mar 1 2021 liblapack_static.a lrwxrwxrwx 1 root root 14 Mar 1 2021 libnppim.so -> libnppim.so.10 -rw-r--r-- 1 root root 888074 Mar 1 2021 libcudart_static.a -rw-r--r-- 1 root root 168141386 Mar 1 2021 libnvgraph_static.a lrwxrwxrwx 1 root root 15 Mar 1 2021 libnvblas.so -> libnvblas.so.10 -rw-r--r-- 1 root root 218927328 Mar 1 2021 libcusolver.so.10.3.0.300 lrwxrwxrwx 1 root root 16 Mar 1 2021 libnppicom.so -> libnppicom.so.10 -rw-r--r-- 1 root root 503184 Mar 1 2021 libnppc.so.10.2.1.300 lrwxrwxrwx 1 root root 15 Mar 1 2021 
libnppial.so -> libnppial.so.10 drwxr-xr-x 2 root root 4096 Aug 23 15:49 stubs/ lrwxrwxrwx 1 root root 14 Mar 1 2021 libnppif.so -> libnppif.so.10 -rw-r--r-- 1 root root 1096016 Mar 1 2021 libnvperf_target.so -rw-r--r-- 1 root root 201494704 Mar 1 2021 libcufft.so.10.1.2.300 -rw-r--r-- 1 root root 7430712 Mar 1 2021 libnvperf_host.so lrwxrwxrwx 1 root root 21 Mar 1 2021 libnppc.so.10 -> libnppc.so.10.2.1.300 -rw-r--r-- 1 root root 679636 Mar 1 2021 libcudadevrt.a -rw-r--r-- 1 root root 36011742 Mar 1 2021 libcublasLt_static.a lrwxrwxrwx 1 root root 15 Mar 1 2021 libnppicc.so -> libnppicc.so.10 lrwxrwxrwx 1 root root 13 Mar 1 2021 libnppc.so -> libnppc.so.10 $

dusty-nv commented 1 year ago

@ByerRA it would appear you have the files, so I'm not sure why --runtime nvidia is not finding them. If you have another SD card, I would recommend trying a fresh JetPack image and checking that l4t-base works from the start.
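A minimal smoke test on a fresh flash might look like this (a sketch; the exact library list depends on your L4T version):

# plain runc: the image itself only ships stubs and static archives
sudo docker run --rm --runtime runc nvcr.io/nvidia/l4t-base:r32.7.1 ls /usr/local/cuda/lib64/
# nvidia runtime: libcublas.so.10 and friends should appear, mounted from the host
sudo docker run --rm --runtime nvidia nvcr.io/nvidia/l4t-base:r32.7.1 ls /usr/local/cuda/lib64/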

ByerRA commented 1 year ago

I flashed a new card with JetPack, and while I no longer get the errors about the missing CUDA libraries, I'm now getting the following build error...

[24/35] /usr/local/cuda-10.2/bin/nvcc -DWITH_CUDA -I/torchvision/torchvision/csrc -I/usr/local/lib/python3.6/dist-packages/torch/include -I/usr/local/lib/python3.6/dist-packages/torch/include/torch/csrc/api/include -I/usr/local/lib/python3.6/dist-packages/torch/include/TH -I/usr/local/lib/python3.6/dist-packages/torch/include/THC -I/usr/local/cuda-10.2/include -I/usr/include/python3.6m -c -c /torchvision/torchvision/csrc/ops/cuda/roi_align_kernel.cu -o /torchvision/build/temp.linux-aarch64-3.6/torchvision/torchvision/csrc/ops/cuda/roi_align_kernel.o -DCUDA_NO_HALF_OPERATORS -DCUDA_NO_HALF_CONVERSIONS -DCUDA_NO_BFLOAT16_CONVERSIONS -DCUDA_NO_HALF2_OPERATORS --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=1 -gencode=arch=compute_53,code=sm_53 -gencode=arch=compute_62,code=sm_62 -gencode=arch=compute_72,code=sm_72 -std=c++14
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/torch/utils/cpp_extension.py", line 1723, in _run_ninja_build
    env=env)
  File "/usr/lib/python3.6/subprocess.py", line 438, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "setup.py", line 508, in 'clean': clean, File "/usr/local/lib/python3.6/dist-packages/setuptools/init.py", line 153, in setup return distutils.core.setup(*attrs) File "/usr/lib/python3.6/distutils/core.py", line 148, in setup dist.run_commands() File "/usr/lib/python3.6/distutils/dist.py", line 955, in run_commands self.run_command(cmd) File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command cmd_obj.run() File "/usr/local/lib/python3.6/dist-packages/wheel/bdist_wheel.py", line 299, in run self.run_command('build') File "/usr/lib/python3.6/distutils/cmd.py", line 313, in run_command self.distribution.run_command(command) File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command cmd_obj.run() File "/usr/lib/python3.6/distutils/command/build.py", line 135, in run self.run_command(cmd_name) File "/usr/lib/python3.6/distutils/cmd.py", line 313, in run_command self.distribution.run_command(command) File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command cmd_obj.run() File "/usr/local/lib/python3.6/dist-packages/setuptools/command/build_ext.py", line 79, in run _build_ext.run(self) File "/usr/local/lib/python3.6/dist-packages/Cython/Distutils/old_build_ext.py", line 186, in run _build_ext.build_ext.run(self) File "/usr/lib/python3.6/distutils/command/build_ext.py", line 339, in run self.build_extensions() File "/usr/local/lib/python3.6/dist-packages/torch/utils/cpp_extension.py", line 735, in build_extensions build_ext.build_extensions(self) File "/usr/local/lib/python3.6/dist-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions _build_ext.build_ext.build_extensions(self) File "/usr/lib/python3.6/distutils/command/build_ext.py", line 448, in build_extensions self._build_extensions_serial() File "/usr/lib/python3.6/distutils/command/build_ext.py", line 473, in _build_extensions_serial self.build_extension(ext) File "/usr/local/lib/python3.6/dist-packages/setuptools/command/build_ext.py", line 202, in build_extension _build_ext.build_extension(self, ext) File "/usr/lib/python3.6/distutils/command/build_ext.py", line 533, in build_extension depends=ext.depends) File "/usr/local/lib/python3.6/dist-packages/torch/utils/cpp_extension.py", line 565, in unix_wrap_ninja_compile with_cuda=with_cuda) File "/usr/local/lib/python3.6/dist-packages/torch/utils/cpp_extension.py", line 1404, in _write_ninja_file_and_compile_objects error_prefix='Error compiling objects for extension') File "/usr/local/lib/python3.6/dist-packages/torch/utils/cpp_extension.py", line 1733, in _run_ninja_build raise RuntimeError(message) from e RuntimeError: Error compiling objects for extension The command '/bin/sh -c git clone --branch ${TORCHVISION_VERSION} --recursive --depth=1 https://github.com/pytorch/vision torchvision && cd torchvision && git checkout ${TORCHVISION_VERSION} && python3 setup.py bdist_wheel && cp dist/torchvision.whl /opt && pip3 install --no-cache-dir --verbose /opt/torchvision*.whl && cd ../ && rm -rf torchvision' returned a non-zero code: 1 Traceback (most recent call last): File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main "main", mod_spec) File "/usr/lib/python3.6/runpy.py", line 85, in _run_code exec(code, run_globals) File "/home/rbyer/jetson-containers/jetson_containers/build.py", line 95, in build_container(args.name, args.packages, args.base, args.build_flags, args.simulate, args.skip_tests, args.test_only, args.push) File 
"/home/rbyer/jetson-containers/jetson_containers/container.py", line 128, in build_container status = subprocess.run(cmd.replace(NEWLINE, ' '), executable='/bin/bash', shell=True, check=True)
File "/usr/lib/python3.6/subprocess.py", line 438, in run output=stdout, stderr=stderr) subprocess.CalledProcessError: Command 'docker build --network=host --tag jetson_l4t-ml:r32.7.4-torchvision --file /home/rbyer/jetson-containers/packages/pytorch/torchvision/Dockerfile --build-arg BASE_IMAGE=jetson_l4t-ml:r32.7.4-pytorch --build-arg TORCHVISION_VERSION="v0.11.1" --build-arg TORCH_CUDA_ARCH_LIST="5.3;6.2;7.2" /home/rbyer/jetson-containers/packages/pytorch/torchvision 2>&1 | tee /home/rbyer/jetson-containers/logs/20230922_132123/build/jetson_l4t-ml_r32.7.4-torchvision.txt; exit ${PIPESTATUS[0]}' returned non-zero exit status 1.

dusty-nv commented 1 year ago

@ByerRA there's not an actual compilation error in there, I think it ran out of memory. Can you try mounting more swap?

https://github.com/dusty-nv/jetson-containers/blob/master/docs/setup.md#mounting-swap
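For reference, the swap setup in that doc boils down to something like the following (sizes are just an example for a Nano; adjust to your storage):

sudo systemctl disable nvzramconfig
sudo fallocate -l 4G /mnt/4GB.swap
sudo chmod 600 /mnt/4GB.swap
sudo mkswap /mnt/4GB.swap
sudo swapon /mnt/4GB.swap
# optionally make it persistent across reboots:
# echo '/mnt/4GB.swap none swap sw 0 0' | sudo tee -a /etc/fstab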