dusty-nv / jetson-containers

Machine Learning Containers for NVIDIA Jetson and JetPack-L4T
MIT License

JetPack 5 CUDA 11.6+ #407

Closed IamShubhamGupto closed 4 months ago

IamShubhamGupto commented 4 months ago

Hey Dusty,

I want to develop a project on the AGX Xavier running JetPack 5.1.3 using https://github.com/state-spaces/mamba. However, it has a hard requirement of CUDA 11.6+.

When I build a container, how do I give it a specific CUDA version?

Thank you

IamShubhamGupto commented 4 months ago

Hey @dusty-nv, just bumping this up; please let me know how to proceed with building containers for this.

I upgraded the CUDA version on the AGX Xavier from 11.4 to 11.8, but when I rebuild the containers they are still on 11.4. Let me know the right way to get the Docker containers to run 11.8.
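
For context, build.sh seems to pick up CUDA_VERSION from the host's default toolkit, so I believe the default symlink has to point at 11.8 before building. A rough sketch, assuming the 11.8 toolkit was installed from NVIDIA's .deb packages (which register a cuda entry with update-alternatives):

# Point the default /usr/local/cuda symlink at 11.8 so build.sh detects it
# (update-alternatives only applies if the toolkit came from NVIDIA's .deb packages)
sudo update-alternatives --set cuda /usr/local/cuda-11.8

ls -l /usr/local/cuda                 # should now resolve to /usr/local/cuda-11.8
/usr/local/cuda/bin/nvcc --version    # should report release 11.8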

IamShubhamGupto commented 4 months ago

The current stack trace when I try to use an existing CUDA 11.8 image from NGC:

xavier01@ubuntu:~/Documents/workspace/jetson-containers$ sudo ./build.sh --base nvcr.io/nvidia/cuda:11.8.0-devel-ubuntu20.04 --name cuda118 torch:2.1 torchvision tensorrt onnx onnxruntime 
Namespace(base='nvcr.io/nvidia/cuda:11.8.0-devel-ubuntu20.04', build_flags='', list_packages=False, logs='', multiple=False, name='cuda118', no_github_api=False, package_dirs=[''], packages=['torch:2.1', 'torchvision', 'tensorrt', 'onnx', 'onnxruntime'], push='', show_packages=False, simulate=False, skip_errors=False, skip_packages=[''], skip_tests=[''], test_only=[''], verbose=False)
-- L4T_VERSION=35.5.0
-- JETPACK_VERSION=5.1
-- CUDA_VERSION=11.8.89
-- LSB_RELEASE=20.04 (focal)
fatal: invalid reference: origin/dev
ERROR:root:failed to update container registry cache from GitHub (/home/xavier01/Documents/workspace/jetson-containers/data/containers.json)
ERROR:root:return code 128 > cd /home/xavier01/Documents/workspace/jetson-containers && git fetch origin dev --quiet && git checkout --quiet origin/dev -- data/containers.json
-- Building containers  ['build-essential', 'cuda', 'cudnn', 'python', 'tensorrt', 'numpy', 'cmake', 'onnx', 'torch:2.1', 'pytorch', 'torchvision', 'onnxruntime']
-- Building container cuda118:r35.5.0-build-essential

sudo DOCKER_BUILDKIT=0 docker build --network=host --tag cuda118:r35.5.0-build-essential \
--file /home/xavier01/Documents/workspace/jetson-containers/packages/build-essential/Dockerfile \
--build-arg BASE_IMAGE=nvcr.io/nvidia/cuda:11.8.0-devel-ubuntu20.04 \
/home/xavier01/Documents/workspace/jetson-containers/packages/build-essential \
2>&1 | tee /home/xavier01/Documents/workspace/jetson-containers/logs/20240311_014858/build/cuda118_r35.5.0-build-essential.txt; exit ${PIPESTATUS[0]}

DEPRECATED: The legacy builder is deprecated and will be removed in a future release.
            BuildKit is currently disabled; enable it by removing the DOCKER_BUILDKIT=0
            environment-variable.

Sending build context to Docker daemon  13.82kB
Step 1/5 : ARG BASE_IMAGE
Step 2/5 : FROM ${BASE_IMAGE}
 ---> 886bbfc5e8c5
Step 3/5 : ENV DEBIAN_FRONTEND=noninteractive
 ---> Using cache
 ---> a36a36041190
Step 4/5 : RUN apt-get update &&     apt-get install -y --no-install-recommends           build-essential         software-properties-common          apt-transport-https         ca-certificates         lsb-release         pkg-config          gnupg           git         wget        curl        nano        zip         unzip     && rm -rf /var/lib/apt/lists/*     && apt-get clean
 ---> Running in 22d9d405d3a2
failed to create task for container: failed to create shim task: OCI runtime create failed: nvidia-container-runtime did not terminate successfully: exit status 1: unknown
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/xavier01/Documents/workspace/jetson-containers/jetson_containers/build.py", line 102, in <module>
    build_container(args.name, args.packages, args.base, args.build_flags, args.simulate, args.skip_tests, args.test_only, args.push, args.no_github_api)
  File "/home/xavier01/Documents/workspace/jetson-containers/jetson_containers/container.py", line 143, in build_container
    status = subprocess.run(cmd.replace(_NEWLINE_, ' '), executable='/bin/bash', shell=True, check=True)  
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'sudo DOCKER_BUILDKIT=0 docker build --network=host --tag cuda118:r35.5.0-build-essential --file /home/xavier01/Documents/workspace/jetson-containers/packages/build-essential/Dockerfile --build-arg BASE_IMAGE=nvcr.io/nvidia/cuda:11.8.0-devel-ubuntu20.04 /home/xavier01/Documents/workspace/jetson-containers/packages/build-essential 2>&1 | tee /home/xavier01/Documents/workspace/jetson-containers/logs/20240311_014858/build/cuda118_r35.5.0-build-essential.txt; exit ${PIPESTATUS[0]}' returned non-zero exit status 1.
xavier01@ubuntu:~/Documents/workspace/jetson-containers$ 
IamShubhamGupto commented 4 months ago

UPDATE: I was able to build an image using https://github.com/dusty-nv/jetson-containers/issues/258 and published it to shubhamgupto/jp5.1-cuda11.8.

Currently running:

sudo ./build.sh --base shubhamgupto/jp5.1-cuda11.8 --name cuda118 tensorrt torch:2.1 torchvision  onnx onnxruntime 

Logs:

Successfully built c4205a6c747d
Successfully tagged cuda118:r35.5.0-python
-- Tagging container cuda118:r35.5.0-python -> cuda118:r35.5.0-tensorrt
sudo docker tag cuda118:r35.5.0-python cuda118:r35.5.0-tensorrt

-- Testing container cuda118:r35.5.0-tensorrt (tensorrt/test.sh)

sudo docker run -t --rm --runtime=nvidia --network=host \
--volume /home/xavier01/Documents/workspace/jetson-containers/packages/tensorrt:/test \
--volume /home/xavier01/Documents/workspace/jetson-containers/data:/data \
--workdir /test \
cuda118:r35.5.0-tensorrt \
/bin/bash -c '/bin/bash test.sh' \
2>&1 | tee /home/xavier01/Documents/workspace/jetson-containers/logs/20240311_021045/test/cuda118_r35.5.0-tensorrt_test.sh.txt; exit ${PIPESTATUS[0]}

test.sh: line 3: /usr/src/tensorrt/bin/trtexec: No such file or directory
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'tensorrt'
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/xavier01/Documents/workspace/jetson-containers/jetson_containers/build.py", line 102, in <module>
    build_container(args.name, args.packages, args.base, args.build_flags, args.simulate, args.skip_tests, args.test_only, args.push, args.no_github_api)
  File "/home/xavier01/Documents/workspace/jetson-containers/jetson_containers/container.py", line 150, in build_container
    test_container(container_name, pkg, simulate)
  File "/home/xavier01/Documents/workspace/jetson-containers/jetson_containers/container.py", line 322, in test_container
    status = subprocess.run(cmd.replace(_NEWLINE_, ' '), executable='/bin/bash', shell=True, check=True)
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'sudo docker run -t --rm --runtime=nvidia --network=host --volume /home/xavier01/Documents/workspace/jetson-containers/packages/tensorrt:/test --volume /home/xavier01/Documents/workspace/jetson-containers/data:/data --workdir /test cuda118:r35.5.0-tensorrt /bin/bash -c '/bin/bash test.sh' 2>&1 | tee /home/xavier01/Documents/workspace/jetson-containers/logs/20240311_021045/test/cuda118_r35.5.0-tensorrt_test.sh.txt; exit ${PIPESTATUS[0]}' returned non-zero exit status 1.

I find this a bit confusing: it is supposed to install TensorRT for me, but instead it just complains that it does not exist. Should TensorRT already be present in the base image?

dusty-nv commented 4 months ago

@IamShubhamGupto you would need to install it into your base image, but the TensorRT that's released is built against the default version of CUDA that comes with JetPack.

On JetPack 5, I just use the l4t-jetpack base container, which already has TensorRT. On JetPack 6, I install specific versions of CUDA (see the config.py files in the container packages for cuda, cudnn, tensorrt, etc.).
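
On JetPack 5 that would look something like the command below (a sketch; the tag is illustrative, so match it to your L4T version):

# Build on top of l4t-jetpack, which already ships CUDA/cuDNN/TensorRT
# matching the JetPack release (r35.4.1 is an example tag)
sudo ./build.sh --base nvcr.io/nvidia/l4t-jetpack:r35.4.1 --name jp5 tensorrt torch:2.1 torchvision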

IamShubhamGupto commented 4 months ago

Hey @dusty-nv, thank you for the clarification. I am using l4t-base and then installing CUDA 11.8, so there are no conflicts with the default CUDA installation in l4t-jetpack.

Would it be easier to just use l4t-jetpack and reinstall the CUDA version I'm interested in? That way TensorRT is already installed and I just have to update the path to CUDA 11.8.

dusty-nv commented 4 months ago

I'm not sure, I haven't done that before. I would probably start from l4t-base, since in Docker there is no 'deleting' of previous layers.

If you look at my CUDA/cuDNN/TensorRT dockerfiles for JetPack 6, you'll see I install them from Debian packages that I download.
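
Adapted to CUDA 11.8 on JetPack 5, that approach would look roughly like this (a sketch; the installer filename follows NVIDIA's CUDA 11.8 aarch64-jetson download archive and should be verified there before use):

# Install the CUDA 11.8 local repo for Jetson (Ubuntu 20.04, arm64),
# then the toolkit from it; run as RUN steps in a Dockerfile or on the host
wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda-tegra-repo-ubuntu2004-11-8-local_11.8.0-1_arm64.deb
sudo dpkg -i cuda-tegra-repo-ubuntu2004-11-8-local_11.8.0-1_arm64.deb
sudo cp /var/cuda-tegra-repo-ubuntu2004-11-8-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get install -y cuda-toolkit-11-8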

IamShubhamGupto commented 4 months ago

@dusty-nv thanks, going with l4t-base seems to be the correct way.

I made an image shubhamgupto/jp5.1-cuda11.8-cudnn9-trt8.5 with the latest cuDNN 9, but when I pass it as a base image to build.sh, it looks for the cuDNN 8 samples. I've run the container and checked /usr/src/cudnn_samples_v8/, and it's empty.

How do I tell build.sh to use /usr/src/cudnn_samples_v9/? Thank you.

dusty-nv commented 4 months ago

You will need to change the reference to /usr/src/cudnn_samples_v8 in https://github.com/dusty-nv/jetson-containers/blob/master/packages/cuda/cudnn/test.sh

Depending on what other packages you intend to use, you may need to make other modifications, rebuild the PyTorch wheel for a different CUDA version, etc.
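
For example, something along these lines before building (a sketch; check the actual contents of test.sh first, since the path pattern here is assumed from the build log above):

# Point the cuDNN samples test at the v9 directory instead of v8
sed -i 's|cudnn_samples_v8|cudnn_samples_v9|g' packages/cuda/cudnn/test.sh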

IamShubhamGupto commented 4 months ago

@dusty-nv thanks for pointing it out. I noticed TensorRT 8.5 is incompatible with cuDNN 9, so I'm sticking with cuDNN 8.6.

Reference: https://docs.nvidia.com/deeplearning/tensorrt/support-matrix/index.html