Closed: IamShubhamGupto closed this issue 4 months ago
Hey @dusty-nv, just bumping this up. Please let me know how to proceed with building containers for this.
I upgraded the CUDA version on the AGX Xavier from 11.4 to 11.8, but when I rebuild the containers they are still on 11.4. Let me know the right way to get the Docker containers to run 11.8.
Here is the current stack trace when I try to use an existing CUDA 11.8 image from NGC:
xavier01@ubuntu:~/Documents/workspace/jetson-containers$ sudo ./build.sh --base nvcr.io/nvidia/cuda:11.8.0-devel-ubuntu20.04 --name cuda118 torch:2.1 torchvision tensorrt onnx onnxruntime
Namespace(base='nvcr.io/nvidia/cuda:11.8.0-devel-ubuntu20.04', build_flags='', list_packages=False, logs='', multiple=False, name='cuda118', no_github_api=False, package_dirs=[''], packages=['torch:2.1', 'torchvision', 'tensorrt', 'onnx', 'onnxruntime'], push='', show_packages=False, simulate=False, skip_errors=False, skip_packages=[''], skip_tests=[''], test_only=[''], verbose=False)
-- L4T_VERSION=35.5.0
-- JETPACK_VERSION=5.1
-- CUDA_VERSION=11.8.89
-- LSB_RELEASE=20.04 (focal)
fatal: invalid reference: origin/dev
ERROR:root:failed to update container registry cache from GitHub (/home/xavier01/Documents/workspace/jetson-containers/data/containers.json)
ERROR:root:return code 128 > cd /home/xavier01/Documents/workspace/jetson-containers && git fetch origin dev --quiet && git checkout --quiet origin/dev -- data/containers.json
-- Building containers ['build-essential', 'cuda', 'cudnn', 'python', 'tensorrt', 'numpy', 'cmake', 'onnx', 'torch:2.1', 'pytorch', 'torchvision', 'onnxruntime']
-- Building container cuda118:r35.5.0-build-essential
sudo DOCKER_BUILDKIT=0 docker build --network=host --tag cuda118:r35.5.0-build-essential \
--file /home/xavier01/Documents/workspace/jetson-containers/packages/build-essential/Dockerfile \
--build-arg BASE_IMAGE=nvcr.io/nvidia/cuda:11.8.0-devel-ubuntu20.04 \
/home/xavier01/Documents/workspace/jetson-containers/packages/build-essential \
2>&1 | tee /home/xavier01/Documents/workspace/jetson-containers/logs/20240311_014858/build/cuda118_r35.5.0-build-essential.txt; exit ${PIPESTATUS[0]}
DEPRECATED: The legacy builder is deprecated and will be removed in a future release.
BuildKit is currently disabled; enable it by removing the DOCKER_BUILDKIT=0
environment-variable.
Sending build context to Docker daemon 13.82kB
Step 1/5 : ARG BASE_IMAGE
Step 2/5 : FROM ${BASE_IMAGE}
---> 886bbfc5e8c5
Step 3/5 : ENV DEBIAN_FRONTEND=noninteractive
---> Using cache
---> a36a36041190
Step 4/5 : RUN apt-get update && apt-get install -y --no-install-recommends build-essential software-properties-common apt-transport-https ca-certificates lsb-release pkg-config gnupg git wget curl nano zip unzip && rm -rf /var/lib/apt/lists/* && apt-get clean
---> Running in 22d9d405d3a2
failed to create task for container: failed to create shim task: OCI runtime create failed: nvidia-container-runtime did not terminate successfully: exit status 1: unknown
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/xavier01/Documents/workspace/jetson-containers/jetson_containers/build.py", line 102, in <module>
build_container(args.name, args.packages, args.base, args.build_flags, args.simulate, args.skip_tests, args.test_only, args.push, args.no_github_api)
File "/home/xavier01/Documents/workspace/jetson-containers/jetson_containers/container.py", line 143, in build_container
status = subprocess.run(cmd.replace(_NEWLINE_, ' '), executable='/bin/bash', shell=True, check=True)
File "/usr/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'sudo DOCKER_BUILDKIT=0 docker build --network=host --tag cuda118:r35.5.0-build-essential --file /home/xavier01/Documents/workspace/jetson-containers/packages/build-essential/Dockerfile --build-arg BASE_IMAGE=nvcr.io/nvidia/cuda:11.8.0-devel-ubuntu20.04 /home/xavier01/Documents/workspace/jetson-containers/packages/build-essential 2>&1 | tee /home/xavier01/Documents/workspace/jetson-containers/logs/20240311_014858/build/cuda118_r35.5.0-build-essential.txt; exit ${PIPESTATUS[0]}' returned non-zero exit status 1.
xavier01@ubuntu:~/Documents/workspace/jetson-containers$
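(Aside, in case it helps anyone hitting the `nvidia-container-runtime did not terminate successfully` error above: legacy-builder `RUN` steps execute under Docker's default runtime, and jetson-containers expects `nvidia` to be the default. Whether that is the cause here is my assumption, not confirmed from the log, but it is worth checking `/etc/docker/daemon.json` looks roughly like this:)

```json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
```

After editing it, restart the daemon with `sudo systemctl restart docker`.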
UPDATE
I was able to build an image using https://github.com/dusty-nv/jetson-containers/issues/258 and publishing it to shubhamgupto/jp5.1-cuda11.8
Currently running:
sudo ./build.sh --base shubhamgupto/jp5.1-cuda11.8 --name cuda118 tensorrt torch:2.1 torchvision onnx onnxruntime
Logs:
Successfully built c4205a6c747d
Successfully tagged cuda118:r35.5.0-python
-- Tagging container cuda118:r35.5.0-python -> cuda118:r35.5.0-tensorrt
sudo docker tag cuda118:r35.5.0-python cuda118:r35.5.0-tensorrt
-- Testing container cuda118:r35.5.0-tensorrt (tensorrt/test.sh)
sudo docker run -t --rm --runtime=nvidia --network=host \
--volume /home/xavier01/Documents/workspace/jetson-containers/packages/tensorrt:/test \
--volume /home/xavier01/Documents/workspace/jetson-containers/data:/data \
--workdir /test \
cuda118:r35.5.0-tensorrt \
/bin/bash -c '/bin/bash test.sh' \
2>&1 | tee /home/xavier01/Documents/workspace/jetson-containers/logs/20240311_021045/test/cuda118_r35.5.0-tensorrt_test.sh.txt; exit ${PIPESTATUS[0]}
test.sh: line 3: /usr/src/tensorrt/bin/trtexec: No such file or directory
Traceback (most recent call last):
File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'tensorrt'
Traceback (most recent call last):
File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/xavier01/Documents/workspace/jetson-containers/jetson_containers/build.py", line 102, in <module>
build_container(args.name, args.packages, args.base, args.build_flags, args.simulate, args.skip_tests, args.test_only, args.push, args.no_github_api)
File "/home/xavier01/Documents/workspace/jetson-containers/jetson_containers/container.py", line 150, in build_container
test_container(container_name, pkg, simulate)
File "/home/xavier01/Documents/workspace/jetson-containers/jetson_containers/container.py", line 322, in test_container
status = subprocess.run(cmd.replace(_NEWLINE_, ' '), executable='/bin/bash', shell=True, check=True)
File "/usr/lib/python3.8/subprocess.py", line 516, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'sudo docker run -t --rm --runtime=nvidia --network=host --volume /home/xavier01/Documents/workspace/jetson-containers/packages/tensorrt:/test --volume /home/xavier01/Documents/workspace/jetson-containers/data:/data --workdir /test cuda118:r35.5.0-tensorrt /bin/bash -c '/bin/bash test.sh' 2>&1 | tee /home/xavier01/Documents/workspace/jetson-containers/logs/20240311_021045/test/cuda118_r35.5.0-tensorrt_test.sh.txt; exit ${PIPESTATUS[0]}' returned non-zero exit status 1.
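(Side note on the `; exit ${PIPESTATUS[0]}` suffix jetson-containers appends to these commands: piping through `tee` would otherwise make the Python wrapper see `tee`'s exit code instead of the real failure. A minimal bash demonstration:)

```shell
#!/usr/bin/env bash
# Without PIPESTATUS, a pipeline reports the exit code of its LAST command,
# so a failing build piped through `tee` would look successful.
false | tee /tmp/demo.log
echo "pipeline exit code: $?"          # tee succeeded, failure is hidden

false | tee /tmp/demo.log
status=${PIPESTATUS[0]}                # exit code of `false`, not of `tee`
echo "first-command exit code: $status"
```

This is why `subprocess.run(..., check=True)` in `container.py` can still raise `CalledProcessError` even though the command ends in a `tee`.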
I find this a bit confusing: the build is supposed to install TensorRT for me, but instead it just complains that it doesn't exist. Should TensorRT already be present in the base image?
@IamShubhamGupto you would need to install it into your base image, but the TensorRT that's released is built against the default CUDA version that comes with JetPack.
On JetPack 5, I just use the l4t-jetpack base container, which already has TensorRT. On JetPack 6, I install specific versions of CUDA (see the config.py files in the container packages for cuda, cudnn, tensorrt, etc.)
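(For reference, a JetPack 5 build along those lines might look like the sketch below; the `l4t-jetpack` tag is illustrative, so pick the one on NGC that matches your L4T version:)

```shell
# Sketch only: use l4t-jetpack (which bundles CUDA/cuDNN/TensorRT) as the
# base image; the r35.x tag must match your installed L4T release.
sudo ./build.sh --base nvcr.io/nvidia/l4t-jetpack:r35.4.1 \
     --name my_jp5_container pytorch torchvision tensorrt
```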
Hey @dusty-nv
Thank you for the clarification. I am using l4t-base and then installing CUDA 11.8, so there are no conflicts with the default CUDA installation in l4t-jetpack.
Would it be easier to just use l4t-jetpack and reinstall the CUDA version I'm interested in? That way TensorRT is already installed and I just have to update the path to CUDA 11.8.
I'm not sure; I haven't done that before. I would probably start from l4t-base, since in Docker there is no 'deleting' of previous layers.
If you look at my CUDA/cuDNN/TensorRT Dockerfiles for JetPack 6, you'll see I install them from Debian packages that I download.
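(To illustrate that "install from downloaded Debian packages on top of l4t-base" approach, a hypothetical Dockerfile sketch follows. The installer filename, base tag, and package names here are placeholders; the real ones are in the cuda/cudnn/tensorrt package Dockerfiles and config.py files:)

```dockerfile
# Hypothetical sketch: start from l4t-base and layer a specific CUDA on top.
ARG BASE_IMAGE=nvcr.io/nvidia/l4t-base:r35.4.1
FROM ${BASE_IMAGE}

# Placeholder installer name: download the actual CUDA 11.8 arm64 (tegra)
# local-repo .deb from NVIDIA's CUDA archive for your platform.
COPY cuda-tegra-repo-ubuntu2004-11-8-local_arm64.deb /tmp/
RUN dpkg -i /tmp/cuda-tegra-repo-ubuntu2004-11-8-local_arm64.deb && \
    cp /var/cuda-tegra-repo-*/cuda-*-keyring.gpg /usr/share/keyrings/ && \
    apt-get update && \
    apt-get install -y cuda-toolkit-11-8 && \
    rm -rf /var/lib/apt/lists/* /tmp/*.deb

# Make the new toolkit the one found first on the path.
ENV PATH=/usr/local/cuda-11.8/bin:${PATH} \
    LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64:${LD_LIBRARY_PATH}
```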
@dusty-nv thanks, going with l4t-base seems to be the correct way.
I made an image shubhamgupto/jp5.1-cuda11.8-cudnn9-trt8.5 with the latest cuDNN 9, but when I pass it as a base image to build, it looks for the cuDNN 8 samples. I've run the container and checked /usr/src/cudnn_samples_v8/, and it's empty.
How do I tell build.sh to use /usr/src/cudnn_samples_v9/? Thank you.
You will need to change the reference to /usr/src/cudnn_samples_v8 in https://github.com/dusty-nv/jetson-containers/blob/master/packages/cuda/cudnn/test.sh
Depending on what other packages you intend to use, you may need to make other modifications, e.g. rebuilding the PyTorch wheel for a different CUDA version, etc.
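(If it helps, that edit can be a one-line `sed`, assuming the script references the samples path literally. The sketch below demonstrates it on a throwaway file; in a real checkout you would run the same `sed` directly on `packages/cuda/cudnn/test.sh`:)

```shell
# Patch sketch: swap the v8 cuDNN samples path for v9.
# Demonstrated on a throwaway copy of the offending line; in the repo, run
# the sed command on packages/cuda/cudnn/test.sh instead.
printf '/usr/src/cudnn_samples_v8/conv_sample\n' > /tmp/cudnn_test.sh
sed -i 's|cudnn_samples_v8|cudnn_samples_v9|g' /tmp/cudnn_test.sh
cat /tmp/cudnn_test.sh   # now references /usr/src/cudnn_samples_v9/
```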
@dusty-nv thanks for pointing that out. I noticed TensorRT 8.5 is incompatible with cuDNN 9, so I'm sticking with cuDNN 8.6.
Reference: https://docs.nvidia.com/deeplearning/tensorrt/support-matrix/index.html
Hey Dusty,
I want to develop a repository on the AGX Xavier running JetPack 5.1.3 using https://github.com/state-spaces/mamba. However, it has a hard requirement of CUDA 11.6+.
When I build a container, how do I give it a specific CUDA version?
Thank you
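(For anyone landing here later: on newer jetson-containers releases targeting JetPack 6, the build system can pin the CUDA version via an environment variable. I'm not certain this applies to JetPack 5 on the Xavier, so treat the sketch below as a pointer to check against the current docs rather than a confirmed recipe:)

```shell
# Sketch: select the CUDA version for the build (JetPack 6 / newer
# jetson-containers; availability on JetPack 5 is uncertain).
CUDA_VERSION=11.8 ./build.sh --name mamba_container pytorch
```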