shahizat opened this issue 1 year ago
Hi @shahizat ,
We do not officially support Orin yet but we have colleagues (in our automotive division) who are working on enabling TensorRT-LLM on Orin. I’ll ask them if they can provide you with feedback. Also, if you want to give it a try, we should be able to help you (as much as other priorities allow us).
Thanks, Julien
@jdemouth-nvidia TensorRT-LLM uses CUDA 12 by default, but Jetson Orin only supports CUDA 11. Is it possible to run TensorRT-LLM on the NVIDIA AGX Orin developer kit?
@shahizat have you succeeded in running TensorRT-LLM on the NVIDIA AGX Orin developer kit?
@jdemouth-nvidia can I use 'nvcr.io/nvidia/l4t-tensorrt:r8.5.2.2-devel' instead of 'make -C docker release_run' when building TensorRT-LLM?
Looking for Jetson support, I found IGX Orin, yet the article doesn't mention using TensorRT-LLM on Orin's iGPU; instead it focuses on the dGPU.
I also need AGX Orin support. @jdemouth-nvidia, any progress on it?
+mark
@jdemouth-nvidia Do you have any plans or roadmap?
+mark
Hi all. Jetson / Orin support is still pending & will not be official in the next few releases. Once we have a more concrete timeline we will update here.
I can install CUDA 12.2 on Jetson, but I can't install the TensorRT libs (when running the step: python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt). Please help me.
Hi @whk6688, can you please look at #488? As of right now it is not formally supported.
When I convert the Llama model, it said: OSError: libcuda.so.1: cannot open shared object file: No such file or directory. My platform is Orin, so I do not have a GPU. Is there any parameter for converting the model without a GPU? Thanks.
Try to find libcuda.so.1 on your system, e.g. /usr/lib/libcuda.so.1, and then set the env var:
export LD_PRELOAD='/usr/lib/libcuda.so.1'
Maybe it will work.
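If it helps, a minimal sketch of that workaround (the Tegra path below is an assumption; use whatever `find` reports on your system):

```bash
# Minimal sketch, assuming a stock JetPack install: on Jetson the driver
# library usually lives under the Tegra directory rather than /usr/lib,
# so locate it first and preload whichever path is reported.
find / -name 'libcuda.so.1' 2>/dev/null
# The path below is an assumption; substitute the path that find printed.
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
# Re-run the model conversion step afterwards in the same shell.
```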
OK.
In fact, I ran: make -C docker release_run.
ERROR:
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'csv'
invoking the NVIDIA Container Runtime Hook directly (e.g. specifying the docker --gpus flag) is not supported. Please use the NVIDIA Container Runtime (e.g. specify the --runtime=nvidia flag) instead.: unknown
For Orin, how can I avoid this error?
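For reference, the error above means the container was started with the --gpus flag; on Jetson the NVIDIA runtime has to be selected explicitly instead. A hedged sketch of a manual run follows (the image name is an assumption; use whichever image you built locally):

```bash
# Hedged sketch: Jetson's Docker setup rejects the --gpus flag that the
# release_run target passes; selecting the NVIDIA runtime directly works.
# The image name below is an assumption; use whichever image you built locally.
docker run --rm -it \
    --runtime nvidia \
    -v "$(pwd)":/workspace \
    tensorrt_llm/release:latest \
    /bin/bash
```

Another common approach on Jetson is to set `"default-runtime": "nvidia"` in /etc/docker/daemon.json and restart the Docker daemon, so containers pick up the NVIDIA runtime without extra flags.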
Hi, all +mark
is there any update?
Hi all, is there currently any other solution to accelerate LLM inference besides TensorRT-LLM on the NVIDIA Jetson AGX Orin dev kit? Thanks.
Hi @WangFengtu1996, I highly recommend you check out the Jetson LLM projects by @dusty-nv, especially the implementation of inference via MLC LLM (much faster than llama.cpp). You can find tutorials here: https://www.jetson-ai-lab.com/tutorial-intro.html
@shahizat have you succeeded in running TensorRT-LLM on the NVIDIA AGX Orin developer kit? Same question.
@shahizat have you succeeded in running TensorRT-LLM on the NVIDIA AGX Orin developer kit?
No, TRT-LLM isn't available for Jetson yet (there are other dependencies on newer versions of TensorRT and such), and we hope to have a preview release closer to the middle of this year. Until then, I concur with @shahizat to use MLC, which is also highly optimized:
When TRT-LLM is released for Jetson, the local_llm package that I use will support it as well. That's the wrapper I use for running optimized LLM APIs in-process (for efficient video streaming for VLMs and handling of large embeddings).
Oh, that's good news @dusty-nv! I tried to compile tensorrt-llm on my Orin with some success using the jetson-containers stack, based on tensorrt 9.3.0.1 (which installed fine on Orin):
21:53:57 CUDA_ARCHS: 87-real
21:53:57 + TRT_CUDA_VERSION=12.2
21:53:57 + RELEASE_URL_TRT=https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/9.3.0/tensorrt-9.3.0.1.ubuntu-22.04.aarch64-gnu.cuda-12.2.tar.gz
[...]
21:53:57 -- Building for TensorRT version: 9.3.0, library version: 9
[...]
21:53:57 + pip3 show tensorrt_llm
21:53:59 Name: tensorrt_llm
21:53:59 Version: 0.9.0.dev2024031900
21:53:59 Summary: TensorRT-LLM: A TensorRT Toolbox for Large Language Models
21:53:59 Home-page: https://github.com/NVIDIA/TensorRT-LLM
21:53:59 Author: NVIDIA Corporation
21:53:59 Author-email:
21:53:59 License: Apache License 2.0
21:53:59 Location: /usr/local/lib/python3.10/dist-packages
[...]
21:53:59 [TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024031900
But it was not very stable when testing (probably I just need to research a bit more how to run proper quantization tests):
17:09:11 [03/21/2024-16:09:11] [TRT-LLM] [V] Validating dimension:batch_size_cache, ranges for this dim are:{(1, 128, 256)}
17:09:11 [03/21/2024-16:09:11] [TRT-LLM] [V] Validating dimension:beam_width, ranges for this dim are:{(1, 1, 1)}
17:09:11 [03/21/2024-16:09:11] [TRT-LLM] [V] Validating dimension:max_seq_len, ranges for this dim are:{(1, 356, 712)}
17:09:12 [03/21/2024-16:09:12] [TRT-LLM] [I] Build TensorRT engine gpt_350m_float16_tp1_rank0.engine
17:09:12 [03/21/2024-16:09:12] [TRT] [E] 9: [standardEngineBuilder.cpp::buildEngine::2265] Error Code 9: Internal Error (Networks with FP8 precision require hardware with FP8 support.)
17:09:12 [03/21/2024-16:09:12] [TRT-LLM] [E] Engine building failed, please check the error log.
17:09:12 Traceback (most recent call last):
17:09:12 File "/opt/tensorrt_llm/benchmarks/python/benchmark.py", line 412, in <module>
17:09:12 main(args)
17:09:12 File "/opt/tensorrt_llm/benchmarks/python/benchmark.py", line 299, in main
17:09:12 benchmarker = GPTBenchmark(args, batch_size_options, in_out_len_options,
17:09:12 File "/opt/tensorrt_llm/benchmarks/python/gpt_benchmark.py", line 102, in __init__
17:09:12 engine_buffer, build_time = build_gpt(args)
17:09:12 File "/opt/tensorrt_llm/benchmarks/python/build.py", line 843, in build_gpt
17:09:12 assert engine is not None, f'Failed to build engine for rank {runtime_rank}'
17:09:12 AssertionError: Failed to build engine for rank 0
With the below test case:
#!/usr/bin/env bash
set -ex
mkdir -p /opt/tensorrt_llm/benchmarks/trt_engines/gpt_350m
echo "testing python benchmark..."
python3 /opt/tensorrt_llm/benchmarks/python/benchmark.py \
-m gpt_350m \
--mode plugin \
--batch_size "1;8;64" \
--input_output_len "60,20;128,20" \
--log_level verbose \
--output_dir /opt/tensorrt_llm/benchmarks/trt_engines/gpt_350m \
--quantization fp8 \
--enable_cuda_graph \
--strongly_typed
echo "python benchmark OK"
tensorrt_llm Dockerfile:

```Dockerfile
#---
# name: tensorrt_llm
# group: llm
# config: config.py
# depends: [python, pytorch, optimum, tensorrt, tritonserver]
# test: [test.py, test_python_benchmark.sh, test_cpp_benchmark.sh]
# requires: '>=35'
# notes: The `tensorrt-llm` wheel that's built is saved in the container under `/opt`. https://zhuanlan.zhihu.com/p/663915644
#---
ARG BASE_IMAGE
FROM ${BASE_IMAGE}

ARG TENSORRT_LLM_BRANCH \
    TORCH_CUDA_ARCH_LIST \
    CUDA_ARCHS \
    CUDA_VERSION \
    CUDA_VERSION_MAJOR \
    CUDA_VERSION_MINOR \
    TRT_TARGETARCH="aarch64" \
    SRC_DIR="/tmp/TensorRT-LLM" \
    DIST_DIR="/opt/tensorrt_llm" \
    CPP_BUILD_DIR="/opt/tensorrt_llm/cpp/build"

# Install build dependencies and clone repository
RUN set -ex \
    && git clone --branch=${TENSORRT_LLM_BRANCH} --depth=1 https://github.com/NVIDIA/TensorRT-LLM.git ${SRC_DIR} \
    && git -C ${SRC_DIR} submodule update --init --recursive \
    && git -C ${SRC_DIR} lfs pull

# Apply sed commands
RUN set -ex \
    && sed -i \
        -e 's|^torch.*|torch|g' \
        -e 's|^tensorrt.*|tensorrt|g' \
        -e 's|^transformers.*|transformers|g' \
        -e 's|^sentencepiece.*|sentencepiece|g' \
        -e 's|^diffusers.*|diffusers|g' \
        -e 's|^accelerate.*|accelerate|g' \
        ${SRC_DIR}/requirements.txt \
    && sed -i \
        -e 's|${NCCL_LIB}||g' \
        ${SRC_DIR}/cpp/tensorrt_llm/CMakeLists.txt \
        ${SRC_DIR}/cpp/tensorrt_llm/plugins/CMakeLists.txt \
    && sed -i \
        -e "s|CUDA_VER=\"[^\"]*\"|CUDA_VER=\"$CUDA_VERSION_MAJOR.$CUDA_VERSION_MINOR\"|g" \
        -e 's|^ install_ubuntu_requirements||g' \
        ${SRC_DIR}/docker/common/install_tensorrt.sh \
    && sed -i '96d' ${SRC_DIR}/docker/common/install_tensorrt.sh \
    \
    # Install TensorRT 9.x \
    && chmod +x ${SRC_DIR}/docker/common/install_*.sh \
    && ${SRC_DIR}/docker/common/install_tensorrt.sh

ENV LD_LIBRARY_PATH=/usr/local/tensorrt/lib:${LD_LIBRARY_PATH}

RUN set -ex \
    # Build TensorRT-LLM \
    && echo "CUDA_VERSION: ${CUDA_VERSION}" \
    && echo "CUDA_ARCHS: ${CUDA_ARCHS}" \
    && ${SRC_DIR}/docker/common/install_polygraphy.sh \
    && ${SRC_DIR}/docker/common/install_mpi4py.sh \
    && python3 ${SRC_DIR}/scripts/build_wheel.py \
        --clean \
        --build_type Release \
        --cuda_architectures "${CUDA_ARCHS}" \
        --build_dir ${CPP_BUILD_DIR} \
        --dist_dir /opt \
        --trt_root /usr/local/tensorrt \
        --extra-cmake-vars "ENABLE_MULTI_DEVICE=OFF" \
        --benchmarks \
        --python_bindings

RUN set -ex \
    # Copy necessary files \
    && cp -r ${SRC_DIR}/cpp/include ${DIST_DIR}/include \
    && cp -r ${SRC_DIR}/benchmarks ${DIST_DIR}/benchmarks \
    && cp ${CPP_BUILD_DIR}/benchmarks/bertBenchmark ${DIST_DIR}/benchmarks/cpp/ \
    && cp ${CPP_BUILD_DIR}/benchmarks/gptManagerBenchmark ${DIST_DIR}/benchmarks/cpp/ \
    && cp ${CPP_BUILD_DIR}/benchmarks/gptSessionBenchmark ${DIST_DIR}/benchmarks/cpp/ \
    && cp -r ${SRC_DIR}/docs ${DIST_DIR}/docs \
    && cp -r ${SRC_DIR}/examples ${DIST_DIR}/examples \
    && chmod -R a+w ${DIST_DIR}/examples \
    \
    # Install TensorRT-LLM package \
    && pip3 install --no-cache-dir --verbose /opt/tensorrt_llm*.whl --extra-index-url https://pypi.nvidia.com \
    \
    # Symlink shared libraries \
    && ln -sv $(python3 -c 'import site; print(f"{site.getsitepackages()[0]}/tensorrt_llm/libs")') ${DIST_DIR}/lib \
    && test -f ${DIST_DIR}/lib/libnvinfer_plugin_tensorrt_llm.so \
    && ln -sv ${DIST_DIR}/lib/libnvinfer_plugin_tensorrt_llm.so ${DIST_DIR}/lib/libnvinfer_plugin_tensorrt_llm.so.9 \
    && echo "${DIST_DIR}/lib" > /etc/ld.so.conf.d/tensorrt_llm.conf \
    && ldconfig -v | grep nvinfer \
    \
    # test \
    && pip3 show tensorrt_llm \
    && python3 -c 'import tensorrt_llm' \
    \
    # Cleanup unnecessary files \
    && rm -rfv \
        ${SRC_DIR} \
        /opt/*.whl \
        ${DIST_DIR}benchmarks/cpp/bertBenchmark.cpp \
        ${DIST_DIR}benchmarks/cpp/gptManagerBenchmark.cpp \
        ${DIST_DIR}benchmarks/cpp/gptSessionBenchmark.cpp \
        ${DIST_DIR}benchmarks/cpp/CMakeLists.txt
```
Instead of suffering, I will then wait a bit for full support on Jetson 👏
+mark
Did you try with TensorRT 10? The official version is with JetPack 6 GA.
I skipped that, @johnnynunez, in favour of Home Assistant & its Voice Assistant Pipeline project on Jetson :)
Besides that, if @dusty-nv says that we need to wait a bit, then there aren't many options :) Maybe you, @johnnynunez, want to try some experiments with it? I'm happy to share my efforts here: https://github.com/ms1design/jetson-containers/tree/feature/tensorrt-llm-container
@johnnynunez that TRT9 installer is for ARM SBSA and won't actually work on Jetson (@ms1design and I tried it)
I did get it building against TRT10, but not working yet - it is WIP with TRT-LLM team. For now use MLC through NanoLLM (https://www.jetson-ai-lab.com/tutorial_nano-llm.html) and when TRT-LLM is working on Jetson, I will add it as a backend to NanoLLM.
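For anyone who wants the interim MLC route mentioned above, here is a rough sketch using jetson-containers; the nano_llm package name, chat entrypoint, and model ID are taken from the jetson-ai-lab tutorial and are assumptions that may change between releases.

```bash
# Hedged sketch of the MLC/NanoLLM route on Jetson via jetson-containers.
# Package name, module entrypoint, and model ID are assumptions; see the
# NanoLLM tutorial at jetson-ai-lab.com for the authoritative steps.
git clone https://github.com/dusty-nv/jetson-containers
bash jetson-containers/install.sh
jetson-containers run $(autotag nano_llm) \
    python3 -m nano_llm.chat --api mlc --model meta-llama/Llama-2-7b-chat-hf
```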
TensorRT 10 GA is out
+mark
TensorRT 10 GA is out

I succeeded in building TensorRT-LLM on JetPack 6.0 GA for Orin. Thanks for sharing the info.

Is it working well? I didn't try it.

Unfortunately no. I tried to build from source for both 0.9.0 and 0.11.0.dev. I can import tensorrt_llm in Python without any errors but fail to convert any TRT models. One of the reasons is that Orin does not support nvidia-smi, so NVML cannot read out system info (see here). I think it's better to wait for the official dockers.
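If it is useful to anyone, a quick hedged check for the NVML limitation described above (assumes the nvidia-ml-py / pynvml Python package is installed):

```bash
# Hedged sketch: Jetson/Orin has no nvidia-smi and no NVML driver, so both
# checks below are expected to fail there, which is what trips up TensorRT-LLM
# code paths that query GPU info via NVML. Assumes nvidia-ml-py is installed.
which nvidia-smi || echo "nvidia-smi not present (expected on Jetson)"
python3 -c "import pynvml; pynvml.nvmlInit(); print('NVML OK')" \
    || echo "NVML unavailable (expected on Jetson/Orin)"
```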
Hi,
The NVIDIA Jetson Orin Nano supports JetPack 6 and has CUDA 12. I would like to know if it supports running TensorRT-LLM on the NVIDIA Jetson Orin Nano Developer Kit.
Hi, is there any progress on it?
assigned to @laikhtewari
Hi, is there any progress on it?
Yes, it is working: https://www.jetson-ai-lab.com/tensorrt_llm.html
Now I hope for parity with the main releases.
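For later readers, the usual entry point from that page is the jetson-containers tooling; a hedged sketch is below (the container package name is an assumption and the linked tutorial remains the authoritative reference):

```bash
# Hedged sketch: run the TensorRT-LLM container for Jetson via dusty-nv's
# jetson-containers tooling; the package name is an assumption and the
# jetson-ai-lab tutorial linked above is the authoritative reference.
jetson-containers run $(autotag tensorrt_llm)
```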
Dear Nvidia Team,
I would like to request support for running TensorRT-LLM on the Nvidia AGX Orin development kit.
Thank you!
Best regards, Shakhizat