NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Nvidia Jetson device Support #62

Open shahizat opened 1 year ago

shahizat commented 1 year ago

Dear Nvidia Team,

I would like to request support for running TensorRT-LLM on the Nvidia AGX Orin development kit.

Thank you!

Best regards, Shakhizat

jdemouth-nvidia commented 1 year ago

Hi @shahizat ,

We do not officially support Orin yet, but we have colleagues (in our automotive division) who are working on enabling TensorRT-LLM on Orin. I'll ask them if they can provide you with feedback. Also, if you want to give it a try, we should be able to help you (as much as other priorities allow).

Thanks, Julien

liuzhili-ya commented 1 year ago

@jdemouth-nvidia TensorRT-LLM uses CUDA 12 by default, but Jetson Orin only supports CUDA 11. Is it possible to run TensorRT-LLM on the NVIDIA AGX Orin development kit?

liuzhili-ya commented 1 year ago

@shahizat have you succeeded in running TensorRT-LLM on the NVIDIA AGX Orin development kit?

liuzhili-ya commented 1 year ago

@jdemouth-nvidia Can I use 'nvcr.io/nvidia/l4t-tensorrt:r8.5.2.2-devel' instead of 'make -C docker release_run' when building TensorRT-LLM?
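For anyone who wants to try that, a minimal sketch of what the substitution could look like, assuming the TensorRT 8.5 libraries in that image are usable at all; the mount path is purely illustrative:

```bash
# Pull the stock L4T TensorRT development image mentioned above
docker pull nvcr.io/nvidia/l4t-tensorrt:r8.5.2.2-devel

# Start it with the NVIDIA runtime and a TensorRT-LLM checkout mounted in
# (assumption: you are in the TensorRT-LLM source directory on the host)
docker run --rm -it --runtime nvidia \
    -v "$(pwd)":/workspace/TensorRT-LLM \
    nvcr.io/nvidia/l4t-tensorrt:r8.5.2.2-devel /bin/bash
```

Whether the build then succeeds is exactly the open question of this thread; later comments note that TensorRT-LLM expects newer TensorRT and CUDA versions than that 8.5.2 image provides.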

jstumpin commented 1 year ago

Looking for Jetson support, I found IGX Orin, yet the article doesn't mention using TensorRT-LLM on Orin's iGPU; instead it focuses on the dGPU.

thunder95 commented 1 year ago

Also need AGX Orin support. @jdemouth-nvidia, any progress on it?

ljayx commented 1 year ago

+mark

@jdemouth-nvidia Do you have any plans or a roadmap?

gehuageyan commented 12 months ago

+mark

ncomly-nvidia commented 11 months ago

Hi all. Jetson / Orin support is still pending & will not be official in the next few releases. Once we have a more concrete timeline we will update here.

whk6688 commented 11 months ago

I can install CUDA 12.2 on Jetson, but I can't install tensorrt-libs (when running the step: python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt). Please help me.

ncomly-nvidia commented 11 months ago

Hi @whk6688, can you please look at #488? As of right now it is not formally supported.

whk6688 commented 10 months ago

When I convert the Llama model, it says: OSError: libcuda.so.1: cannot open shared object file: No such file or directory. My platform is Orin, so I do not have a GPU. Is there any parameter for converting the model without a GPU? Thanks.

mikeshi80 commented 10 months ago

> When I convert the Llama model, it says: OSError: libcuda.so.1: cannot open shared object file: No such file or directory. My platform is Orin, so I do not have a GPU. Is there any parameter for converting the model without a GPU? Thanks.

Try to find libcuda.so.1 on your system, e.g. /usr/lib/libcuda.so.1, and then set the env var:

export LD_PRELOAD='/usr/lib/libcuda.so.1'

Maybe that will work.
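For reference, a minimal sketch of locating the library first; on L4T the driver libraries commonly live under /usr/lib/aarch64-linux-gnu/tegra/, but that path is an assumption, so verify it on your own system:

```bash
# Locate the CUDA driver library (exact path varies by JetPack/L4T release)
find /usr/lib -name 'libcuda.so.1' 2>/dev/null

# Preload whichever path the search returns, e.g. (assumed path shown):
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
```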

whk6688 commented 10 months ago

OK. In fact, I ran: make -C docker release_run.
ERROR: docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'csv' invoking the NVIDIA Container Runtime Hook directly (e.g. specifying the docker --gpus flag) is not supported. Please use the NVIDIA Container Runtime (e.g. specify the --runtime=nvidia flag) instead.: unknown

How do I avoid this error on Orin?
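The error message itself points at the fix: on Jetson the container has to be started with the NVIDIA Container Runtime rather than the --gpus flag. A minimal sketch of two ways to do that; the <image> placeholder stands for whatever image the Makefile builds, and making nvidia the default runtime is common Jetson setup rather than anything specific to TensorRT-LLM:

```bash
# Option 1: launch the container manually with the runtime flag the error asks for
docker run --rm -it --runtime nvidia <image> /bin/bash

# Option 2: make the NVIDIA runtime the Docker default, so tooling that
# hard-codes --gpus (as the release_run target appears to) still gets the hooks.
# Merge this with any existing /etc/docker/daemon.json rather than overwriting it.
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
EOF
sudo systemctl restart docker
```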

WangFengtu1996 commented 10 months ago

Hi all, +mark

Is there any update?

WangFengtu1996 commented 10 months ago

Hi all, is there currently any other solution to accelerate LLM inference besides TensorRT-LLM on the NVIDIA Jetson AGX Orin dev kit? Thanks.

shahizat commented 10 months ago

Hi @WangFengtu1996, I highly recommend checking out the Jetson LLM projects by @dusty-nv, especially the implementation of inference via MLC LLM (much faster than llama.cpp). You can find tutorials here: https://www.jetson-ai-lab.com/tutorial-intro.html

PH8411 commented 9 months ago

> @shahizat have you succeeded in running TensorRT-LLM on the NVIDIA AGX Orin development kit?

Same question.

dusty-nv commented 9 months ago

> @shahizat have you succeeded in running TensorRT-LLM on the NVIDIA AGX Orin development kit?

No, TRT-LLM isn't available for Jetson yet (there are other dependencies on newer versions of TensorRT and such), and we hope to have a preview release closer to the middle of this year. Until then, I concur with @shahizat: use MLC, which is also highly optimized.

When TRT-LLM is released for Jetson, the local_llm package that I use will support it as well. That's the wrapper I use for running optimized LLM APIs in-process (for efficient video streaming for VLMs and handling of large embeddings).

ms1design commented 8 months ago

Oh, that's good news @dusty-nv! I tried to compile tensorrt-llm on my Orin with some success using the jetson-containers stack, based on TensorRT 9.3.0.1 (which installed fine on Orin):

21:53:57 CUDA_ARCHS: 87-real
21:53:57 + TRT_CUDA_VERSION=12.2
21:53:57 + RELEASE_URL_TRT=https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/9.3.0/tensorrt-9.3.0.1.ubuntu-22.04.aarch64-gnu.cuda-12.2.tar.gz
[...]
21:53:57 -- Building for TensorRT version: 9.3.0, library version: 9
[...]
21:53:57 + pip3 show tensorrt_llm
21:53:59 Name: tensorrt_llm
21:53:59 Version: 0.9.0.dev2024031900
21:53:59 Summary: TensorRT-LLM: A TensorRT Toolbox for Large Language Models
21:53:59 Home-page: https://github.com/NVIDIA/TensorRT-LLM
21:53:59 Author: NVIDIA Corporation
21:53:59 Author-email: 
21:53:59 License: Apache License 2.0
21:53:59 Location: /usr/local/lib/python3.10/dist-packages
[...]
21:53:59 [TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024031900

But it was not very stable when testing (probably I just need to research a bit more into how to run proper quantization tests):

17:09:11 [03/21/2024-16:09:11] [TRT-LLM] [V] Validating dimension:batch_size_cache, ranges for this dim are:{(1, 128, 256)}
17:09:11 [03/21/2024-16:09:11] [TRT-LLM] [V] Validating dimension:beam_width, ranges for this dim are:{(1, 1, 1)}
17:09:11 [03/21/2024-16:09:11] [TRT-LLM] [V] Validating dimension:max_seq_len, ranges for this dim are:{(1, 356, 712)}
17:09:12 [03/21/2024-16:09:12] [TRT-LLM] [I] Build TensorRT engine gpt_350m_float16_tp1_rank0.engine
17:09:12 [03/21/2024-16:09:12] [TRT] [E] 9: [standardEngineBuilder.cpp::buildEngine::2265] Error Code 9: Internal Error (Networks with FP8 precision require hardware with FP8 support.)
17:09:12 [03/21/2024-16:09:12] [TRT-LLM] [E] Engine building failed, please check the error log.
17:09:12 Traceback (most recent call last):
17:09:12   File "/opt/tensorrt_llm/benchmarks/python/benchmark.py", line 412, in <module>
17:09:12     main(args)
17:09:12   File "/opt/tensorrt_llm/benchmarks/python/benchmark.py", line 299, in main
17:09:12     benchmarker = GPTBenchmark(args, batch_size_options, in_out_len_options,
17:09:12   File "/opt/tensorrt_llm/benchmarks/python/gpt_benchmark.py", line 102, in __init__
17:09:12     engine_buffer, build_time = build_gpt(args)
17:09:12   File "/opt/tensorrt_llm/benchmarks/python/build.py", line 843, in build_gpt
17:09:12     assert engine is not None, f'Failed to build engine for rank {runtime_rank}'
17:09:12 AssertionError: Failed to build engine for rank 0

With the test case below:

#!/usr/bin/env bash

set -ex

mkdir -p /opt/tensorrt_llm/benchmarks/trt_engines/gpt_350m

echo "testing python benchmark..."

python3 /opt/tensorrt_llm/benchmarks/python/benchmark.py \
    -m gpt_350m \
    --mode plugin \
    --batch_size "1;8;64" \
    --input_output_len "60,20;128,20" \
    --log_level verbose \
    --output_dir /opt/tensorrt_llm/benchmarks/trt_engines/gpt_350m \
    --quantization fp8 \
    --enable_cuda_graph \
    --strongly_typed

echo "python benchmark OK"
tensorrt_llm Dockerfile

```Dockerfile
#---
# name: tensorrt_llm
# group: llm
# config: config.py
# depends: [python, pytorch, optimum, tensorrt, tritonserver]
# test: [test.py, test_python_benchmark.sh, test_cpp_benchmark.sh]
# requires: '>=35'
# notes: The `tensorrt-llm` wheel that's built is saved in the container under `/opt`. https://zhuanlan.zhihu.com/p/663915644
#---
ARG BASE_IMAGE
FROM ${BASE_IMAGE}

ARG TENSORRT_LLM_BRANCH \
    TORCH_CUDA_ARCH_LIST \
    CUDA_ARCHS \
    CUDA_VERSION \
    CUDA_VERSION_MAJOR \
    CUDA_VERSION_MINOR \
    TRT_TARGETARCH="aarch64" \
    SRC_DIR="/tmp/TensorRT-LLM" \
    DIST_DIR="/opt/tensorrt_llm" \
    CPP_BUILD_DIR="/opt/tensorrt_llm/cpp/build"

# Install build dependencies and clone repository
RUN set -ex \
 && git clone --branch=${TENSORRT_LLM_BRANCH} --depth=1 https://github.com/NVIDIA/TensorRT-LLM.git ${SRC_DIR} \
 && git -C ${SRC_DIR} submodule update --init --recursive \
 && git -C ${SRC_DIR} lfs pull

# Apply sed commands
RUN set -ex \
 && sed -i \
        -e 's|^torch.*|torch|g' \
        -e 's|^tensorrt.*|tensorrt|g' \
        -e 's|^transformers.*|transformers|g' \
        -e 's|^sentencepiece.*|sentencepiece|g' \
        -e 's|^diffusers.*|diffusers|g' \
        -e 's|^accelerate.*|accelerate|g' \
        ${SRC_DIR}/requirements.txt \
 && sed -i \
        -e 's|${NCCL_LIB}||g' \
        ${SRC_DIR}/cpp/tensorrt_llm/CMakeLists.txt \
        ${SRC_DIR}/cpp/tensorrt_llm/plugins/CMakeLists.txt \
 && sed -i \
        -e "s|CUDA_VER=\"[^\"]*\"|CUDA_VER=\"$CUDA_VERSION_MAJOR.$CUDA_VERSION_MINOR\"|g" \
        -e 's|^ install_ubuntu_requirements||g' \
        ${SRC_DIR}/docker/common/install_tensorrt.sh \
 && sed -i '96d' ${SRC_DIR}/docker/common/install_tensorrt.sh \
    \
    # Install TensorRT 9.x
 && chmod +x ${SRC_DIR}/docker/common/install_*.sh \
 && ${SRC_DIR}/docker/common/install_tensorrt.sh

ENV LD_LIBRARY_PATH=/usr/local/tensorrt/lib:${LD_LIBRARY_PATH}

RUN set -ex \
    # Build TensorRT-LLM
 && echo "CUDA_VERSION: ${CUDA_VERSION}" \
 && echo "CUDA_ARCHS: ${CUDA_ARCHS}" \
 && ${SRC_DIR}/docker/common/install_polygraphy.sh \
 && ${SRC_DIR}/docker/common/install_mpi4py.sh \
 && python3 ${SRC_DIR}/scripts/build_wheel.py \
        --clean \
        --build_type Release \
        --cuda_architectures "${CUDA_ARCHS}" \
        --build_dir ${CPP_BUILD_DIR} \
        --dist_dir /opt \
        --trt_root /usr/local/tensorrt \
        --extra-cmake-vars "ENABLE_MULTI_DEVICE=OFF" \
        --benchmarks \
        --python_bindings

RUN set -ex \
    # Copy necessary files
 && cp -r ${SRC_DIR}/cpp/include ${DIST_DIR}/include \
 && cp -r ${SRC_DIR}/benchmarks ${DIST_DIR}/benchmarks \
 && cp ${CPP_BUILD_DIR}/benchmarks/bertBenchmark ${DIST_DIR}/benchmarks/cpp/ \
 && cp ${CPP_BUILD_DIR}/benchmarks/gptManagerBenchmark ${DIST_DIR}/benchmarks/cpp/ \
 && cp ${CPP_BUILD_DIR}/benchmarks/gptSessionBenchmark ${DIST_DIR}/benchmarks/cpp/ \
 && cp -r ${SRC_DIR}/docs ${DIST_DIR}/docs \
 && cp -r ${SRC_DIR}/examples ${DIST_DIR}/examples \
 && chmod -R a+w ${DIST_DIR}/examples \
    \
    # Install TensorRT-LLM package
 && pip3 install --no-cache-dir --verbose /opt/tensorrt_llm*.whl --extra-index-url https://pypi.nvidia.com \
    \
    # Symlink shared libraries
 && ln -sv $(python3 -c 'import site; print(f"{site.getsitepackages()[0]}/tensorrt_llm/libs")') ${DIST_DIR}/lib \
 && test -f ${DIST_DIR}/lib/libnvinfer_plugin_tensorrt_llm.so \
 && ln -sv ${DIST_DIR}/lib/libnvinfer_plugin_tensorrt_llm.so ${DIST_DIR}/lib/libnvinfer_plugin_tensorrt_llm.so.9 \
 && echo "${DIST_DIR}/lib" > /etc/ld.so.conf.d/tensorrt_llm.conf \
 && ldconfig -v | grep nvinfer \
    \
    # test
 && pip3 show tensorrt_llm \
 && python3 -c 'import tensorrt_llm' \
    \
    # Cleanup unnecessary files
 && rm -rfv \
        ${SRC_DIR} \
        /opt/*.whl \
        ${DIST_DIR}benchmarks/cpp/bertBenchmark.cpp \
        ${DIST_DIR}benchmarks/cpp/gptManagerBenchmark.cpp \
        ${DIST_DIR}benchmarks/cpp/gptSessionBenchmark.cpp \
        ${DIST_DIR}benchmarks/cpp/CMakeLists.txt
```

Instead of suffering, I will wait a bit for full support on Jetson 👏

ZJUwangdaiyin commented 7 months ago

+mark

johnnynunez commented 7 months ago

> (quoting @ms1design's build report above)

Did you try with TensorRT 10? The official version ships with JetPack 6 GA.

ms1design commented 7 months ago

I skipped that, @johnnynunez, in favour of Home Assistant and its Voice Assistant Pipeline project on Jetson :)

Besides that, if @dusty-nv says that we need to wait a bit, then there aren't many options :) Maybe you want to try some experiments with it, @johnnynunez? I'm happy to share my efforts here: https://github.com/ms1design/jetson-containers/tree/feature/tensorrt-llm-container

dusty-nv commented 7 months ago

@johnnynunez that TRT9 installer is for ARM SBSA and won't actually work on Jetson (@ms1design and I tried it).

I did get it building against TRT10, but it's not working yet - it is a WIP with the TRT-LLM team. For now, use MLC through NanoLLM (https://www.jetson-ai-lab.com/tutorial_nano-llm.html), and when TRT-LLM is working on Jetson, I will add it as a backend to NanoLLM.
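For readers following along, a rough sketch of what that looks like with the jetson-containers tooling; the chat-module flags below follow the NanoLLM tutorial linked above and are assumptions that may differ between releases, and the model name is only an example:

```bash
# Run the NanoLLM container (autotag picks a build matching your JetPack version)
# and start an interactive chat using the MLC backend.
jetson-containers run $(autotag nano_llm) \
    python3 -m nano_llm.chat --api mlc \
        --model meta-llama/Llama-2-7b-chat-hf
```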

johnnynunez commented 7 months ago

TensorRT 10 GA is out

liborui commented 5 months ago

+mark

FenardH commented 4 months ago

> TensorRT 10 GA is out

I succeeded in building TensorRT-LLM on JetPack 6.0 GA for Orin. Thanks for sharing the info.

johnnynunez commented 4 months ago

> I succeeded in building TensorRT-LLM on JetPack 6.0 GA for Orin. Thanks for sharing the info.

Is it working well? I didn't try it.

FenardH commented 4 months ago

> Is it working well? I didn't try it.

Unfortunately, no. I tried to build from source for both 0.9.0 and 0.11.0dev. I can import tensorrt_llm in Python without any errors, but I fail to convert any TRT models. One of the reasons is that Orin does not support nvidia-smi, so NVML cannot read out system info (see here). I think it's better to wait for official Docker images.

krishnarajk commented 4 months ago

Hi,

The NVIDIA Jetson Orin Nano supports JetPack 6 and has CUDA 12. I would like to know if running TensorRT-LLM is supported on the NVIDIA Jetson Orin Nano Developer Kit.

bigbighuang commented 3 weeks ago

Hi, is there any progress on it?

nv-guomingz commented 1 week ago

assigned to @laikhtewari

johnnynunez commented 1 week ago

> Hi, is there any progress on it?

Yes, it is working: https://www.jetson-ai-lab.com/tensorrt_llm.html

Now I hope for parity with the main releases.