NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Nvidia Jetson device Support #62

Open shahizat opened 11 months ago

shahizat commented 11 months ago

Dear Nvidia Team,

I would like to request support for running TensorRT-LLM on the Nvidia AGX Orin development kit.

Thank you!

Best regards, Shakhizat

jdemouth-nvidia commented 11 months ago

Hi @shahizat ,

We do not officially support Orin yet but we have colleagues (in our automotive division) who are working on enabling TensorRT-LLM on Orin. I’ll ask them if they can provide you with feedback. Also, if you want to give it a try, we should be able to help you (as much as other priorities allow us).

Thanks, Julien

liuzhili-ya commented 10 months ago

@jdemouth-nvidia TensorRT-LLM uses CUDA 12 by default, but Jetson Orin only supports CUDA 11. Is it possible to run TensorRT-LLM on the Nvidia AGX Orin development kit?
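
(For anyone checking their own board, a quick way to see which CUDA toolkit and L4T release are installed; this is only a sketch assuming a standard JetPack image:)

```bash
# CUDA toolkit version shipped with the installed JetPack
/usr/local/cuda/bin/nvcc --version

# L4T / JetPack release string
cat /etc/nv_tegra_release
```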

liuzhili-ya commented 10 months ago

@shahizat have you succeeded in running TensorRT-LLM on the Nvidia AGX Orin development kit?

liuzhili-ya commented 10 months ago

@jdemouth-nvidia can I use 'nvcr.io/nvidia/l4t-tensorrt:r8.5.2.2-devel' instead of 'make -C docker release_run' when building TensorRT-LLM?
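
For reference, a minimal sketch of what that would look like (the mount path is illustrative, and that image ships TensorRT 8.5, which may be older than what TensorRT-LLM expects, so this is untested here):

```bash
# Pull the L4T TensorRT devel image and start it with the NVIDIA runtime
# (required on Jetson; the plain --gpus flag is not supported by the L4T hook).
docker pull nvcr.io/nvidia/l4t-tensorrt:r8.5.2.2-devel

# Mount a local TensorRT-LLM checkout (path is illustrative) and build inside the container.
docker run --runtime nvidia -it --rm \
    -v "$(pwd)/TensorRT-LLM:/workspace/TensorRT-LLM" \
    nvcr.io/nvidia/l4t-tensorrt:r8.5.2.2-devel
```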

jstumpin commented 10 months ago

Looking for Jetson support, I found IGX Orin, yet the article doesn't mention using TensorRT-LLM on Orin's iGPU; instead it focuses on the dGPU.

thunder95 commented 10 months ago

Also need AGX Orin support. @jdemouth-nvidia, any progress on it?

ljayx commented 10 months ago

+mark

@jdemouth-nvidia Do you have any plans or roadmap?

gehuageyan commented 10 months ago

+mark

ncomly-nvidia commented 10 months ago

Hi all. Jetson / Orin support is still pending & will not be official in the next few releases. Once we have a more concrete timeline we will update here.

whk6688 commented 9 months ago

I can install cuda-12.2 on Jetson, but I can't install tensorrt-libs (when running the step: python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt). Please help me.

ncomly-nvidia commented 9 months ago

Hi @whk6688, can you please look at #488? As of right now it is not formally supported.

whk6688 commented 9 months ago

When I convert the llama model, it says: OSError: libcuda.so.1: cannot open shared object file: No such file or directory. My platform is Orin, so I do not have a GPU. Is there any parameter for converting the model without a GPU? Thanks.

mikeshi80 commented 8 months ago

When I convert the llama model, it says: OSError: libcuda.so.1: cannot open shared object file: No such file or directory. My platform is Orin, so I do not have a GPU. Is there any parameter for converting the model without a GPU? Thanks.

Try to find libcuda.so.1 on your system, e.g. /usr/lib/libcuda.so.1, and then set the env var:

export LD_PRELOAD='/usr/lib/libcuda.so.1'

Maybe that will work.
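
If that path doesn't exist, a quick way to locate the library first (a sketch; on JetPack the driver library usually lives under the Tegra directory):

```bash
# Ask the dynamic linker cache where libcuda lives
ldconfig -p | grep libcuda

# On JetPack it is typically under the Tegra directory, e.g.:
export LD_PRELOAD=/usr/lib/aarch64-linux-gnu/tegra/libcuda.so.1
```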

whk6688 commented 8 months ago

OK. In fact, I ran: make -C docker release_run.
ERROR: docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'csv' invoking the NVIDIA Container Runtime Hook directly (e.g. specifying the docker --gpus flag) is not supported. Please use the NVIDIA Container Runtime (e.g. specify the --runtime=nvidia flag) instead.: unknown

On Orin, how can I avoid this error?
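
The error message itself points at the workaround: on Jetson the container has to be started with the NVIDIA runtime rather than the --gpus flag. Two common ways to do that (a sketch, assuming a standard JetPack Docker setup):

```bash
# Option 1: pass the runtime explicitly when starting the container
docker run --runtime nvidia -it --rm <image>

# Option 2: make nvidia the default runtime, so tooling that passes --gpus
# (like `make -C docker release_run`) no longer trips the hook.
# Add this to /etc/docker/daemon.json, then restart Docker:
#   {
#     "runtimes": { "nvidia": { "path": "nvidia-container-runtime", "runtimeArgs": [] } },
#     "default-runtime": "nvidia"
#   }
sudo systemctl restart docker
```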

WangFengtu1996 commented 8 months ago

Hi all, +mark

Is there any update?

WangFengtu1996 commented 8 months ago

Hi all, is there currently any other solution to accelerate LLM inference on the Nvidia Jetson AGX Orin dev kit besides TensorRT-LLM? Thanks.

shahizat commented 8 months ago

Hi @WangFengtu1996, I highly recommend checking out the Jetson LLM projects by @dusty-nv, especially the implementation of inference via MLC LLM (much faster than llama.cpp). You can find tutorials here: https://www.jetson-ai-lab.com/tutorial-intro.html
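
The MLC route there is driven through dusty-nv's jetson-containers tooling; roughly something like the following (the container name and invocation are from the jetson-containers README at the time and may have changed, so check the tutorial for the current form):

```bash
# Clone the container tooling and launch the MLC container on the Jetson
git clone https://github.com/dusty-nv/jetson-containers
cd jetson-containers
./run.sh $(./autotag mlc)
```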

PH8411 commented 8 months ago

@shahizat have you succeeded in running TensorRT-LLM on the Nvidia AGX Orin development kit? Same question.

dusty-nv commented 8 months ago

@shahizat have you succeeded in running TensorRT-LLM on the Nvidia AGX Orin development kit?

No, TRT-LLM isn't available for Jetson yet (there are other dependencies on newer versions of TensorRT and such), and we hope to have a preview release closer to the middle of this year. Until then, I concur with @shahizat to use MLC, which is also highly optimized:

When TRT-LLM is released for Jetson, the local_llm package that I use will support it as well. That's the wrapper I use for running optimized LLM APIs in-process (for efficient video streaming for VLMs and handling of large embeddings).

ms1design commented 6 months ago

Oh, that's good news @dusty-nv! I tried to compile tensorrt-llm on my Orin with some success using the jetson-containers stack, based on TensorRT 9.3.0.1 (which installed fine on Orin):

21:53:57 CUDA_ARCHS: 87-real
21:53:57 + TRT_CUDA_VERSION=12.2
21:53:57 + RELEASE_URL_TRT=https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/9.3.0/tensorrt-9.3.0.1.ubuntu-22.04.aarch64-gnu.cuda-12.2.tar.gz
[...]
21:53:57 -- Building for TensorRT version: 9.3.0, library version: 9
[...]
21:53:57 + pip3 show tensorrt_llm
21:53:59 Name: tensorrt_llm
21:53:59 Version: 0.9.0.dev2024031900
21:53:59 Summary: TensorRT-LLM: A TensorRT Toolbox for Large Language Models
21:53:59 Home-page: https://github.com/NVIDIA/TensorRT-LLM
21:53:59 Author: NVIDIA Corporation
21:53:59 Author-email: 
21:53:59 License: Apache License 2.0
21:53:59 Location: /usr/local/lib/python3.10/dist-packages
[...]
21:53:59 [TensorRT-LLM] TensorRT-LLM version: 0.9.0.dev2024031900

But it was not very stable when testing (I probably just need to research a bit more how to run proper quantization tests):

17:09:11 [03/21/2024-16:09:11] [TRT-LLM] [V] Validating dimension:batch_size_cache, ranges for this dim are:{(1, 128, 256)}
17:09:11 [03/21/2024-16:09:11] [TRT-LLM] [V] Validating dimension:beam_width, ranges for this dim are:{(1, 1, 1)}
17:09:11 [03/21/2024-16:09:11] [TRT-LLM] [V] Validating dimension:max_seq_len, ranges for this dim are:{(1, 356, 712)}
17:09:12 [03/21/2024-16:09:12] [TRT-LLM] [I] Build TensorRT engine gpt_350m_float16_tp1_rank0.engine
17:09:12 [03/21/2024-16:09:12] [TRT] [E] 9: [standardEngineBuilder.cpp::buildEngine::2265] Error Code 9: Internal Error (Networks with FP8 precision require hardware with FP8 support.)
17:09:12 [03/21/2024-16:09:12] [TRT-LLM] [E] Engine building failed, please check the error log.
17:09:12 Traceback (most recent call last):
17:09:12   File "/opt/tensorrt_llm/benchmarks/python/benchmark.py", line 412, in <module>
17:09:12     main(args)
17:09:12   File "/opt/tensorrt_llm/benchmarks/python/benchmark.py", line 299, in main
17:09:12     benchmarker = GPTBenchmark(args, batch_size_options, in_out_len_options,
17:09:12   File "/opt/tensorrt_llm/benchmarks/python/gpt_benchmark.py", line 102, in __init__
17:09:12     engine_buffer, build_time = build_gpt(args)
17:09:12   File "/opt/tensorrt_llm/benchmarks/python/build.py", line 843, in build_gpt
17:09:12     assert engine is not None, f'Failed to build engine for rank {runtime_rank}'
17:09:12 AssertionError: Failed to build engine for rank 0

With the test case below:

#!/usr/bin/env bash

set -ex

mkdir -p /opt/tensorrt_llm/benchmarks/trt_engines/gpt_350m

echo "testing python benchmark..."

python3 /opt/tensorrt_llm/benchmarks/python/benchmark.py \
    -m gpt_350m \
    --mode plugin \
    --batch_size "1;8;64" \
    --input_output_len "60,20;128,20" \
    --log_level verbose \
    --output_dir /opt/tensorrt_llm/benchmarks/trt_engines/gpt_350m \
    --quantization fp8 \
    --enable_cuda_graph \
    --strongly_typed

echo "python benchmark OK"
tensorrt_llm Dockerfile:

```Dockerfile
#---
# name: tensorrt_llm
# group: llm
# config: config.py
# depends: [python, pytorch, optimum, tensorrt, tritonserver]
# test: [test.py, test_python_benchmark.sh, test_cpp_benchmark.sh]
# requires: '>=35'
# notes: The `tensorrt-llm` wheel that's built is saved in the container under `/opt`. https://zhuanlan.zhihu.com/p/663915644
#---
ARG BASE_IMAGE
FROM ${BASE_IMAGE}

ARG TENSORRT_LLM_BRANCH \
    TORCH_CUDA_ARCH_LIST \
    CUDA_ARCHS \
    CUDA_VERSION \
    CUDA_VERSION_MAJOR \
    CUDA_VERSION_MINOR \
    TRT_TARGETARCH="aarch64" \
    SRC_DIR="/tmp/TensorRT-LLM" \
    DIST_DIR="/opt/tensorrt_llm" \
    CPP_BUILD_DIR="/opt/tensorrt_llm/cpp/build"

# Install build dependencies and clone repository
RUN set -ex \
 && git clone --branch=${TENSORRT_LLM_BRANCH} --depth=1 https://github.com/NVIDIA/TensorRT-LLM.git ${SRC_DIR} \
 && git -C ${SRC_DIR} submodule update --init --recursive \
 && git -C ${SRC_DIR} lfs pull

# Apply sed commands
RUN set -ex \
 && sed -i \
        -e 's|^torch.*|torch|g' \
        -e 's|^tensorrt.*|tensorrt|g' \
        -e 's|^transformers.*|transformers|g' \
        -e 's|^sentencepiece.*|sentencepiece|g' \
        -e 's|^diffusers.*|diffusers|g' \
        -e 's|^accelerate.*|accelerate|g' \
        ${SRC_DIR}/requirements.txt \
 && sed -i \
        -e 's|${NCCL_LIB}||g' \
        ${SRC_DIR}/cpp/tensorrt_llm/CMakeLists.txt \
        ${SRC_DIR}/cpp/tensorrt_llm/plugins/CMakeLists.txt \
 && sed -i \
        -e "s|CUDA_VER=\"[^\"]*\"|CUDA_VER=\"$CUDA_VERSION_MAJOR.$CUDA_VERSION_MINOR\"|g" \
        -e 's|^ install_ubuntu_requirements||g' \
        ${SRC_DIR}/docker/common/install_tensorrt.sh \
 && sed -i '96d' ${SRC_DIR}/docker/common/install_tensorrt.sh \
    \
    # Install TensorRT 9.x
 && chmod +x ${SRC_DIR}/docker/common/install_*.sh \
 && ${SRC_DIR}/docker/common/install_tensorrt.sh

ENV LD_LIBRARY_PATH=/usr/local/tensorrt/lib:${LD_LIBRARY_PATH}

RUN set -ex \
    # Build TensorRT-LLM
 && echo "CUDA_VERSION: ${CUDA_VERSION}" \
 && echo "CUDA_ARCHS: ${CUDA_ARCHS}" \
 && ${SRC_DIR}/docker/common/install_polygraphy.sh \
 && ${SRC_DIR}/docker/common/install_mpi4py.sh \
 && python3 ${SRC_DIR}/scripts/build_wheel.py \
        --clean \
        --build_type Release \
        --cuda_architectures "${CUDA_ARCHS}" \
        --build_dir ${CPP_BUILD_DIR} \
        --dist_dir /opt \
        --trt_root /usr/local/tensorrt \
        --extra-cmake-vars "ENABLE_MULTI_DEVICE=OFF" \
        --benchmarks \
        --python_bindings

RUN set -ex \
    # Copy necessary files
 && cp -r ${SRC_DIR}/cpp/include ${DIST_DIR}/include \
 && cp -r ${SRC_DIR}/benchmarks ${DIST_DIR}/benchmarks \
 && cp ${CPP_BUILD_DIR}/benchmarks/bertBenchmark ${DIST_DIR}/benchmarks/cpp/ \
 && cp ${CPP_BUILD_DIR}/benchmarks/gptManagerBenchmark ${DIST_DIR}/benchmarks/cpp/ \
 && cp ${CPP_BUILD_DIR}/benchmarks/gptSessionBenchmark ${DIST_DIR}/benchmarks/cpp/ \
 && cp -r ${SRC_DIR}/docs ${DIST_DIR}/docs \
 && cp -r ${SRC_DIR}/examples ${DIST_DIR}/examples \
 && chmod -R a+w ${DIST_DIR}/examples \
    \
    # Install TensorRT-LLM package
 && pip3 install --no-cache-dir --verbose /opt/tensorrt_llm*.whl --extra-index-url https://pypi.nvidia.com \
    \
    # Symlink shared libraries
 && ln -sv $(python3 -c 'import site; print(f"{site.getsitepackages()[0]}/tensorrt_llm/libs")') ${DIST_DIR}/lib \
 && test -f ${DIST_DIR}/lib/libnvinfer_plugin_tensorrt_llm.so \
 && ln -sv ${DIST_DIR}/lib/libnvinfer_plugin_tensorrt_llm.so ${DIST_DIR}/lib/libnvinfer_plugin_tensorrt_llm.so.9 \
 && echo "${DIST_DIR}/lib" > /etc/ld.so.conf.d/tensorrt_llm.conf \
 && ldconfig -v | grep nvinfer \
    \
    # Test
 && pip3 show tensorrt_llm \
 && python3 -c 'import tensorrt_llm' \
    \
    # Cleanup unnecessary files
 && rm -rfv \
        ${SRC_DIR} \
        /opt/*.whl \
        ${DIST_DIR}/benchmarks/cpp/bertBenchmark.cpp \
        ${DIST_DIR}/benchmarks/cpp/gptManagerBenchmark.cpp \
        ${DIST_DIR}/benchmarks/cpp/gptSessionBenchmark.cpp \
        ${DIST_DIR}/benchmarks/cpp/CMakeLists.txt
```

Instead of suffering, I will wait a bit for full support on Jetson 👏

ZJUwangdaiyin commented 5 months ago

+mark

johnnynunez commented 5 months ago

Oh, that's good news @dusty-nv! I tried to compile tensorrt-llm on my Orin with some success using the jetson-containers stack, based on TensorRT 9.3.0.1 (which installed fine on Orin): [...] Instead of suffering, I will wait a bit for full support on Jetson 👏

Did you try with TensorRT 10? The official version is with JetPack 6 GA.

ms1design commented 5 months ago

I skipped that, @johnnynunez, in favour of Home Assistant & its Voice Assistant Pipeline project on Jetson :)

Besides that, if @dusty-nv says we need to wait a bit, then there aren't many options :) Maybe you, @johnnynunez, want to try some experiments with it? I'm happy to share my efforts here: https://github.com/ms1design/jetson-containers/tree/feature/tensorrt-llm-container

dusty-nv commented 5 months ago

@johnnynunez that TRT9 installer is for ARM SBSA and won't actually work on Jetson (@ms1design and I tried it).

I did get it building against TRT10, but it's not working yet - it is a WIP with the TRT-LLM team. For now, use MLC through NanoLLM (https://www.jetson-ai-lab.com/tutorial_nano-llm.html), and when TRT-LLM is working on Jetson, I will add it as a backend to NanoLLM.

johnnynunez commented 5 months ago

TensorRT 10 GA is out

liborui commented 3 months ago

+mark

FenardH commented 3 months ago

TensorRT 10 GA is out

I succeeded in building TensorRT-LLM on JP6.0 GA for Orin. Thanks for sharing the info.

johnnynunez commented 3 months ago

TensorRT 10 GA is out

I succeeded in building TensorRT-LLM on JP6.0 GA for Orin. Thanks for sharing the info.

Is it working well? I didn't try it.

FenardH commented 3 months ago

TensorRT 10 GA is out

I succeeded in building TensorRT-LLM on JP6.0 GA for Orin. Thanks for sharing the info.

Is it working well? I didn't try it.

Unfortunately no. I tried to build from source for both 0.9.0 and 0.11.0dev. I can import tensorrt_llm in Python without any errors, but I fail to convert any TRT models. One of the reasons is that Orin does not support nvidia-smi, so NVML cannot read out system info (see here). I think it's better to wait for official Docker images.
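
A quick way to confirm it is the NVML path that breaks (a sketch; as noted above, tensorrt_llm reads device info through NVML via pynvml, and on Jetson the init call itself fails because there is no NVML library on Tegra):

```bash
# On Jetson this typically fails with "NVML Shared Library Not Found",
# matching the model conversion failure described above.
python3 -c "import pynvml; pynvml.nvmlInit(); print(pynvml.nvmlDeviceGetCount())"
```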

krishnarajk commented 2 months ago

Hi,

The NVIDIA Jetson Orin Nano supports JetPack 6 and has CUDA 12. I would like to know whether running TensorRT-LLM on the NVIDIA Jetson Orin Nano Developer Kit is supported.