TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
Added support for Minitron, see examples/nemotron.
Added a GPT Variant - Granite (20B and 34B), see “GPT Variant - Granite” section in examples/gpt/README.md.
Added support for LLaVA-OneVision model, see “LLaVA, LLaVa-NeXT, LLaVA-OneVision and VILA” section in examples/multimodal/README.md.
Features
Added a trtllm-serve command to launch a FastAPI-based server.
Added support for prompt-lookup speculative decoding, see examples/prompt_lookup/README.md.
Added FP8 support for Nemotron NAS 51B. See examples/nemotron_nas/README.md.
Integrated QServe w4a8 per-group/per-channel quantization, see “w4aINT8 quantization (QServe)” section in examples/llama/README.md.
Added a C++ example for fast logits using the executor API, see “executorExampleFastLogits” section in examples/cpp/executor/README.md.
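As an illustrative sketch of the new trtllm-serve command mentioned above, a server might be launched like this; the engine path, host, and port shown are placeholder assumptions, not verified defaults:

```shell
# Hypothetical invocation of the new trtllm-serve command.
# The model/engine path and the --host/--port options are assumptions
# for illustration; consult the command's --help for the actual flags.
trtllm-serve ./llama-3-8b-engine --host 0.0.0.0 --port 8000
```

Once running, the FastAPI-based server can be queried over HTTP from any standard client.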
API
[BREAKING CHANGE] auto is now the default value for the --dtype option in the quantization and checkpoint conversion scripts.
[BREAKING CHANGE] Deprecated gptManager API path in gptManagerBenchmark.
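To illustrate the --dtype default change above, here is a hedged sketch of a checkpoint conversion run; the script name and directory paths are assumptions modeled on the LLaMA example, not an exact transcript:

```shell
# Omitting --dtype is now equivalent to passing --dtype auto, which
# infers the data type from the source checkpoint instead of using a
# fixed default. Paths below are illustrative assumptions.
python examples/llama/convert_checkpoint.py \
    --model_dir ./llama-3-8b-hf \
    --output_dir ./trt_ckpt \
    --dtype auto
```

Scripts that previously relied on an implicit fixed dtype should pass it explicitly (e.g. --dtype float16) to preserve the old behavior.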
Bug fixes
Fixed an issue where the moeTopK() kernel could not find the correct expert when the number of experts is not a power of two. Thanks @dongjiyingdjy for reporting this bug.
Fixed an assertion failure on crossKvCacheFraction. (#2419)
Fixed an issue when using smoothquant to quantize Qwen2 model. (#2370)
Fixed a PDL typo in docs/source/performance/perf-benchmarking.md, thanks @MARD1NO for pointing it out in #2425.
Infrastructure Changes
The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.10-py3.
The base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:24.10-py3.
The dependent TensorRT version is updated to 10.6.
The dependent CUDA version is updated to 12.6.2.
The dependent PyTorch version is updated to 2.5.1.