## TensorRT-LLM Release 0.14.0

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.

### Key Features and Enhancements
- Enhanced the `LLM` class in the LLM API:
  - Added support for calibration with an offline dataset.
  - Added support for Mamba2.
  - Added support for `finish_reason` and `stop_reason` (see the sketch after this list).
- Added FP8 support for CodeLlama.
- Added `__repr__` methods for the `Module` class, thanks to the contribution from @1ytic in #2191.
- Added BFloat16 support for fused gated MLP.
- Updated ReDrafter beam search logic to match Apple ReDrafter v1.1.
- Improved `customAllReduce` performance.
- The draft model can now copy logits directly over MPI to the target model's process in `orchestrator` mode. This fast logits copy reduces the delay between draft token generation and the start of target model inference.
- NVIDIA Volta GPU support is deprecated and will be removed in a future release.
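The new `finish_reason` and `stop_reason` fields are surfaced on each completion returned by the LLM API. The sketch below is illustrative rather than canonical: it assumes a `SamplingParams` with `max_tokens` and `stop` arguments and uses a placeholder model id and stop string, so check the LLM API reference for your installed version.

```python
# Minimal sketch: inspecting finish_reason/stop_reason through the LLM API.
# The model id and stop string are placeholders, not taken from the release notes.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
params = SamplingParams(max_tokens=64, stop=["\n\n"])

for output in llm.generate(["Explain KV caching in one sentence."], params):
    completion = output.outputs[0]
    # finish_reason reports why decoding ended (e.g. a stop condition vs.
    # reaching the length limit); stop_reason identifies which stop string
    # or token id triggered the stop, if any.
    print(completion.text, completion.finish_reason, completion.stop_reason)
```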
### API Changes

- [BREAKING CHANGE] The default `max_batch_size` of the `trtllm-build` command is set to `2048`.
- [BREAKING CHANGE] Removed `builder_opt` from the `BuildConfig` class and the `trtllm-build` command (see the sketch after this list).
- Added logits post-processor support to the `ModelRunnerCpp` class.
- Added an `isParticipant` method to the C++ `Executor` API to check whether the current process is a participant in the executor instance.
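With the default `max_batch_size` now 2048, builds tuned for smaller batches should set it explicitly. Below is a minimal sketch using the Python build API under the assumption of a LLaMA-style converted checkpoint at a placeholder path; the equivalent command-line override is `trtllm-build --max_batch_size`.

```python
# Minimal sketch: pinning max_batch_size now that the default is 2048 and
# builder_opt is gone. The checkpoint and output paths are placeholders.
from tensorrt_llm import BuildConfig, build
from tensorrt_llm.models import LLaMAForCausalLM

config = BuildConfig(max_batch_size=256)  # override the new 2048 default
model = LLaMAForCausalLM.from_checkpoint("/path/to/trtllm_checkpoint")
engine = build(model, config)
engine.save("/path/to/engine_dir")
```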
### Model Updates

- Added support for NemotronNas, see `examples/nemotron_nas/README.md`.
- Added support for Deepseek-v1, see `examples/deepseek_v1/README.md`.
- Added support for Phi-3.5 models, see `examples/phi/README.md` (a quick-try sketch follows this list).
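For a quick first try of a newly supported model, the LLM API can often consume a Hugging Face checkpoint directly. This sketch assumes automatic checkpoint conversion covers the Phi-3.5 architecture, which may not hold in every version; the convert-then-build flow in `examples/phi/README.md` remains the authoritative path.

```python
# Minimal sketch: smoke-testing a newly supported model via the LLM API.
# Assumes automatic HF-checkpoint conversion handles Phi-3.5; if it does
# not in your version, follow the convert/build steps in examples/phi/.
from tensorrt_llm import LLM

llm = LLM(model="microsoft/Phi-3.5-mini-instruct")
out = llm.generate(["Write a haiku about GPUs."])
print(out[0].outputs[0].text)
```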
### Fixed Issues

- Fixed a typo in `tensorrt_llm/models/model_weights_loader.py`, thanks to the contribution from @wangkuiyi in #2152.
- Fixed a duplicated module import in `tensorrt_llm/runtime/generation.py`, thanks to the contribution from @lkm2835 in #2182.
- Enabled `share_embedding` for models that have no `lm_head` in the legacy checkpoint conversion path, thanks to the contribution from @lkm2835 in #2232.
- Fixed a `kv_cache_type` issue in the Python benchmark, thanks to the contribution from @qingquansong in #2219.
- Fixed an issue with SmoothQuant calibration on custom datasets, thanks to the contribution from @Bhuvanesh09 in #2243.
- Fixed an issue with `trtllm-build --fast-build` and fake or random weights, thanks to @ZJLi2013 for flagging it in #2135.
- Fixed missing `use_fused_mlp` when constructing `BuildConfig` from a dict, thanks to @ethnzhng for the fix in #2081.
- Fixed the lookahead batch layout for `numNewTokensCumSum`. (#2263)
### Infrastructure Changes

- The dependent ModelOpt version is updated to v0.17.
### Documentation

- @Sherlock113 added a tech blog to the latest news in #2169, thanks for the contribution.
### Known Issues

- Replit Code is not supported with transformers 4.45+.