Update TensorRT-LLM

NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.

Apache License 2.0

API
- [BREAKING CHANGE] Remove the unnecessary --weight_only_precision argument from the trtllm-build command.
Bug fixes
- Raise an error when autopp detects an unsupported quant plugin (#1626).
- Fix shared_embedding_table not being set when loading Gemma (#1799); thanks to @mfuntowicz for the contribution.
Benchmark
- Add Medusa choices to the gptManagerBenchmark.
Infra
- Base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.05-py3.
- Base Docker image for TensorRT-LLM backend is updated to nvcr.io/nvidia/tritonserver:24.05-py3.
- The dependent TensorRT version is updated to 10.1.
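
Scripts that still pass --weight_only_precision to trtllm-build will need updating after the breaking change listed under API. The following is a minimal sketch of one way to strip the flag from an existing argument list, not part of TensorRT-LLM itself. It assumes the flag was passed either as `--weight_only_precision VALUE` or `--weight_only_precision=VALUE`; the helper name `strip_removed_flag`, the example paths, and the example value `int8` are hypothetical and chosen only for illustration.

```python
from typing import List

# Flag named in the changelog entry above.
REMOVED_FLAG = "--weight_only_precision"


def strip_removed_flag(argv: List[str], flag: str = REMOVED_FLAG) -> List[str]:
    """Return a copy of argv without `flag` and its value.

    Assumption: the flag takes at most one value, given either as the
    next token or after an '=' sign.
    """
    cleaned: List[str] = []
    skip_next = False
    for arg in argv:
        if skip_next:
            skip_next = False
            # Only drop the following token if it looks like a value,
            # not like another option.
            if not arg.startswith("--"):
                continue
        if arg == flag:
            skip_next = True
            continue
        if arg.startswith(flag + "="):
            continue
        cleaned.append(arg)
    return cleaned


# Hypothetical invocation taken from an older build script.
old_cmd = [
    "trtllm-build",
    "--checkpoint_dir", "./ckpt",
    "--weight_only_precision", "int8",
    "--output_dir", "./engine",
]
new_cmd = strip_removed_flag(old_cmd)
print(" ".join(new_cmd))
```

The helper leaves every other argument untouched and in its original order, so it can be applied to a stored command before handing it to `subprocess.run` or a shell.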

Update TensorRT-LLM #1835