TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
Added support for Minitron, see examples/nemotron.
Added a GPT Variant - Granite (20B and 34B), see “GPT Variant - Granite” section in examples/gpt/README.md.
Added support for LLaVA-OneVision model, see “LLaVA, LLaVa-NeXT, LLaVA-OneVision and VILA” section in examples/multimodal/README.md.
Features
Added a trtllm-serve command to launch a FastAPI-based server.
Added support for prompt-lookup speculative decoding, see examples/prompt_lookup/README.md.
Added FP8 support for Nemotron NAS 51B. See examples/nemotron_nas/README.md.
Integrated QServe w4a8 per-group/per-channel quantization, see “w4aINT8 quantization (QServe)” section in examples/llama/README.md.
Added a C++ example for fast logits using the executor API, see “executorExampleFastLogits” section in examples/cpp/executor/README.md.
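As an illustrative sketch of the new trtllm-serve command mentioned above, a server might be launched like this; the engine path, host, and port shown are placeholder assumptions, not verified defaults:

```shell
# Hypothetical invocation of the new trtllm-serve command.
# The model/engine path and the --host/--port options are assumptions
# for illustration; consult the command's --help for the actual flags.
trtllm-serve ./llama-3-8b-engine --host 0.0.0.0 --port 8000
```

Once running, the FastAPI-based server can be queried over HTTP from any standard client.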
API
[BREAKING CHANGE] auto is now the default value for the --dtype option in the quantization and checkpoint conversion scripts.
[BREAKING CHANGE] Deprecated gptManager API path in gptManagerBenchmark.
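To illustrate the --dtype default change above, here is a hedged sketch of a checkpoint conversion run; the script name and directory paths are assumptions modeled on the LLaMA example, not an exact transcript:

```shell
# Omitting --dtype is now equivalent to passing --dtype auto, which
# infers the data type from the source checkpoint instead of using a
# fixed default. Paths below are illustrative assumptions.
python examples/llama/convert_checkpoint.py \
    --model_dir ./llama-3-8b-hf \
    --output_dir ./trt_ckpt \
    --dtype auto
```

Scripts that previously relied on an implicit fixed dtype should pass it explicitly (e.g. --dtype float16) to preserve the old behavior.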
Bug fixes
Fixed an issue where the moeTopK() kernel could not find the correct expert when the number of experts is not a power of two. Thanks @dongjiyingdjy for reporting this bug.
Fixed an assertion failure on crossKvCacheFraction. (#2419)
Fixed an issue when using smoothquant to quantize Qwen2 model. (#2370)
Fixed a PDL typo in docs/source/performance/perf-benchmarking.md, thanks @MARD1NO for pointing it out in #2425.
Infrastructure Changes
The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:24.10-py3.
The base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:24.10-py3.
The dependent TensorRT version is updated to 10.6.
The dependent CUDA version is updated to 12.6.2.
The dependent PyTorch version is updated to 2.5.1.