Update TensorRT-LLM

Model Support
- Support Qwen1.5 MoE A2.7B
- Support Phi 3 vision multimodal
Features
- Encoder-Decoder C++ Runtime TP Support
- Explicit draft tokens inflight batching
- Support local file for calibration
  - Thanks to the contribution from @DreamGenX: https://github.com/NVIDIA/TensorRT-LLM/pull/1762
- Add batched logits post processor
- Add Hopper qgmma kernel to XQA JIT codepath
- Enable TP+EP for MoE
- Add lookahead decoding layer
API
- [BREAKING CHANGE] Setup buffers for explicit draft tokens decoding
- [BREAKING CHANGE] Replace all occurrences of max_output_len with max_seq_len
  - This affects trtllm-build and benchmark-related parameters
- [BREAKING CHANGE] Remove GptSession Python bindings
- [BREAKING CHANGE] Add runtime max batch size to gptManagerBenchmark
- Support remaining executor API options in HLAPI
- Support get_stats and aget_stats in HL Executor while using multi-gpu
- Add iterLatencyMilliSec to stats and iteration log
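
For the max_output_len to max_seq_len change above, here is a minimal sketch of how existing length settings might be translated. It assumes max_seq_len bounds the total of input and output tokens for one sequence. The helper name `migrate_build_args` and the dict layout are hypothetical and are not part of trtllm-build or any TensorRT-LLM API.

```python
def migrate_build_args(old_args: dict) -> dict:
    """Translate legacy length limits to a max_seq_len-based form.

    Hypothetical helper. Assumption: max_seq_len = max_input_len + max_output_len.
    """
    # Copy everything except the removed parameter.
    new_args = {k: v for k, v in old_args.items() if k != "max_output_len"}
    if "max_output_len" in old_args:
        if "max_input_len" not in old_args:
            raise ValueError("max_input_len is required to derive max_seq_len")
        # Total budget for one sequence: prompt tokens plus generated tokens.
        new_args["max_seq_len"] = (
            old_args["max_input_len"] + old_args["max_output_len"]
        )
    return new_args


legacy = {"max_batch_size": 8, "max_input_len": 1024, "max_output_len": 512}
migrated = migrate_build_args(legacy)
print(migrated)
```

Check the actual trtllm-build and benchmark options for the exact parameter names and defaults before relying on this mapping.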
Bug fixes
- Fix checkpoint conversion failure for Mistral 7B v0.3
  - Thanks to the contribution from @Ace-RR: https://github.com/NVIDIA/TensorRT-LLM/issues/1732
- Fix broken inflight batching for fp8 Llama and Mixtral
  - Thanks to the contribution from @bprus: https://github.com/NVIDIA/TensorRT-LLM/issues/1738
- Fix quantize.py failing to export important data to config.json
  - Thanks to the contribution from @janpetrov: https://github.com/NVIDIA/TensorRT-LLM/issues/1676
- Refactor the dynamic decoder params
- Fix long runtime for MoE models when using FAST_BUILD
- Enhance ITensor::slice to handle extreme cases
- HLAPI exits gracefully on exceptions
- Fix NaN appearing in the result under the one-shot all-reduce strategy
- Cache ncclComm_t as weak_ptr and wrap it as shared_ptr to avoid accidental destruction
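
The weak-reference caching idea in the last item can be illustrated with a Python analogue. This is only a sketch of the general pattern (the cache holds weak references, callers hold strong ones), not the C++ code in TensorRT-LLM. `Communicator` and `get_communicator` are hypothetical names.

```python
import gc
import weakref


class Communicator:
    """Stand-in for an expensive shared handle (hypothetical)."""

    def __init__(self, key):
        self.key = key


# The cache holds only weak references, so it never keeps a handle alive itself.
_cache = weakref.WeakValueDictionary()


def get_communicator(key):
    """Return the cached handle for key, creating one if no live handle exists."""
    comm = _cache.get(key)
    if comm is None:
        comm = Communicator(key)
        _cache[key] = comm
    return comm  # the caller's strong reference keeps the handle alive


a = get_communicator("tp0")
b = get_communicator("tp0")
shared = a is b  # both callers receive the same handle
del a, b
gc.collect()
released = "tp0" not in _cache  # the entry goes away once all users drop it
print(shared, released)
```

With this design the cache cannot destroy a handle that is still in use, and it does not extend a handle's lifetime once every user has released it.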
Memory optimization
- Support stream reader to reduce peak memory when using weight streaming
Benchmark
Performance
- Optimize the build time when XQA JIT is enabled
- Reduce the number of streams when using the fused decoder
Infra
Documentation
- Update documents about GEMM plugins
- Polish enc-dec readme to reflect recent changes
- Update Mixtral example docs to include Mixtral-8x22B instructions
- Simplify recurrent gemma README

Update TensorRT-LLM #1793