System Info
GPU: A100-PCIe-40GB; TensorRT-LLM version: 0.12.0
Who can help?
@sunnyqgg
Information
Tasks
[x] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)

Reproduction
The inference results obtained with trt-llm-0.12.0 at fp16 and fp32 precision differ significantly from those of the original Qwen-VL implementation (https://github.com/QwenLM/Qwen-VL).

trt-llm-0.12.0 with fp16:
trt-llm-0.12.0 with fp32:
https://github.com/QwenLM/Qwen-VL:
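For comparison, the reference outputs can be regenerated with the stock Hugging Face pipeline from the Qwen-VL README. This is a minimal sketch, not the exact script behind the results above; the image path and prompt are placeholders, and a fixed seed is set so reruns are repeatable:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.manual_seed(1234)  # fixed seed so the reference run is repeatable

# Load the original Qwen-VL checkpoint; trust_remote_code pulls in the
# custom multimodal tokenizer/model code shipped with the checkpoint.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL", device_map="cuda", trust_remote_code=True
).eval()

# Placeholder image and prompt -- substitute the actual inputs used above.
query = tokenizer.from_list_format([
    {"image": "demo.jpeg"},
    {"text": "Generate the caption in English with grounding:"},
])
inputs = tokenizer(query, return_tensors="pt").to(model.device)

pred = model.generate(**inputs)
print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=False))
```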
Expected behavior
The inference results should be consistent with the reference implementation at https://github.com/QwenLM/Qwen-VL.
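One caveat worth stating (an assumption on my part, not something from the report above): if the checkpoint's generation config enables sampling, outputs vary run to run even within one framework, so the comparison is only meaningful with decoding pinned to greedy on both sides. A sketch of such a check, reusing `model`, `tokenizer`, and `inputs` from the snippet above, with a hypothetical `trtllm_text` placeholder for the engine's decoded output:

```python
# Force greedy decoding so the Hugging Face reference is deterministic and
# can be compared token-for-token with the TensorRT-LLM engine output.
pred = model.generate(
    **inputs,
    do_sample=False,     # disable sampling -> deterministic output
    num_beams=1,         # plain greedy search
    max_new_tokens=128,  # arbitrary cap; match it on the TRT-LLM side
)
reference_text = tokenizer.decode(pred.cpu()[0], skip_special_tokens=True)

# Hypothetical placeholder for the decoded output of the trt-llm-0.12.0
# engine run with the same image, prompt, and greedy decoding settings.
trtllm_text = "..."
print("consistent:", reference_text == trtllm_text)
```

Even under greedy decoding, fp16 kernels can legitimately flip an occasional token relative to fp32, but the wholesale divergence shown above would not be expected.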
actual behavior
TensorRT-LLM does not reproduce the reference behavior: both the fp16 and fp32 engines produce outputs that deviate noticeably from the original Qwen-VL results.
additional notes
None.