NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Qwen-VL-Chat has an error #2206

Closed: xiangxinhello closed this issue 1 month ago

xiangxinhello commented 1 month ago

System Info

GPU: A100-PCIe-40GB
TensorRT-LLM version: 0.12.0

Who can help?

@sunnyqgg


Reproduction

The outputs from trt-llm-0.12.0 with fp16 and fp32 precision differ significantly from each other and from the reference Qwen-VL implementation at https://github.com/QwenLM/Qwen-VL.

[Image: output from trt-llm-0.12.0 with fp16]

[Image: output from trt-llm-0.12.0 with fp32]

[Image: output from the reference implementation at https://github.com/QwenLM/Qwen-VL]
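For reference, a minimal sketch of running the two engines side by side with the Python runtime, assuming LLM engines were built from checkpoints converted with `--dtype float16` and `--dtype float32`. The engine directories and the prompt are placeholders, and the Qwen-VL vision-encoder/prompt-table plumbing is omitted; this only compares the language-model engines under greedy decoding:

```python
# Sketch: run the same prompt through the fp16 and fp32 engines and compare
# the generated text. Engine directory names are placeholders; the Qwen-VL
# vision encoder and its prompt-table inputs are intentionally omitted.
import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
prompt = "Describe the image."  # placeholder prompt
input_ids = torch.tensor(tokenizer.encode(prompt), dtype=torch.int32)

for engine_dir in ["qwen_vl_engine_fp16", "qwen_vl_engine_fp32"]:  # placeholder paths
    runner = ModelRunner.from_dir(engine_dir=engine_dir, rank=0)
    outputs = runner.generate(
        [input_ids],
        max_new_tokens=128,
        end_id=tokenizer.eod_id,  # Qwen's tokenizer exposes eod_id
        pad_id=tokenizer.eod_id,
        temperature=1.0,
        top_k=1,                  # greedy decoding so the two runs are comparable
        return_dict=True,
        output_sequence_lengths=True,
    )
    seq_len = outputs["sequence_lengths"][0][0]
    seq = outputs["output_ids"][0][0][:seq_len]
    # output_ids include the prompt, so strip it before decoding
    print(engine_dir, "->", tokenizer.decode(seq[input_ids.shape[-1]:]))
```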

Expected behavior

The inference results should be consistent with those of the reference implementation at https://github.com/QwenLM/Qwen-VL.
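For comparison, the reference output can be produced with the upstream Qwen-VL-Chat model via its chat interface, roughly as in the Qwen-VL README (the image path here is a placeholder):

```python
# Reference run with the upstream HF Qwen-VL-Chat model, following the
# Qwen-VL README. The image path and question are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="cuda", trust_remote_code=True
).eval()

query = tokenizer.from_list_format([
    {"image": "demo.jpeg"},          # placeholder image path
    {"text": "Describe the image."},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```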

Actual behavior

The TensorRT-LLM output is not consistent with the reference implementation, and the fp16 and fp32 builds also disagree with each other.

Additional notes

None.

xiangxinhello commented 1 month ago

Hi @lfr-0531, can you help me solve this problem? Thanks!

lfr-0531 commented 1 month ago

This looks similar to https://github.com/NVIDIA/TensorRT-LLM/issues/2241.
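As a general note on why fp16 and fp32 builds can diverge: when the top two logits are nearly tied, fp16 rounding can flip the greedy argmax, and the divergence then compounds token by token. A small self-contained illustration with synthetic numbers (not Qwen-VL weights; it models only the rounding of fp16 storage, while real fp16 GEMMs add accumulation error on top):

```python
# Self-contained illustration (synthetic data, not Qwen-VL weights): rounding
# the inputs of a logit projection to fp16 perturbs the logits; when the top-2
# gap is smaller than that perturbation, the greedy argmax can flip, which is
# enough to change the whole generated continuation.
import torch

torch.manual_seed(0)
hidden = torch.randn(1, 4096)              # a fake hidden state
lm_head = torch.randn(4096, 32000) / 64    # a fake LM head, scaled to keep logits small

logits_fp32 = hidden @ lm_head
# Simulate fp16 storage: round both operands to fp16, then matmul in fp32.
logits_fp16 = hidden.half().float() @ lm_head.half().float()

top2 = logits_fp32.topk(2).values
gap = (top2[0, 0] - top2[0, 1]).item()
max_err = (logits_fp32 - logits_fp16).abs().max().item()

print(f"fp32 argmax={logits_fp32.argmax(-1).item()}, "
      f"fp16 argmax={logits_fp16.argmax(-1).item()}, "
      f"top-2 gap={gap:.4f}, max |diff|={max_err:.4f}")
```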