NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

LLaVA batch inference: only the result corresponding to the longest prompt is correct, while the other results are incorrect #1881

Open · lss15151161 opened this issue 3 months ago

lss15151161 commented 3 months ago

Version: TensorRT-LLM 0.10.0. The official script (TensorRT-LLM/examples/multimodal/run.py) repeats the same prompt to form a batch, but if I use different prompts to form a batch, the results are incorrect. How can I solve this? Since the result corresponding to the longest prompt is correct, I suspect the cause is padding.

(screenshot: outputs for a batch of different prompts; only the longest prompt's result is correct)

If I use the same prompt for every item in the batch, the results are correct.

(screenshot: outputs when the same prompt is repeated across the batch)
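
For context, a minimal sketch of the difference being reported (this is not the actual examples/multimodal/run.py code; the checkpoint name and prompts are assumptions for illustration): repeating one prompt gives every row in the batch the same length, while mixing prompts of different lengths forces the shorter rows to be padded up to the longest one.

```python
from transformers import AutoTokenizer

# Assumed checkpoint name, for illustration only.
tokenizer = AutoTokenizer.from_pretrained("llava-hf/llava-1.5-7b-hf")
if tokenizer.pad_token is None:
    # LLaMA-style tokenizers often ship without a pad token.
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"  # pad at the end, as described in this issue

same = ["USER: <image>\nDescribe this image. ASSISTANT:"] * 2
mixed = [
    "USER: <image>\nDescribe this image. ASSISTANT:",
    "USER: <image>\nWhat color is the car parked outside the building? ASSISTANT:",
]

same_batch = tokenizer(same, return_tensors="pt", padding=True)
mixed_batch = tokenizer(mixed, return_tensors="pt", padding=True)

print(same_batch.attention_mask)   # all ones: equal lengths, no padding needed
print(mixed_batch.attention_mask)  # trailing zeros where the shorter prompt was padded
```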
amukkara commented 3 months ago

@lss15151161 This example does not support using different prompts in a batch. Yes, the issue is that pad tokens will be added to the end of the shorter post_prompt when the prompts are different.
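
To illustrate the explanation above, here is a hypothetical sketch (made-up token ids, not TensorRT-LLM code) of where the pad tokens land with right padding, alongside the left-padded layout that batched decoder-only text models commonly use to avoid the problem; whether left padding can be applied to this multimodal pipeline (pre_prompt + image features + post_prompt) is not confirmed in this thread.

```python
import torch

PAD = 0  # hypothetical pad token id

# Hypothetical token ids for two prompts in one batch:
longer = [7, 8, 9, 10, 11, 12]   # longest prompt in the batch
shorter = [7, 8, 9, 13]          # shorter post_prompt

max_len = max(len(longer), len(shorter))

# Right padding (what the issue describes): pads end up *after* the shorter
# post_prompt, i.e. between the prompt and the position where generation starts.
right_padded = torch.tensor([
    longer + [PAD] * (max_len - len(longer)),
    shorter + [PAD] * (max_len - len(shorter)),   # [7, 8, 9, 13, 0, 0]
])

# Left padding, a common workaround for batching decoder-only text models:
# the end of every prompt stays aligned with the last column of the batch.
left_padded = torch.tensor([
    [PAD] * (max_len - len(longer)) + longer,
    [PAD] * (max_len - len(shorter)) + shorter,   # [0, 0, 7, 8, 9, 13]
])

print(right_padded)
print(left_padded)
```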

lss15151161 commented 2 months ago

@lss15151161 This example does not support using different prompts in a batch. Yes, the issue is that pad tokens will be added to the end of the shorter post_prompt when the prompts are different.

Thanks for the reply~ So, what should I do if I want to do batch inference?

lss15151161 commented 2 months ago

@lss15151161 This example does not support using different prompts in a batch. Yes, the issue is that pad tokens will be added to the end of the shorter post_prompt when the prompts are different.

And doesn't TensorRT-LLM remove the pads internally?
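
For reference, a conceptual sketch of what "removing input padding" usually means: the real tokens of all sequences are packed into a single 1-D tensor, accompanied by per-sequence lengths, so no pad tokens are fed to the model at all. TensorRT-LLM exposes a remove_input_padding build option along these lines, but whether and how the multimodal example makes use of it is not established in this thread; the code below is only an illustration of the packing idea.

```python
import torch

PAD = 0  # hypothetical pad token id

# A right-padded batch of two sequences with different lengths.
padded = torch.tensor([
    [7, 8, 9, 10, 11, 12],
    [7, 8, 9, 13, PAD, PAD],
])
lengths = torch.tensor([6, 4])

# Packed ("padding removed") form: concatenate only the real tokens and keep
# the per-sequence lengths so the runtime knows where each sequence ends.
packed = torch.cat([row[: int(n)] for row, n in zip(padded, lengths)])

print(packed)   # tensor([ 7,  8,  9, 10, 11, 12,  7,  8,  9, 13])
print(lengths)  # tensor([6, 4])
```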