NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

[ModelRunner] Fix stop and bad words list contiguous for offsets #1815

Open Marks101 opened 1 week ago

Marks101 commented 1 week ago

In our regression tests of the ModelRunner we noticed that in the current main branch (Jun 18, 2024) the stop_words_list feature does not work properly for batch_size > 1. The issue seems to be that the token arrays are not contiguously laid out in memory due to the transpose that is done in this line:

https://github.com/NVIDIA/TensorRT-LLM/blob/2a115dae84f13daaa54727534daa837c534eceb4/tensorrt_llm/runtime/generation.py#L104

As a result, the offsets computed from these arrays are invalid.
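The root cause can be reproduced in isolation. The following NumPy sketch (shapes and variable names are illustrative, not the actual TensorRT-LLM tensors) shows how a transpose produces a strided view whose memory order no longer matches flat offsets computed from its shape, and how materializing a contiguous copy restores the expected layout:

```python
import numpy as np

# Illustrative layout: axis 0 separates token ids from offsets,
# roughly mirroring the [2, batch, max_len] words-list encoding.
arr = np.arange(12, dtype=np.int32).reshape(2, 2, 3)

# A transpose returns a strided view, not a copy, so the
# underlying buffer is no longer in row-major order:
view = arr.transpose(1, 0, 2)
assert not view.flags['C_CONTIGUOUS']

# Code that derives flat offsets from the shape alone would now
# index the wrong elements in the raw buffer. Forcing a
# contiguous copy makes memory order match the logical order:
fixed = np.ascontiguousarray(view)
assert fixed.flags['C_CONTIGUOUS']
assert (fixed == view).all()  # same logical values, compact layout
```

In PyTorch the analogous fix is calling `.contiguous()` on the transposed tensor before passing its data pointer onward.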

In examples/run.py this feature was deactivated for a long time, but it seems that the contiguous copy was originally made here:

https://github.com/NVIDIA/TensorRT-LLM/blob/b777bd64750abf30ca7eda48e8b6ba3c5174aafd/examples/run.py#L403

Thanks for taking a look at this!

MartinMarciniszyn commented 4 days ago

@Funatiq , could you please merge this into the main branch?

nv-guomingz commented 4 days ago

> @Funatiq , could you please merge this into the main branch?

@byshiue already merged this PR into internal code base this morning.