In our regression tests of the `ModelRunner` we noticed that in the current main branch (Jun 18, 2024) the `stop_words_list` feature does not work properly for `batch_size > 1`. The issue seems to be that the token arrays are not contiguously laid out in memory due to the transpose that is done in this line: https://github.com/NVIDIA/TensorRT-LLM/blob/2a115dae84f13daaa54727534daa837c534eceb4/tensorrt_llm/runtime/generation.py#L104
This invalidates the array offsets that are computed from the buffer.
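To illustrate the failure mode, here is a small NumPy sketch (hypothetical shapes, not the actual TensorRT-LLM code): a transpose only swaps strides and returns a view, so code that computes offsets into the raw, row-major buffer reads the wrong tokens once `batch_size > 1`.

```python
import numpy as np

# Hypothetical stand-in for a stop-words array of shape
# (batch_size, 2, max_len) with batch_size = 2.
words = np.arange(12, dtype=np.int32).reshape(2, 2, 3)

# Transposing swaps strides instead of moving data: the result is a
# non-contiguous view over the unchanged underlying buffer.
t = words.transpose(1, 0, 2)
print(t.flags["C_CONTIGUOUS"])  # False

# What the raw buffer actually contains vs. what a contiguous
# transpose would contain:
raw = np.frombuffer(words.tobytes(), dtype=np.int32)
flat = t.ravel()  # logical (contiguous) order of the transposed view
print(np.array_equal(raw, flat))  # False: raw-buffer offsets are invalid

# Materializing the transpose restores a valid row-major layout:
fixed = np.ascontiguousarray(t)
print(np.array_equal(np.frombuffer(fixed.tobytes(), dtype=np.int32), flat))  # True
```

With `batch_size == 1` the transpose happens to leave the buffer order unchanged, which would explain why the bug only shows up for larger batches.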
In `examples/run.py` this feature was deactivated for a long time, but it seems that the contiguous copy was originally implemented here: https://github.com/NVIDIA/TensorRT-LLM/blob/b777bd64750abf30ca7eda48e8b6ba3c5174aafd/examples/run.py#L403
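A possible fix along those lines would be to materialize the array right after the transpose, before any offsets are derived from it. This is only a sketch with assumed shapes and a hypothetical helper name, not the project's actual API:

```python
import numpy as np

def prepare_stop_words(stop_words: np.ndarray) -> np.ndarray:
    """Hypothetical helper: transpose the (batch, 2, max_len) stop-words
    array and force a contiguous row-major copy so that downstream
    offset arithmetic over the raw buffer stays valid."""
    transposed = stop_words.transpose(1, 0, 2)
    return np.ascontiguousarray(transposed)

batch = np.arange(12, dtype=np.int32).reshape(2, 2, 3)
prepared = prepare_stop_words(batch)
print(prepared.flags["C_CONTIGUOUS"])  # True
```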
Thanks for taking a look at this.