In our regression tests of the `ModelRunner` we noticed that in the current main branch (Jun 18, 2024) the `stop_words_list` feature does not work properly for `batch_size > 1`. The issue seems to be that the token arrays are not contiguously laid out in memory due to the transpose that is done in this line: https://github.com/NVIDIA/TensorRT-LLM/blob/2a115dae84f13daaa54727534daa837c534eceb4/tensorrt_llm/runtime/generation.py#L104
This invalidates the array offsets that are computed from the buffer.
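To illustrate the failure mode, here is a small NumPy sketch (hypothetical shapes, not the actual TensorRT-LLM code): a transpose only swaps strides and returns a view, so code that computes offsets into the raw, row-major buffer reads the wrong tokens once `batch_size > 1`.

```python
import numpy as np

# Hypothetical stand-in for a stop-words array of shape
# (batch_size, 2, max_len) with batch_size = 2.
words = np.arange(12, dtype=np.int32).reshape(2, 2, 3)

# Transposing swaps strides instead of moving data: the result is a
# non-contiguous view over the unchanged underlying buffer.
t = words.transpose(1, 0, 2)
print(t.flags["C_CONTIGUOUS"])  # False

# What the raw buffer actually contains vs. what a contiguous
# transpose would contain:
raw = np.frombuffer(words.tobytes(), dtype=np.int32)
flat = t.ravel()  # logical (contiguous) order of the transposed view
print(np.array_equal(raw, flat))  # False: raw-buffer offsets are invalid

# Materializing the transpose restores a valid row-major layout:
fixed = np.ascontiguousarray(t)
print(np.array_equal(np.frombuffer(fixed.tobytes(), dtype=np.int32), flat))  # True
```

With `batch_size == 1` the transpose happens to leave the buffer order unchanged, which would explain why the bug only shows up for larger batches.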
In `examples/run.py` this feature was deactivated for a long time, but it seems that the contiguous copy was originally implemented here: https://github.com/NVIDIA/TensorRT-LLM/blob/b777bd64750abf30ca7eda48e8b6ba3c5174aafd/examples/run.py#L403
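A possible fix along those lines would be to materialize the array right after the transpose, before any offsets are derived from it. This is only a sketch with assumed shapes and a hypothetical helper name, not the project's actual API:

```python
import numpy as np

def prepare_stop_words(stop_words: np.ndarray) -> np.ndarray:
    """Hypothetical helper: transpose the (batch, 2, max_len) stop-words
    array and force a contiguous row-major copy so that downstream
    offset arithmetic over the raw buffer stays valid."""
    transposed = stop_words.transpose(1, 0, 2)
    return np.ascontiguousarray(transposed)

batch = np.arange(12, dtype=np.int32).reshape(2, 2, 3)
prepared = prepare_stop_words(batch)
print(prepared.flags["C_CONTIGUOUS"])  # True
```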
Thanks for taking a look at this.