TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
Hi @tloen, the issue should be addressed after this PR. Can you please try it and see if that solves the problem? Feel free to let us know if there are any more questions, thanks!
System Info
Who can help?
@kaiyux @byshiue
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Inside examples/run.py, add a for loop around the generation call so that the same request is generated several times in one process, as in the sketch below.
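For reference, a minimal sketch of that loop, assuming the ModelRunner entry point from the TensorRT-LLM Python runtime and greedy decoding; the placeholder paths, the prompt, and the exact generate() keyword arguments are illustrative and may differ from what examples/run.py actually does.

```python
# Hypothetical, simplified stand-in for the modified examples/run.py loop.
# ModelRunner/generate names follow the tensorrt_llm Python runtime; the
# placeholder paths and prompt are assumptions for illustration only.
import torch
from transformers import AutoTokenizer
from tensorrt_llm.runtime import ModelRunner

tokenizer = AutoTokenizer.from_pretrained("<hf_model_dir>")  # placeholder path
runner = ModelRunner.from_dir(engine_dir="<engine_dir>")      # placeholder path

input_ids = tokenizer("Once upon a time", return_tensors="pt").input_ids[0].int()

# Run the same request several times in one process. With greedy decoding,
# every iteration should return exactly the same token IDs.
reference = None
for i in range(5):
    with torch.no_grad():
        output_ids = runner.generate(
            [input_ids],
            max_new_tokens=64,
            end_id=tokenizer.eos_token_id,
            pad_id=tokenizer.eos_token_id,
        )
    tokens = output_ids[0][0].tolist()
    if reference is None:
        reference = tokens
    print(f"iteration {i}: matches first run = {tokens == reference}")
    print(tokenizer.decode(tokens, skip_special_tokens=True))
```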
Expected behavior
Deterministic and correct responses on every iteration of the loop.
actual behavior
Nondeterministic and incorrect responses after the first iteration.
additional notes
The model uses the Llama architecture, and max_draft_len is 107. The error does not occur when the number of verification branches is zero or when the window size is 1.