TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
Tasks

- [x] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Use the `run.py --run_profiling` script with different quantized models to compare their performance, and notice that the results are inconsistent and do not make sense.
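For illustration, a comparison along these lines reproduces the problem (the engine and tokenizer paths and the 200-token length below are placeholders, not values from the original report):

```bash
# Profile an FP16 engine; --run_profiling reports timings for repeated generation runs.
python examples/run.py \
    --engine_dir ./engines/llama-7b-fp16 \
    --tokenizer_dir ./models/llama-7b \
    --max_output_len 200 \
    --run_profiling

# Profile a quantized engine built from the same model, then compare the reported timings.
python examples/run.py \
    --engine_dir ./engines/llama-7b-int4-awq \
    --tokenizer_dir ./models/llama-7b \
    --max_output_len 200 \
    --run_profiling
```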
Expected behavior
Passing the `--run_profiling` flag should disable the stop token so the generation always runs for exactly the length specified in `--max_output_len`.
actual behavior
Passing the `--run_profiling` flag measures the time it takes to run generation 10 times, but each run produces a varying number of tokens (depending on whether and when the model emits a stop token), which makes the numbers useless for comparing models. For example, an engine that happens to stop after 40 tokens will appear much faster than one that generates the full `--max_output_len` tokens, even if their per-token latency is identical.
additional notes
Passing `--end_id 12` (or some other unused token id) works as a workaround: it effectively disables the stop token and makes the profiling results consistent.
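For instance (again with placeholder paths), forcing an end id that the model never emits makes every profiled run generate exactly `--max_output_len` tokens:

```bash
# Workaround: use a token id the model never produces as the end id,
# so generation never stops early and every profiling run has the same length.
python examples/run.py \
    --engine_dir ./engines/llama-7b-int4-awq \
    --tokenizer_dir ./models/llama-7b \
    --max_output_len 200 \
    --end_id 12 \
    --run_profiling
```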
System Info
Latest git master, system info not relevant
Who can help?
@kaiyux