TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
Tasks

- [x] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Use the `run.py --run_profiling` script with different quantized models to compare their performance, and notice that the results are inconsistent and do not make sense.
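For illustration, a comparison along these lines reproduces the problem (the engine and tokenizer paths and the 200-token length below are placeholders, not values from the original report):

```bash
# Profile an FP16 engine; --run_profiling reports timings for repeated generation runs.
python examples/run.py \
    --engine_dir ./engines/llama-7b-fp16 \
    --tokenizer_dir ./models/llama-7b \
    --max_output_len 200 \
    --run_profiling

# Profile a quantized engine built from the same model, then compare the reported timings.
python examples/run.py \
    --engine_dir ./engines/llama-7b-int4-awq \
    --tokenizer_dir ./models/llama-7b \
    --max_output_len 200 \
    --run_profiling
```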
Expected behavior
Passing the `--run_profiling` flag should disable the stop token so the generation always runs for exactly the length specified in `--max_output_len`.
actual behavior
Passing the `--run_profiling` flag measures the time it takes to run generation 10 times, but each run produces a varying number of tokens (depending on whether and when the model emits a stop token), which makes the numbers useless for comparing models. For example, an engine that happens to stop after 40 tokens will appear much faster than one that generates the full `--max_output_len` tokens, even if their per-token latency is identical.
additional notes
Passing `--end_id 12` (or some other unused token id) works as a workaround: it effectively disables the stop token and makes the profiling results consistent.
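For instance (again with placeholder paths), forcing an end id that the model never emits makes every profiled run generate exactly `--max_output_len` tokens:

```bash
# Workaround: use a token id the model never produces as the end id,
# so generation never stops early and every profiling run has the same length.
python examples/run.py \
    --engine_dir ./engines/llama-7b-int4-awq \
    --tokenizer_dir ./models/llama-7b \
    --max_output_len 200 \
    --end_id 12 \
    --run_profiling
```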
System Info
Latest git master, system info not relevant
Who can help?
@kaiyux