NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

run.py --run_profiling respects stop token and is unsuitable for performance comparisons #2407

Open aikitoria opened 1 week ago

aikitoria commented 1 week ago

System Info

Latest git master, system info not relevant

Who can help?

@kaiyux

Reproduction

Run the run.py script with --run_profiling against different quantized models to compare their performance, and notice that the results are inconsistent and do not make sense.

Expected behavior

Passing the --run_profiling flag should disable the stop token so that generation always runs for exactly the length specified in --max_output_len.

Actual behavior

Passing the --run_profiling flag measures the time it takes to run 10 generations of varying length (depending on whether and when the model emits a stop token), which makes the results useless for performance comparisons.
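A minimal sketch (plain Python, all timing numbers hypothetical, not TensorRT-LLM code) of why variable output length makes wall-clock comparisons meaningless:

```python
# Illustration: why raw latency is misleading when the stop token makes
# output lengths differ between the models being compared.

def tokens_per_second(elapsed_ms: float, tokens: int) -> float:
    """Throughput in tokens/s for one generation pass."""
    return tokens / (elapsed_ms / 1000.0)

# Hypothetical numbers: model A emits a stop token after 40 tokens at
# 10 ms/token; model B generates the full 100 tokens at 12 ms/token.
a_elapsed_ms, a_tokens = 40 * 10.0, 40     # 400 ms total
b_elapsed_ms, b_tokens = 100 * 12.0, 100   # 1200 ms total

# Comparing raw latency makes A look 3x faster, but only because it
# produced fewer tokens. Per-token throughput shows A is ~20% faster:
print(tokens_per_second(a_elapsed_ms, a_tokens))  # 100.0 tok/s
print(tokens_per_second(b_elapsed_ms, b_tokens))  # ~83.3 tok/s
```

Forcing both runs to generate exactly --max_output_len tokens removes the denominator mismatch, so wall-clock times become directly comparable.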

Additional notes

Passing --end_id 12 (or any other unused token id) works around the issue by effectively disabling the stop token, which makes the profiling results consistent.
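A sketch of the workaround invocation, assuming the usual examples/run.py flags; the engine and tokenizer paths are placeholders:

```shell
# Pick an end_id the model never emits (12 here, per the report above)
# so every profiling run generates exactly --max_output_len tokens.
python3 examples/run.py \
    --engine_dir ./engine_dir \
    --tokenizer_dir ./tokenizer_dir \
    --max_output_len 100 \
    --run_profiling \
    --end_id 12
```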

hello-11 commented 1 week ago

@aikitoria Thanks for your interest in TrtLLM. You can use the scripts in the benchmarks folder to run your performance comparison.