NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

How to identify the rest token latency? #1761

Open RobinJYM opened 5 months ago

RobinJYM commented 5 months ago

System Info

Who can help?

@kaiyux

Information

Tasks

Reproduction

run benchmark.py

Expected behavior

as expected

actual behavior

Hi, here is the output of `--input_output_len "1024,512"`:

[BENCHMARK] model_name chatglm3_6b world_size 1 num_heads 32 num_kv_heads 2 num_layers 28 hidden_size 4096 vocab_size 65024 precision float16 batch_size 1 gpu_weights_percent 1.0 input_length 1024 output_length 512 gpu_peak_mem(gb) 13.99 build_time(s) 11.05 tokens_per_sec 62.04 percentile95(ms) 8261.234 percentile99(ms) 8261.234 **latency(ms) 8252.869** compute_cap sm86 quantization QuantMode.0 **generation_time(ms) 8062.798 total_generated_tokens 511.0** generation_tokens_per_second 63.377

and here is the output of `--input_output_len "1024,1"`:

[BENCHMARK] model_name chatglm3_6b world_size 1 num_heads 32 num_kv_heads 2 num_layers 28 hidden_size 4096 vocab_size 65024 precision float16 batch_size 1 gpu_weights_percent 1.0 input_length 1024 output_length 1 gpu_peak_mem(gb) 12.742 build_time(s) 0 tokens_per_sec 5.22 percentile95(ms) 192.648 percentile99(ms) 192.727 **latency(ms) 191.697** compute_cap sm86 quantization QuantMode.0 generation_time(ms) 0.013 total_generated_tokens 0.0 generation_tokens_per_second 0.0

How can we get the rest (second and later) token latency? Is it generation_time / total_generated_tokens = 8062.798 / 511 ≈ 15.78 ms? (A small sketch of this computation follows this comment.)

Thanks! BR

additional notes

No
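For illustration only, here is a minimal Python sketch (not part of TensorRT-LLM) that parses a `[BENCHMARK]` line like the ones quoted above into a dict and applies the per-token estimate proposed in the question. The field names come from the quoted output; the helper function name and the truncated example line are made up for this sketch.

```python
def parse_benchmark_line(line: str) -> dict:
    """Turn '[BENCHMARK] key1 val1 key2 val2 ...' into {key1: val1, ...}."""
    tokens = line.replace("[BENCHMARK]", "").split()
    fields = dict(zip(tokens[0::2], tokens[1::2]))
    # Convert numeric-looking values to float, leave the rest as strings.
    for key, value in fields.items():
        try:
            fields[key] = float(value)
        except ValueError:
            pass
    return fields

# Truncated example taken from the "1024,512" output above.
line = ("[BENCHMARK] model_name chatglm3_6b batch_size 1 input_length 1024 "
        "output_length 512 latency(ms) 8252.869 generation_time(ms) 8062.798 "
        "total_generated_tokens 511.0")
metrics = parse_benchmark_line(line)

# Per-token estimate proposed in the question:
per_token_ms = metrics["generation_time(ms)"] / metrics["total_generated_tokens"]
print(f"~{per_token_ms:.2f} ms per generated token")  # ~15.78 ms
```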

nv-guomingz commented 5 months ago

Hi @kaiyux, would you please take a look at this question?

RobinJYM commented 5 months ago

Any insights?

github-actions[bot] commented 4 months ago

This issue is stale because it has been open 30 days with no activity. Remove the stale label or comment, or this will be closed in 15 days.

kaiyux commented 12 hours ago

Hi @RobinJYM, generation_time here means the latency of the generation stage, so if I understand the question correctly and you want the latency of the "rest tokens apart from the first token", you can just use generation_time in the report.
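To make that concrete, below is a rough sketch of the breakdown this implies, using the numbers from the "1024,512" run quoted above. It assumes latency(ms) is the end-to-end request latency and generation_time(ms) covers everything after the first token; that reading is an interpretation of the report, not a documented guarantee.

```python
# Numbers from the --input_output_len "1024,512" run quoted in this issue.
latency_ms = 8252.869            # reported latency(ms), end-to-end
generation_time_ms = 8062.798    # reported generation_time(ms)
total_generated_tokens = 511     # reported total_generated_tokens

# Assumption: generation_time covers everything after the first token,
# so the remainder approximates the time to the first token.
ttft_ms = latency_ms - generation_time_ms
per_token_ms = generation_time_ms / total_generated_tokens

print(f"time to first token ~= {ttft_ms:.1f} ms")     # ~190.1 ms
print(f"rest-token latency  ~= {per_token_ms:.2f} ms") # ~15.78 ms
```

Under that assumption, the estimated time to first token (~190 ms) is close to the latency(ms) of the "1024,1" run (191.697 ms), which is consistent with the interpretation above.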

BTW, please note that gptSessionBenchmark is deprecated because we no longer recommend benchmarking static batching. Please use trtllm-bench or gptManagerBenchmark instead. We're actively working on the trtllm-bench command to make it stable and ready for reproducing performance numbers.

Please refer to perf-overview.md and the cpp benchmark for more details. Thanks a lot for the support.

nv-guomingz commented 10 hours ago

Hi @RobinJYM, do you still have any further issues or questions? If not, we'll close this soon.