NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

How to identify the rest token latency? #1761

Open RobinJYM opened 3 months ago

RobinJYM commented 3 months ago

System Info

Who can help?

@kaiyux

Information

Tasks

Reproduction

Run benchmark.py with --input_output_len "1024,512" and again with --input_output_len "1024,1".

Expected behavior

as expected

actual behavior

Hi, here is the output with --input_output_len "1024,512":

[BENCHMARK] model_name chatglm3_6b world_size 1 num_heads 32 num_kv_heads 2 num_layers 28 hidden_size 4096 vocab_size 65024 precision float16 batch_size 1 gpu_weights_percent 1.0 input_length 1024 output_length 512 gpu_peak_mem(gb) 13.99 build_time(s) 11.05 tokens_per_sec 62.04 percentile95(ms) 8261.234 percentile99(ms) 8261.234 **latency(ms) 8252.869** compute_cap sm86 quantization QuantMode.0 **generation_time(ms) 8062.798 total_generated_tokens 511.0** generation_tokens_per_second 63.377

And here is the output with --input_output_len "1024,1":

[BENCHMARK] model_name chatglm3_6b world_size 1 num_heads 32 num_kv_heads 2 num_layers 28 hidden_size 4096 vocab_size 65024 precision float16 batch_size 1 gpu_weights_percent 1.0 input_length 1024 output_length 1 gpu_peak_mem(gb) 12.742 build_time(s) 0 tokens_per_sec 5.22 percentile95(ms) 192.648 percentile99(ms) 192.727 **latency(ms) 191.697** compute_cap sm86 quantization QuantMode.0 generation_time(ms) 0.013 total_generated_tokens 0.0 generation_tokens_per_second 0.0

How can we get the rest-token latency, that is, the per-token latency from the second token onward? Is it generation_time / total_generated_tokens = 8062.798 / 511 ≈ 15.78 ms?

Thanks! BR
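For reference, the arithmetic in the question can be sketched in a few lines of Python. This is only a sanity check on the numbers posted above, not a statement of how benchmark.py defines its metrics. It assumes that generation_time(ms) covers only the decode steps after the first token and that total_generated_tokens excludes the first token (511 = 512 - 1). The function name `rest_token_latency_ms` and the dictionary keys are hypothetical names chosen for this sketch.

```python
def rest_token_latency_ms(generation_time_ms: float, total_generated_tokens: float) -> float:
    """Average latency per token after the first one (hypothetical helper)."""
    if total_generated_tokens <= 0:
        raise ValueError("no tokens were generated after the first one")
    return generation_time_ms / total_generated_tokens


# Values copied from the two [BENCHMARK] lines in this issue.
run_512 = {
    "latency_ms": 8252.869,
    "generation_time_ms": 8062.798,
    "total_generated_tokens": 511.0,
}
run_1 = {"latency_ms": 191.697}

per_token_ms = rest_token_latency_ms(
    run_512["generation_time_ms"], run_512["total_generated_tokens"]
)

# Assumption: whatever is left of the end-to-end latency is the first-token
# (context phase) time, which can be compared with the output_length=1 run.
implied_first_token_ms = run_512["latency_ms"] - run_512["generation_time_ms"]

print(f"rest-token latency: {per_token_ms:.2f} ms/token")
print(f"implied first-token latency: {implied_first_token_ms:.3f} ms")
print(f"measured latency with output_length=1: {run_1['latency_ms']:.3f} ms")
```

Under these assumptions the per-token figure comes out at roughly 15.78 ms, whose reciprocal agrees with the reported generation_tokens_per_second of 63.377. The implied first-token time of about 190 ms is also close to the 191.697 ms measured in the "1024,1" run, which suggests the interpretation is at least consistent with the posted numbers.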

additional notes

No

nv-guomingz commented 3 months ago

Hi @kaiyux, would you please take a look at this question?

RobinJYM commented 3 months ago

Any insights?

github-actions[bot] commented 2 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.