How to identify the rest toke latency?

RobinJYM commented 5 months ago

System Info

CPU: INTEL RPL
GPU Name: NVIDIA GTX 3090
TensorRT-LLM: tensorrt_llm==0.11.0.dev2024060400
Container Used: Yes and reproduced in Conda as well
Driver Version: 555.42.02
CUDA Version: 12.5
OS: Ubuntu 24.04
Docker Img: nvidia/cuda:12.5.0-devel-ubuntu22.04

Who can help?

@kaiyux

Information

[X] The official example scripts
[ ] My own modified scripts

Tasks

[X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)

Reproduction

run benchmark.py

Expected behavior

as expected

actual behavior

Hi, Here is the output of --input_output_len "1024,512": [BENCHMARK] model_name chatglm3_6b world_size 1 num_heads 32 num_kv_heads 2 num_layers 28 hidden_size 4096 vocab_size 65024 precision float16 batch_size 1 gpu_weights_percent 1.0 input_length 1024 output_length 512 gpu_peak_mem(gb) 13.99 build_time(s) 11.05 tokens_per_sec 62.04 percentile95(ms) 8261.234 percentile99(ms) 8261.234 **latency(ms) 8252.869** compute_cap sm86 quantization QuantMode.0 **generation_time(ms) 8062.798 total_generated_tokens 511.0** generation_tokens_per_second 63.377 and here is the output of --input_output_len "1024,1": [BENCHMARK] model_name chatglm3_6b world_size 1 num_heads 32 num_kv_heads 2 num_layers 28 hidden_size 4096 vocab_size 65024 precision float16 batch_size 1 gpu_weights_percent 1.0 input_length 1024 output_length 1 gpu_peak_mem(gb) 12.742 build_time(s) 0 tokens_per_sec 5.22 percentile95(ms) 192.648 percentile99(ms) 192.727 **latency(ms) 191.697** compute_cap sm86 quantization QuantMode.0 generation_time(ms) 0.013 total_generated_tokens 0.0 generation_tokens_per_second 0.0 How can we get rest(second) token latency? is it generation_time/total_generated_tokens = 8062.768/511 = 15.77 ?

Thanks! BR

additional notes

No

nv-guomingz commented 5 months ago

Hi @kaiyux ,would u please take a look this question?

RobinJYM commented 5 months ago

Any Insights?

github-actions[bot] commented 4 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days."

kaiyux commented 12 hours ago

Hi @RobinJYM , generation_time here means latency of generation stage, so if I understand the question correctly, if you want the latency of "rest tokens apart from the first token", you could just use generation_time in the report.

BTW, please note that gptSessionBenchmark is deprecated because we are not recommending to benchmark static batching anymore. Please use trtllm-bench or gptManagerBenchmark instead. We're actively working on trtllm-bench command to make it stable and ready to reproduce performance numbers.

Please refer to perf-overview.md and cpp benchmark for more details. Thanks a lot for the support.

nv-guomingz commented 10 hours ago

Hi @RobinJYM do u still have further issue or question now? If not, we'll close it soon.

NVIDIA / TensorRT-LLM