RobinJYM opened 5 months ago
Hi @kaiyux, would you please take a look at this question? Any insights?
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 15 days.
Hi @RobinJYM, `generation_time` here means the latency of the generation stage, so if I understand the question correctly, and you want the latency of the "rest tokens apart from the first token", you can just use `generation_time` in the report.
BTW, please note that `gptSessionBenchmark` is deprecated because we no longer recommend benchmarking static batching. Please use `trtllm-bench` or `gptManagerBenchmark` instead. We're actively working on the `trtllm-bench` command to make it stable and ready to reproduce performance numbers. Please refer to perf-overview.md and the cpp benchmark for more details. Thanks a lot for the support.
Hi @RobinJYM, do you still have any further issues or questions? If not, we'll close this soon.
System Info
Who can help?
@kaiyux
Information
Tasks
An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
Run `benchmark.py`.
Expected behavior
as expected
Actual behavior
Hi, here is the output of `--input_output_len "1024,512"`:
[BENCHMARK] model_name chatglm3_6b world_size 1 num_heads 32 num_kv_heads 2 num_layers 28 hidden_size 4096 vocab_size 65024 precision float16 batch_size 1 gpu_weights_percent 1.0 input_length 1024 output_length 512 gpu_peak_mem(gb) 13.99 build_time(s) 11.05 tokens_per_sec 62.04 percentile95(ms) 8261.234 percentile99(ms) 8261.234 **latency(ms) 8252.869** compute_cap sm86 quantization QuantMode.0 **generation_time(ms) 8062.798 total_generated_tokens 511.0** generation_tokens_per_second 63.377
And here is the output of `--input_output_len "1024,1"`:

[BENCHMARK] model_name chatglm3_6b world_size 1 num_heads 32 num_kv_heads 2 num_layers 28 hidden_size 4096 vocab_size 65024 precision float16 batch_size 1 gpu_weights_percent 1.0 input_length 1024 output_length 1 gpu_peak_mem(gb) 12.742 build_time(s) 0 tokens_per_sec 5.22 percentile95(ms) 192.648 percentile99(ms) 192.727 **latency(ms) 191.697** compute_cap sm86 quantization QuantMode.0 generation_time(ms) 0.013 total_generated_tokens 0.0 generation_tokens_per_second 0.0
How can we get the latency of the rest (i.e. second and later) tokens? Is it `generation_time / total_generated_tokens` = 8062.798 / 511 ≈ 15.78 ms? Thanks! BR
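The arithmetic above can be sketched as a small script. This is a minimal sketch using the numbers from the `[BENCHMARK]` log lines in this thread; the time-to-first-token estimate (end-to-end latency minus `generation_time`) is my own approximation, not an official metric of `gptSessionBenchmark`.

```python
# Numbers copied from the --input_output_len "1024,512" log above.
latency_ms = 8252.869          # end-to-end latency(ms)
generation_time_ms = 8062.798  # generation_time(ms): generation stage only
total_generated_tokens = 511   # total_generated_tokens (tokens after the first)

# Average latency per token after the first one.
per_token_ms = generation_time_ms / total_generated_tokens
print(f"per-token generation latency: {per_token_ms:.2f} ms")  # ~15.78 ms

# Rough time-to-first-token: everything outside the generation stage.
ttft_ms = latency_ms - generation_time_ms
print(f"approx. time to first token: {ttft_ms:.2f} ms")
```

Note that this approximation is consistent with the `--input_output_len "1024,1"` run, where almost all of the 191.697 ms latency is spent producing the first token and `generation_time` is near zero.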
Additional notes

No