NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

How to get or calculate first-token latency and second/next-token latency in Python runtime and C++ runtime benchmarking? #1871

GunturuSandeep opened this issue 2 days ago

GunturuSandeep commented 2 days ago

I am benchmarking with both the Python runtime and the C++ runtime, but neither output includes first-token or next-token latency values.

Please help me with the process to calculate them.

TensorRT-LLM version : v0.10.0

GPU : Nvidia L40S

Sample output of the C++ runtime:

/app/tensorrt_llm/benchmarks/cpp/gptSessionBenchmark --engine_dir ./engines/llama_7b/fp16/fp16-bs32-beam1-normalusecase/ --warm_up 1 --batch_size 1 --num_runs 10 --input_output_len 1024,128 --beam_width 1

Benchmarking done. Iteration: 10, duration: 25.61 sec. Latencies: [2561.47, 2561.20, 2560.63, 2562.01, 2557.57, 2562.35, 2562.13, 2562.02, 2562.01, 2562.08] [BENCHMARK] batch_size 1 input_length 1024 output_length 128 latency(ms) 2561.35 tokensPerSec 49.97 generation_time(ms) 2492.30 generationTokensPerSec 51.36 gpu_peak_mem(gb) 44.75
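The gptSessionBenchmark summary does not report first-token latency directly, but its fields allow a rough back-of-the-envelope estimate, under the assumption (not confirmed by the benchmark docs) that generation_time(ms) covers only the decode phase, so the remainder of the end-to-end latency approximates prefill/first-token time:

```python
# Rough estimate derived from the [BENCHMARK] line above.
total_latency_ms = 2561.35    # latency(ms)
generation_time_ms = 2492.30  # generation_time(ms)
output_length = 128           # output_length

# Time not spent decoding ~ prefill + first token (assumption, see above).
first_token_ms = total_latency_ms - generation_time_ms
# Average decode-phase latency per generated token.
per_token_ms = generation_time_ms / output_length

print(f"approx. first-token latency: {first_token_ms:.2f} ms")
print(f"approx. per-token latency:   {per_token_ms:.2f} ms")
```

With the numbers above this gives roughly 69 ms for the prefill/first-token phase and about 19.5 ms per generated token; treat these as derived estimates, not measured TTFT.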

Sample output of the Python runtime:

[BENCHMARK] model_name gpt_350m world_size 1 num_heads 16 num_kv_heads 16 num_layers 24 hidden_size 1024 vocab_size 51200 precision float16 batch_size 1 input_length 60 output_length 20 gpu_peak_mem(gb) 4.2 build_time(s) 25.67 tokens_per_sec 483.54 percentile95(ms) 41.537 percentile99(ms) 42.102 latency(ms) 41.362 compute_cap sm80

QiJune commented 2 days ago

@kaiyux Could you please take a look? Thanks

GunturuSandeep commented 2 days ago

Thanks for the response, @QiJune. @kaiyux, could you help me out here? I just need the process for calculating first-token and next-token latency in the Python and C++ runtimes.
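For reference, a generic way to measure both numbers in any runtime is to timestamp each token of a streaming generation: the gap from request start to the first timestamp is the first-token latency, and the gaps between consecutive timestamps are the next-token latencies. A minimal sketch (here `dummy_stream` is a hypothetical stand-in for a streaming generate call, not a TensorRT-LLM API):

```python
import time

def measure_token_latencies(stream):
    """Timestamp every token yielded by a streaming generator and derive
    first-token latency and per-step next-token latencies (seconds)."""
    start = time.perf_counter()
    stamps = []
    for _token in stream:
        stamps.append(time.perf_counter())
    first_token_latency = stamps[0] - start
    next_token_latencies = [b - a for a, b in zip(stamps, stamps[1:])]
    return first_token_latency, next_token_latencies

# Dummy stream that simulates a fixed per-token decode delay.
def dummy_stream(num_tokens, delay_s):
    for _ in range(num_tokens):
        time.sleep(delay_s)
        yield "tok"

ttft, itl = measure_token_latencies(dummy_stream(5, 0.01))
print(f"first-token latency: {ttft * 1e3:.1f} ms")
print(f"mean next-token latency: {sum(itl) / len(itl) * 1e3:.1f} ms")
```

Averaging the per-step gaps over many runs (and discarding warm-up iterations, as gptSessionBenchmark's --warm_up flag does) gives a stable next-token latency figure.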