NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Performance issue with batching #2466

Open ShuaiShao93 opened 2 days ago

ShuaiShao93 commented 2 days ago

System Info

x86_64, Debian 11, L4 GPU

Who can help?

No response

Reproduction

  1. Install tensorrt_llm 0.13.0
  2. Build a Llama 3.1 8B engine with trtllm-build:
    trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_int4_awq --output_dir ./tmp/llama/8B/trt_engines/int4_awq/1-gpu  --gpt_attention_plugin auto  --gemm_plugin auto  --max_num_tokens 32768 --max_batch_size 8 --logits_dtype=float32
  3. Create an input.text file containing a single prompt (batch_size=1).
  4. Run the benchmark:
    python3 TensorRT-LLM/examples/run.py --engine_dir=./tmp/llama/8B/trt_engines/int4_awq/1-gpu --max_output_len 1 --run_profiling --tokenizer_dir ./llama-3.1-8b-probability-finetuned --max_input_length=100000 --input_file input.text
  5. Modify input.text so it contains 2/4/8 copies of the prompt (batch size = 2/4/8); a helper sketch for generating these files follows this list.
  6. Rerun the benchmark for each batch size.
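
For reference, here is a minimal sketch for generating those input files. It assumes run.py's plain-text input format is one prompt per line (one line per batch entry); the file names and the placeholder prompt are illustrative, not from the original report:

    # make_inputs.py -- hypothetical helper, not part of the original report.
    # Assumes run.py treats a plain-text input file as one prompt per line,
    # so a batch of N is the same prompt repeated on N lines.
    prompt = "The quick brown fox jumps over the lazy dog. " * 100  # placeholder

    for batch_size in (1, 2, 4, 8):
        with open(f"input_bs{batch_size}.text", "w") as f:
            # An identical prompt on every line keeps all batch entries the same
            # length, so latency differences come only from the batch dimension.
            f.write("\n".join([prompt] * batch_size) + "\n")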

Expected behavior

The latency shouldn't increase linearly with batch size; that is, batch_size=2 shouldn't take ~2x as long as batch_size=1.

Actual behavior

The latency increases linearly with batch size.

batch_size: 1, avg latency of 1 iterations: : 0.5734167098999023 sec
batch_size: 2, avg latency of 1 iterations: : 1.1703827381134033 sec
batch_size: 4, avg latency of 1 iterations: : 2.7013044357299805 sec
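
Normalizing the reported latencies against the batch_size=1 run (a quick sanity check using only the numbers quoted above):

    # Scaling check computed from the latencies reported above.
    latencies = {
        1: 0.5734167098999023,
        2: 1.1703827381134033,
        4: 2.7013044357299805,
    }
    base = latencies[1]
    for bs, t in latencies.items():
        # Ideal batching keeps t near base; purely linear scaling gives t ~= bs * base.
        print(f"batch_size={bs}: {t / base:.2f}x the batch_size=1 latency")

This prints 1.00x, 2.04x, and 4.71x, i.e., essentially linear (or slightly worse) scaling.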

Additional notes

This bug makes TRT-LLM totally unusable in production, so please treat it as P0.

ShuaiShao93 commented 1 day ago

I just tried 0.14.0 and the behavior is still the same.
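
One way to confirm which release is actually loaded when retesting across versions (tensorrt_llm exposes its version as a module attribute):

    # Verify the installed TensorRT-LLM version before rerunning the benchmark.
    import tensorrt_llm
    print(tensorrt_llm.__version__)  # e.g. "0.14.0" after the upgrade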