NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Performance issue with batching #2466

Open ShuaiShao93 opened 2 days ago

ShuaiShao93 commented 2 days ago

System Info

x86_64, Debian 11, L4 GPU

Who can help?

No response

Reproduction

  1. Install tensorrt_llm 0.13.0
  2. Build a Llama 3.1 8B engine with trtllm-build:
    trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_int4_awq --output_dir ./tmp/llama/8B/trt_engines/int4_awq/1-gpu  --gpt_attention_plugin auto  --gemm_plugin auto  --max_num_tokens 32768 --max_batch_size 8 --logits_dtype=float32
  3. Create an input.text file containing a single prompt (batch_size=1).
  4. Run the benchmark:
    python3 TensorRT-LLM/examples/run.py --engine_dir=./tmp/llama/8B/trt_engines/int4_awq/1-gpu --max_output_len 1 --run_profiling --tokenizer_dir ./llama-3.1-8b-probability-finetuned --max_input_length=100000 --input_file input.text
  5. Modify input.text so it contains 2/4/8 copies of the prompt (batch size = 2/4/8); a helper sketch for generating these files follows this list.
  6. Rerun the benchmark for each batch size.
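
For reference, here is a minimal sketch for generating those input files. It assumes run.py's plain-text input format is one prompt per line (one line per batch entry); the file names and the placeholder prompt are illustrative, not from the original report:

    # make_inputs.py -- hypothetical helper, not part of the original report.
    # Assumes run.py treats a plain-text input file as one prompt per line,
    # so a batch of N is the same prompt repeated on N lines.
    prompt = "The quick brown fox jumps over the lazy dog. " * 100  # placeholder

    for batch_size in (1, 2, 4, 8):
        with open(f"input_bs{batch_size}.text", "w") as f:
            # An identical prompt on every line keeps all batch entries the same
            # length, so latency differences come only from the batch dimension.
            f.write("\n".join([prompt] * batch_size) + "\n")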

Expected behavior

The latency shouldn't increase linearly with batch size; that is, batch_size=2 shouldn't take ~2x as long as batch_size=1.

Actual behavior

The latency increases linearly with batch size.

batch_size: 1, avg latency of 1 iterations: : 0.5734167098999023 sec
batch_size: 2, avg latency of 1 iterations: : 1.1703827381134033 sec
batch_size: 4, avg latency of 1 iterations: : 2.7013044357299805 sec
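
Normalizing the reported latencies against the batch_size=1 run (a quick sanity check using only the numbers quoted above):

    # Scaling check computed from the latencies reported above.
    latencies = {
        1: 0.5734167098999023,
        2: 1.1703827381134033,
        4: 2.7013044357299805,
    }
    base = latencies[1]
    for bs, t in latencies.items():
        # Ideal batching keeps t near base; purely linear scaling gives t ~= bs * base.
        print(f"batch_size={bs}: {t / base:.2f}x the batch_size=1 latency")

This prints 1.00x, 2.04x, and 4.71x, i.e., essentially linear (or slightly worse) scaling.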

Additional notes

This bug makes TRT-LLM totally unusable in production, so please treat it as P0.

ShuaiShao93 commented 1 day ago

I just tried 0.14.0 and the behavior is still the same.
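
One way to confirm which release is actually loaded when retesting across versions (tensorrt_llm exposes its version as a module attribute):

    # Verify the installed TensorRT-LLM version before rerunning the benchmark.
    import tensorrt_llm
    print(tensorrt_llm.__version__)  # e.g. "0.14.0" after the upgrade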