NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

AWQ performance issue for higher batches #1757

canamika27 opened this issue 3 months ago · Status: Open

canamika27 commented 3 months ago

System Info

I am currently testing TensorRT-LLM version 0.11.0.dev2024052800 and nvidia-modelopt version 0.11.2 on 2 x H100 GPUs.

Who can help?

I had earlier raised an issue about AWQ performance (https://github.com/NVIDIA/TensorRT-LLM/issues/1722). Following the suggestion given there, I tried AWQ with tp_size=1 and FP16 with tp_size=1 and tp_size=2 (for Llama3-8B), but I am still getting low throughput for AWQ at batch size >= 8. How can I reproduce the results shared in the TensorRT-Model-Optimizer benchmark.md (https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/benchmark.md)?

[Two screenshots: throughput charts comparing AWQ and FP16 across batch sizes]

Reproduction

Below are the commands used to build the FP16 engine with tp_size=1:

python convert_checkpoint.py --model_dir /code/tensorrt_llm/Meta-Llama-3-8B-Instruct/ --output_dir ./tllm_checkpoint_1gpu_tp1 --dtype float16 --tp_size 1

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_tp1 --output_dir ./tmp/llama/8B/trt_engines/fp16/1-gpu/ --gemm_plugin float16 --paged_kv_cache disable --context_fmha enable --gpt_attention_plugin float16 --max_batch_size 128 --max_input_len 2048 --max_output_len 2048

Below are the commands used to build the FP16 engine with tp_size=2:

python convert_checkpoint.py --model_dir /code/tensorrt_llm/Meta-Llama-3-8B-Instruct/ --output_dir ./tllm_checkpoint_2gpu_tp2 --dtype float16 --tp_size 2

trtllm-build --checkpoint_dir ./tllm_checkpoint_2gpu_tp2 --output_dir ./tmp/llama/8B/trt_engines/fp16/2-gpu/ --gemm_plugin float16 --paged_kv_cache disable --context_fmha enable --gpt_attention_plugin float16 --max_batch_size 128 --max_input_len 2048 --max_output_len 2048

Below are the commands used to build the AWQ engine with tp_size=1:

python quantize.py --model_dir /code/tensorrt_llm/Meta-Llama-3-8B-Instruct/ --dtype float16 --qformat int4_awq --awq_block_size 128 --output_dir ./quantized_int4-awq --calib_size 32 --tp_size 1

trtllm-build --checkpoint_dir ./quantized_int4-awq --output_dir ./tmp/llama/8B/trt_engines/int4_AWQ/1-gpu/ --gemm_plugin auto --paged_kv_cache disable --context_fmha enable --max_batch_size 128 --max_input_len 2048 --max_output_len 2048
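
As a quick sanity check of the AWQ engine before benchmarking, it can be exercised with the example runner in the repo. This is only a rough sketch; the script path relative to examples/llama and the tokenizer directory below are assumptions based on a typical checkout, not the exact commands from my run:

# Hypothetical smoke test: generate a few tokens from the int4-AWQ engine
python ../run.py --engine_dir ./tmp/llama/8B/trt_engines/int4_AWQ/1-gpu/ --tokenizer_dir /code/tensorrt_llm/Meta-Llama-3-8B-Instruct/ --input_text "Hello, how are you?" --max_output_len 64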

For benchmarking, I ran the C++ benchmark using the bash script shared in the repo.
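
For reference, the static-batching C++ benchmark is typically invoked along these lines. This is a sketch only, not the exact script I used; the binary path, batch size, and input/output lengths below are placeholders and the flags may differ between TensorRT-LLM versions:

# Hypothetical run of the C++ session benchmark against the int4-AWQ engine at batch size 8
./cpp/build/benchmarks/gptSessionBenchmark --engine_dir ./tmp/llama/8B/trt_engines/int4_AWQ/1-gpu/ --batch_size 8 --input_output_len "2048,2048"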

Expected behavior

AWQ should have higher throughput than FP16 at batch size >= 8, as reported in the TensorRT-Model-Optimizer benchmark results: https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/benchmark.md

actual behavior

Up to batch size 4, AWQ performs better than FP16, but beyond that the performance degrades.

additional notes

N/A

nv-guomingz commented 3 months ago

Hi @Tracin, would you please take a look at this first?

canamika27 commented 3 months ago

@Tracin / @nv-guomingz -- any update?

nv-guomingz commented 2 months ago

Thanks for your patience. We've assigned a dedicated engineer to this issue.

Tracin commented 2 months ago

@canamika27 Sorry for the late response. What does the y-axis stand for? Is it tokens per second? Do you have the raw data?