Open canamika27 opened 3 months ago
Hi @Tracin, would you please take a look at it first?
@Tracin / @nv-guomingz -- Any update?
Thanks for your patience. We've assigned a dedicated engineer to this issue.
@canamika27 Sorry for the late response. What does the y-axis stand for? Is it tokens per second? Do you have the raw data?
System Info
I am currently testing TensorRT-LLM version 0.11.0.dev2024052800 and nvidia-modelopt version 0.11.2 on 2 x H100 GPUs.
Who can help?
I had earlier raised an issue about AWQ performance (https://github.com/NVIDIA/TensorRT-LLM/issues/1722). Following the suggestion given there, I tried AWQ with tp_size=1, and FP16 with tp_size=1 and tp_size=2 (for Llama3-8B), but I am still getting low throughput for AWQ at batch size >= 8. How can I reproduce the results shared in the TensorRT-Model-Optimizer benchmark.md (https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/benchmark.md)?
Information
Tasks
Reproduction
Below is the command used to build the engine in FP16 with tp_size=1:
python convert_checkpoint.py --model_dir /code/tensorrt_llm/Meta-Llama-3-8B-Instruct/ --output_dir ./tllm_checkpoint_1gpu_tp1 --dtype float16 --tp_size 1
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_tp1 --output_dir ./tmp/llama/8B/trt_engines/fp16/1-gpu/ --gemm_plugin float16 --paged_kv_cache disable --context_fmha enable --gpt_attention_plugin float16 --max_batch_size 128 --max_input_len 2048 --max_output_len 2048
Below is the command used to build the engine in FP16 with tp_size=2:
python convert_checkpoint.py --model_dir /code/tensorrt_llm/Meta-Llama-3-8B-Instruct/ --output_dir ./tllm_checkpoint_2gpu_tp2 --dtype float16 --tp_size 2
trtllm-build --checkpoint_dir ./tllm_checkpoint_2gpu_tp2 --output_dir ./tmp/llama/8B/trt_engines/fp16/2-gpu/ --gemm_plugin float16 --paged_kv_cache disable --context_fmha enable --gpt_attention_plugin float16 --max_batch_size 128 --max_input_len 2048 --max_output_len 2048
Below is the command used to build the engine with AWQ and tp_size=1:
python quantize.py --model_dir /code/tensorrt_llm/Meta-Llama-3-8B-Instruct/ --dtype float16 --qformat int4_awq --awq_block_size 128 --output_dir ./quantized_int4-awq --calib_size 32 --tp_size 1
trtllm-build --checkpoint_dir ./quantized_int4-awq --output_dir ./tmp/llama/8B/trt_engines/int4_AWQ/1-gpu/ --gemm_plugin auto --paged_kv_cache disable --context_fmha enable --max_batch_size 128 --max_input_len 2048 --max_output_len 2048
For benchmarking, I ran the C++ benchmark using the bash script shared in the repo.
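For reference, a batch-size sweep over the built engines can be sketched as below. This is a dry run (echo prints each command instead of executing it); the gptSessionBenchmark binary path and flags are assumptions based on the TensorRT-LLM benchmarks/cpp directory, not necessarily the exact script used here:

```shell
# Dry-run sketch: print the benchmark command for each batch size.
# Binary path and flags are assumptions -- adjust to your checkout and engine dir.
for bs in 1 2 4 8 16 32 64 128; do
  echo ./benchmarks/cpp/gptSessionBenchmark \
    --engine_dir ./tmp/llama/8B/trt_engines/int4_AWQ/1-gpu/ \
    --batch_size "$bs" \
    --input_output_len "2048,2048"
done
```

Removing the echo would run the sweep for real; repeating it against the FP16 engine directory gives the comparison at each batch size.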
Expected behavior
AWQ should have higher throughput than FP16 at batch size >= 8, as reported in the TensorRT-Model-Optimizer benchmark results: https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/benchmark.md
Actual behavior
Up to batch size 4, AWQ performs better than FP16, but beyond that its throughput degrades below FP16.
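To make the comparison concrete, throughput here means generated tokens per second, i.e. batch_size * output_len / latency. A minimal sketch of the computation, using a placeholder latency purely for illustration (not a measured value), is:

```shell
# tokens/sec = batch_size * output_len / latency_s
# The 2.0 s latency below is a placeholder -- substitute the raw latencies
# reported by the C++ benchmark for AWQ and FP16 at each batch size.
tokens_per_second() {
  awk -v bs="$1" -v len="$2" -v lat="$3" 'BEGIN { print bs * len / lat }'
}
tokens_per_second 8 128 2.0   # prints 512
```

Computing this from the raw benchmark logs for both engines at each batch size would show exactly where the AWQ curve crosses below FP16.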
Additional notes
N/A