NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Performance of W4A8 throughput on Hopper GPU. #2300

Open zkf331 opened 1 month ago

zkf331 commented 1 month ago

System Info

- CPU: Intel(R) Xeon(R) Platinum 8468
- GPU: NVIDIA H800 80GB
- TensorRT-LLM version: 0.12.0

Who can help?

@Tracin @byshiue

Reproduction

I followed the official procedure for Llama 2 7B quantization and compared the throughput of W4A8, FP8, and FP16 (the FP8 build command is shown below).

```shell
# Generate a synthetic dataset with fixed 128-token inputs and 128-token outputs.
python3 benchmarks/cpp/prepare_dataset.py --stdout --tokenizer meta-llama/Llama-2-7b-hf token-norm-dist --input-mean 128 --output-mean 128 --input-stdev 0 --output-stdev 0 --num-requests 3000 > ./benchmarks/datasets/synthetic_128_128.txt

# Build the engine (FP8 case shown here).
trtllm-bench --model meta-llama/Llama-2-7b-hf build --tp_size 1 --quantization FP8 --dataset ./benchmarks/datasets/synthetic_128_128.txt

# Run the throughput benchmark against the built engine.
trtllm-bench --model meta-llama/Llama-2-7b-hf throughput --dataset ./benchmarks/datasets/synthetic_128_128.txt --engine_dir /tmp/meta-llama/Llama-2-7b-hf/tp_1_pp_1
```
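
The W4A8 and FP16 build commands are not shown above; presumably only the quantization flag changes. A minimal sketch, assuming `trtllm-bench build` in 0.12 accepts `W4A8_AWQ` as a `--quantization` value and defaults to FP16 when the flag is omitted:

```shell
# Hypothetical W4A8 build, mirroring the FP8 command above (assumption:
# W4A8_AWQ is a valid --quantization value for this trtllm-bench version).
trtllm-bench --model meta-llama/Llama-2-7b-hf build --tp_size 1 --quantization W4A8_AWQ --dataset ./benchmarks/datasets/synthetic_128_128.txt

# FP16 baseline: same build with no --quantization flag (assumption).
trtllm-bench --model meta-llama/Llama-2-7b-hf build --tp_size 1 --dataset ./benchmarks/datasets/synthetic_128_128.txt
```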

Results

Throughput (tokens/sec):

| Model | Input/Output Lengths | TP | FP8 | W4A8_AWQ | FP16 |
|---|---|---|---|---|---|
| llama-2-7b | 128/128 | 1 | 18758 | 10146 | 11116 |

The throughput of W4A8_AWQ is lower than FP16 and much lower than FP8. Is this caused by the testing method, or by lower performance of the W4A8_AWQ compute kernels?

anaivebird commented 4 days ago

What are your quantize.py and trtllm-build commands?
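
For reference, the usual two-step flow being asked about looks roughly like the sketch below. This is not zkf331's actual command line; paths and the calibration size are placeholders, and the flag names follow examples/quantization/quantize.py as of 0.12, so treat them as assumptions to verify against your local copy:

```shell
# Hypothetical sketch, not the reporter's actual commands.
# 1) Quantize the HF checkpoint into a W4A8_AWQ TensorRT-LLM checkpoint
#    (qformat value is an assumption; check quantize.py --help).
python3 examples/quantization/quantize.py \
    --model_dir meta-llama/Llama-2-7b-hf \
    --dtype float16 \
    --qformat w4a8_awq \
    --calib_size 512 \
    --output_dir ./ckpt_llama2_7b_w4a8_awq

# 2) Build the TensorRT engine from the quantized checkpoint.
trtllm-build \
    --checkpoint_dir ./ckpt_llama2_7b_w4a8_awq \
    --output_dir ./engine_llama2_7b_w4a8_awq
```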