NVIDIA / TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
https://nvidia.github.io/TensorRT-LLM
Apache License 2.0

Performance of W4A8 throughput on Hopper GPU. #2300

Open zkf331 opened 1 month ago

zkf331 commented 1 month ago

System Info

- CPU: Intel(R) Xeon(R) Platinum 8468
- GPU: NVIDIA H800 80GB
- TensorRT-LLM version: 0.12.0

Who can help?

@Tracin @byshiue

Reproduction

I followed the official procedure for Llama 2 7B quantization and compared the throughput of W4A8, FP8, and FP16 (the FP8 build command is shown below).

```shell
# Generate a synthetic dataset with fixed 128-token inputs and 128-token outputs.
python3 benchmarks/cpp/prepare_dataset.py --stdout --tokenizer meta-llama/Llama-2-7b-hf token-norm-dist --input-mean 128 --output-mean 128 --input-stdev 0 --output-stdev 0 --num-requests 3000 > ./benchmarks/datasets/synthetic_128_128.txt

# Build the engine (FP8 case shown here).
trtllm-bench --model meta-llama/Llama-2-7b-hf build --tp_size 1 --quantization FP8 --dataset ./benchmarks/datasets/synthetic_128_128.txt

# Run the throughput benchmark against the built engine.
trtllm-bench --model meta-llama/Llama-2-7b-hf throughput --dataset ./benchmarks/datasets/synthetic_128_128.txt --engine_dir /tmp/meta-llama/Llama-2-7b-hf/tp_1_pp_1
```
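
The W4A8 and FP16 build commands are not shown above; presumably only the quantization flag changes. A minimal sketch, assuming `trtllm-bench build` in 0.12 accepts `W4A8_AWQ` as a `--quantization` value and defaults to FP16 when the flag is omitted:

```shell
# Hypothetical W4A8 build, mirroring the FP8 command above (assumption:
# W4A8_AWQ is a valid --quantization value for this trtllm-bench version).
trtllm-bench --model meta-llama/Llama-2-7b-hf build --tp_size 1 --quantization W4A8_AWQ --dataset ./benchmarks/datasets/synthetic_128_128.txt

# FP16 baseline: same build with no --quantization flag (assumption).
trtllm-bench --model meta-llama/Llama-2-7b-hf build --tp_size 1 --dataset ./benchmarks/datasets/synthetic_128_128.txt
```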

Results

Throughput (tokens/sec):

| Model | Input/Output Lengths | TP | FP8 | W4A8_AWQ | FP16 |
|---|---|---|---|---|---|
| llama-2-7b | 128/128 | 1 | 18758 | 10146 | 11116 |

The throughput of W4A8_AWQ is lower than FP16 and much lower than FP8. Is this caused by the testing method, or by lower performance of the W4A8_AWQ compute kernels?

anaivebird commented 4 days ago

What are your quantize.py and trtllm-build commands?
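
For reference, the usual two-step flow being asked about looks roughly like the sketch below. This is not zkf331's actual command line; paths and the calibration size are placeholders, and the flag names follow examples/quantization/quantize.py as of 0.12, so treat them as assumptions to verify against your local copy:

```shell
# Hypothetical sketch, not the reporter's actual commands.
# 1) Quantize the HF checkpoint into a W4A8_AWQ TensorRT-LLM checkpoint
#    (qformat value is an assumption; check quantize.py --help).
python3 examples/quantization/quantize.py \
    --model_dir meta-llama/Llama-2-7b-hf \
    --dtype float16 \
    --qformat w4a8_awq \
    --calib_size 512 \
    --output_dir ./ckpt_llama2_7b_w4a8_awq

# 2) Build the TensorRT engine from the quantized checkpoint.
trtllm-build \
    --checkpoint_dir ./ckpt_llama2_7b_w4a8_awq \
    --output_dir ./engine_llama2_7b_w4a8_awq
```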