System Info
CPU: Intel(R) Xeon(R) Platinum 8468
GPU: NVIDIA H800 80GB
TensorRT-LLM version: 0.12.0
Who can help?
@Tracin @byshiue
Reproduction
I followed the official Llama 2 7B quantization procedure and compared the throughput of W4A8_AWQ, FP8, and FP16 engines, roughly as sketched below.
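A minimal sketch of the quantization and build steps, assuming the examples/quantization/quantize.py plus trtllm-build flow from the Llama example; model and output paths are placeholders and the exact flags may differ slightly from what I actually ran:

```bash
# Quantize Llama 2 7B to W4A8_AWQ (paths are placeholders)
python examples/quantization/quantize.py \
    --model_dir ./llama-2-7b-hf \
    --dtype float16 \
    --qformat w4a8_awq \
    --awq_block_size 128 \
    --calib_size 512 \
    --output_dir ./ckpt_w4a8_awq

# Build the TensorRT engine from the quantized checkpoint
trtllm-build \
    --checkpoint_dir ./ckpt_w4a8_awq \
    --output_dir ./engine_w4a8_awq \
    --gemm_plugin auto

# The FP8 and FP16 engines were produced the same way,
# using --qformat fp8 for the FP8 run and no quantization for the FP16 run.
```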
Results
The throughput of W4A8_AWQ is lower than FP16 and much lower than FP8. Is this caused by my testing method, or do the W4A8_AWQ compute kernels simply deliver lower performance?