HandH1998 / QQQ

QQQ is an innovative and hardware-optimized W4A8 quantization solution.
https://arxiv.org/pdf/2406.09904

Condition to achieve linear speedup? #15

Open jiwonsong-dev opened 1 week ago

jiwonsong-dev commented 1 week ago

I tested the latency of the QuantLinear forward pass with various input and feature sizes. For token counts from 1 to 1024, I cannot see any speedup over the AWQ W4A16 kernel, and the results are worse than PyTorch FP16 Linear in most cases. I tested weight sizes (4096, 4096), (5120, 5120), (6656, 6656), and (8192, 8192), which match the linear layer sizes of the LLaMA model family, on A6000 and RTX 3090 GPUs. I see that the experiments in the paper were run on an A100 GPU. Is there a specific setting or condition needed to see speedups that align with the results in the paper? A minimal sketch of the kind of benchmark I ran is below.
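
For reference, here is a minimal sketch of such a latency benchmark, assuming a plain PyTorch setup and showing only the FP16 `nn.Linear` baseline; the QuantLinear / AWQ forward calls would be substituted into the same timing loop.

```python
# Minimal latency benchmark sketch (FP16 nn.Linear baseline only; assumption,
# not the exact script used above). QuantLinear / AWQ calls go in the same loop.
import torch

def bench(fn, warmup=20, iters=100):
    # Time a CUDA op with events, averaged over `iters` runs (ms per call).
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

for n, k in [(4096, 4096), (5120, 5120), (6656, 6656), (8192, 8192)]:
    linear = torch.nn.Linear(k, n, bias=False, dtype=torch.float16, device="cuda")
    for m in [1, 16, 64, 256, 1024]:
        x = torch.randn(m, k, dtype=torch.float16, device="cuda")
        ms = bench(lambda: linear(x))
        print(f"FP16 Linear M={m} N={n} K={k}: {ms:.3f} ms")
```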

jiwonsong-dev commented 6 days ago

The overhead of activation quantization with plain PyTorch operations is substantial, but even the kernel itself is slower than nn.Linear in most cases.

HandH1998 commented 6 days ago

@jiwonsong-dev QuantLinear performs online activation quantization with plain PyTorch operations, which is very slow. The GEMM speedup in our paper is measured without activation quantization. If you want to reproduce the speedup, please refer to https://github.com/HandH1998/QQQ/issues/2#issuecomment-2179921604. By the way, in our vLLM PR the activation quantization is fused into an element-wise kernel such as RMSNorm, so it does not affect inference speed much.
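
To illustrate where the unfused overhead comes from, here is a sketch of naive per-token symmetric INT8 activation quantization in plain PyTorch (a hypothetical illustration of the general approach, not the exact code in QuantLinear): each step launches its own kernel and reads/writes the full activation tensor, which is exactly the cost that fusing into an element-wise kernel like RMSNorm avoids.

```python
# Sketch of unfused per-token symmetric INT8 activation quantization
# (illustrative assumption, not the repo's exact implementation).
import torch

def quantize_activation_per_token(x: torch.Tensor):
    # x: (tokens, hidden) in FP16. Each op below is a separate CUDA kernel
    # over the whole tensor, so the launches and memory traffic add up.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / 127.0
    x_q = torch.round(x / scale).clamp(-128, 127).to(torch.int8)
    return x_q, scale.to(torch.float32)

x = torch.randn(1024, 4096, dtype=torch.float16, device="cuda")
x_q, scale = quantize_activation_per_token(x)
```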

jiwonsong-dev commented 6 days ago

Is the kernel integrated into vLLM the same one as in this repo? I see QuantLinear being slower than nn.Linear for M from 1 to 1024 with N and K fixed to 4096, even without counting the quantization overhead.

HandH1998 commented 6 days ago

@jiwonsong-dev The kernel is the same as the one in vLLM. If there are no other operations, such as dtype conversion or reshape, in your modified QuantLinear, it should deliver performance similar to calling the GEMM kernel directly. In general, QuantLinear is only intended for simple inference in our repo. I recommend trying vLLM for practical inference.

jiwonsong-dev commented 5 days ago

I checked your fork of the Marlin repository and saw actual speedups with the benchmark code. Thank you for the kind response!

jiwonsong-dev commented 9 hours ago

Is there a specific reason why the permutation is different when packing per-channel quantized weights? The per-group path follows the original Marlin format.