Open wanzhenchn opened 1 month ago
Our team proposed QQQ (also known as W4A8), which has already been implemented in vLLM but has not been open-sourced yet. In the future we will also compare it with QServe. I believe that, just as LMDeploy chose AWQ over GPTQ for its W4A16 implementation, we usually pick the better option among similar approaches to implement. Stay tuned.
Motivation
The library https://github.com/mit-han-lab/qserve introduces a W4A8KV4 quantization method, called QoQ in the paper (https://arxiv.org/abs/2405.04532), which delivers performance gains in large-batch serving compared to other methods (such as AWQ W4A16).
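For context, W4A8KV4 means 4-bit weights, 8-bit activations, and a 4-bit KV cache. The sketch below is only a minimal illustration of the basic symmetric quantization steps implied by that notation; it is not QServe's actual QoQ implementation (which uses progressive group quantization and fuses dequantization into the GEMM kernels), and the function names are hypothetical.

```python
import torch

def quantize_weight_w4(w: torch.Tensor, group_size: int = 128):
    """Illustrative symmetric per-group 4-bit weight quantization (not QServe's kernels)."""
    out_features, in_features = w.shape
    assert in_features % group_size == 0
    w_grouped = w.reshape(out_features, in_features // group_size, group_size)
    # One scale per group; the symmetric int4 range is [-8, 7].
    scales = w_grouped.abs().amax(dim=-1, keepdim=True) / 7.0
    q = torch.clamp(torch.round(w_grouped / scales), -8, 7)
    return q.reshape(out_features, in_features), scales.squeeze(-1)

def quantize_activation_a8(x: torch.Tensor):
    """Illustrative symmetric per-token 8-bit activation quantization."""
    # One scale per token; the symmetric int8 range is [-128, 127].
    scales = x.abs().amax(dim=-1, keepdim=True) / 127.0
    q = torch.clamp(torch.round(x / scales), -128, 127)
    return q, scales
```

In the same spirit, "KV4" means the key/value cache entries are stored with 4-bit quantization (per head or per group), which is where much of the large-batch memory and bandwidth saving comes from.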
Related resources
No response
Additional context
No response