HandH1998 / QQQ

QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
https://arxiv.org/pdf/2406.09904

[New Model Supported] MiniCPM-2.4B #5

Closed RanchiZhao closed 4 months ago

RanchiZhao commented 4 months ago

I want to know whether there are plans to support MiniCPM in QQQ. Or do I need to do something specific myself to add QQQ quantization support for MiniCPM? AFAIK, MiniCPM and LLaMA are architecturally quite similar.

HandH1998 commented 4 months ago

We currently do not plan to support this model, but we welcome you to submit a pull request. You should do something like https://github.com/HandH1998/QQQ/blob/main/QQQ/smooth/models/quant_llama.py, https://github.com/HandH1998/QQQ/blob/main/QQQ/gptq/models/llama.py, and https://github.com/HandH1998/QQQ/blob/main/QQQ/smooth/export.py. These are the files you would mainly need to modify. If you have any questions, feel free to discuss them with us.
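For orientation, a minimal sketch (not the QQQ API; the function name and module paths below are assumptions) of the kind of per-layer mapping that quant_llama.py encodes and that a MiniCPM port would need to reproduce, assuming MiniCPM keeps LLaMA-style module names:

```python
# Rough sketch of the per-decoder-layer linear mapping a new model port
# typically needs: which nn.Linear modules get smoothed/quantized.
# Module paths assume LLaMA-style naming, which MiniCPM reportedly shares;
# verify against the actual MiniCPM implementation before porting.
import torch.nn as nn

def get_quantizable_linears(decoder_layer: nn.Module) -> dict:
    """Collect the linear layers of one decoder layer, keyed by a short name."""
    wanted = {
        "q_proj": "self_attn.q_proj",
        "k_proj": "self_attn.k_proj",
        "v_proj": "self_attn.v_proj",
        "o_proj": "self_attn.o_proj",
        "gate_proj": "mlp.gate_proj",
        "up_proj": "mlp.up_proj",
        "down_proj": "mlp.down_proj",
    }
    named = dict(decoder_layer.named_modules())
    # Keep only the paths that actually exist in this model's layer.
    return {short: named[path] for short, path in wanted.items() if path in named}
```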

RanchiZhao commented 4 months ago

@HandH1998 I find that MiniCPM-2.4B's architecture may not be compatible with qlinear_marlin, because one of its linear layers has an output feature size of 5760, which is not divisible by 256: "ValueError: infeatures must be divisible by 128 and outfeatures by 256."
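A quick check of the shapes against the current constraint (a sketch; the 128/256 constants come from the error message, and the 2304 hidden size for MiniCPM-2.4B is an assumption):

```python
# Reproduce the shape check implied by qlinear_marlin's error message.
def check_marlin_shapes(infeatures: int, outfeatures: int) -> None:
    if infeatures % 128 != 0 or outfeatures % 256 != 0:
        raise ValueError("infeatures must be divisible by 128 and outfeatures by 256.")

# MiniCPM-2.4B MLP projection: in_features=2304 (assumed), out_features=5760.
# 5760 % 128 == 0, but 5760 % 256 == 128, so the check fails on outfeatures.
check_marlin_shapes(infeatures=2304, outfeatures=5760)  # raises ValueError
```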

HandH1998 commented 4 months ago

I followed the shape limitations of the w4a16 kernel. But it seems those limitations can be relaxed, as https://github.com/vllm-project/vllm/blob/70c232f85a9e83421a4d9ca95e6384364271f2bc/vllm/model_executor/layers/quantization/marlin.py#L43-L47 does. I will confirm it.
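If the constraint is relaxed to something like what vLLM's Marlin config enforces, the MiniCPM shape should pass. A minimal sketch of that relaxed check; the 64/128/16 constants are assumed from vLLM's Marlin quantization config and should be confirmed against the linked lines:

```python
# Sketch of a relaxed shape check, assuming vLLM-Marlin-style constraints:
# a minimum thread granularity along out_features and in_features plus a
# tile-size divisibility requirement. Constants below are assumptions.
MIN_K_THREADS = 128  # minimum granularity along in_features (assumed)
MIN_N_THREADS = 64   # minimum granularity along out_features (assumed)
TILE_SIZE = 16       # marlin tile width (assumed)

def check_relaxed_shapes(infeatures: int, outfeatures: int) -> None:
    if infeatures % MIN_K_THREADS != 0:
        raise ValueError(f"infeatures={infeatures} must be divisible by {MIN_K_THREADS}")
    if outfeatures % MIN_N_THREADS != 0:
        raise ValueError(f"outfeatures={outfeatures} must be divisible by {MIN_N_THREADS}")
    if infeatures % TILE_SIZE != 0 or outfeatures % TILE_SIZE != 0:
        raise ValueError(f"both dims must be divisible by the tile size {TILE_SIZE}")

# Under these relaxed rules the MiniCPM-2.4B shape passes:
# 5760 % 64 == 0 and 5760 % 16 == 0.
check_relaxed_shapes(infeatures=2304, outfeatures=5760)  # no error
```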