We currently do not plan to support this model ourselves, but we welcome a pull request. You would mainly need to add MiniCPM counterparts of https://github.com/HandH1998/QQQ/blob/main/QQQ/smooth/models/quant_llama.py and https://github.com/HandH1998/QQQ/blob/main/QQQ/gptq/models/llama.py, and update https://github.com/HandH1998/QQQ/blob/main/QQQ/smooth/export.py accordingly. If you have any questions, feel free to discuss them with us.
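Concretely, the usual pattern in those files is to replace each `nn.Linear` in the attention and MLP blocks with the repo's quantized linear. Below is a minimal, hypothetical sketch of that pattern; the `quant_cls` class and its keyword arguments are placeholders, not QQQ's actual API, so take the real names from `quant_llama.py` and mirror them for MiniCPM's module names (`q_proj`/`k_proj`/`v_proj`/`o_proj`, `gate_proj`/`up_proj`/`down_proj`).

```python
# Hypothetical sketch only: swap every nn.Linear in a MiniCPM model for a
# quantized linear, the same way quant_llama.py does for LLaMA. The
# quant_cls argument stands in for QQQ's real quantized-linear class.
import torch.nn as nn

def replace_linear_with_quant(module: nn.Module, quant_cls, **quant_kwargs):
    """Recursively replace nn.Linear submodules with a quantized linear layer."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, quant_cls(
                child.in_features, child.out_features,
                bias=child.bias is not None, **quant_kwargs))
        else:
            replace_linear_with_quant(child, quant_cls, **quant_kwargs)
```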
@HandH1998
I find that MiniCPM-2.4B's architecture may not be compatible with qlinear_marlin, because the output features of some of its linear layers are 5760, which is not divisible by 256:
"ValueError: infeatures
must be divisible by 128 and outfeatures
by 256."
I followed the limitations of the w4a16 kernel, but it seems the limitations can be relaxed, as https://github.com/vllm-project/vllm/blob/70c232f85a9e83421a4d9ca95e6384364271f2bc/vllm/model_executor/layers/quantization/marlin.py#L43-L47 does. I will confirm it.
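For reference, a quick arithmetic check of the two constraint sets: the 128/256 requirement is the one in the error above, while the linked vLLM Marlin config appears to use smaller per-thread tile minimums (128 for `in_features`, 64 for `out_features`). The hidden size of 2304 below is assumed from the public MiniCPM-2B config; 5760 is the output size quoted in this thread.

```python
# Check the 2304 x 5760 projection shape against both constraint sets.
def fits_marlin(in_features: int, out_features: int,
                min_k: int = 128, min_n: int = 256) -> bool:
    return in_features % min_k == 0 and out_features % min_n == 0

in_f, out_f = 2304, 5760  # 2304 assumed from the MiniCPM-2B config
print(fits_marlin(in_f, out_f))            # False -> current ValueError (5760 % 256 != 0)
print(fits_marlin(in_f, out_f, min_n=64))  # True  -> relaxed vLLM-style check (5760 % 64 == 0)
```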
I want to know whether MiniCPM is planned to be supported by QQQ, or whether I need to do something specific to add MiniCPM support for QQQ quantization. AFAIK, MiniCPM and LLaMA are architecturally quite similar.