Are there any plans to support quantized versions of the Qwen architecture?

xiningnlp commented 8 months ago

https://github.com/alipay/PainlessInferenceAcceleration/blob/main/pia/lookahead/examples/qwen_example.py

chenliangjyj commented 8 months ago

I tried https://huggingface.co/Qwen/Qwen-7B-Chat-Int4 and it loads fine. The only change needed is the model path at https://github.com/alipay/PainlessInferenceAcceleration/blob/23198bb4a74393d46be302f3cbefcbdf84edfa08/pia/lookahead/examples/qwen_example.py#L19

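Because that model is GPTQ-quantized, you need to install the auto-gptq and optimum dependencies. The flash-attn CUDA extensions have to be modified to fit our use case, so they are still in internal testing (the framework already includes some fused ops that provide a speedup). Later we will also merge common quantization algorithms such as SmoothQuant, GPTQ, and AWQ into the framework, covering the full pipeline from offline quantization to online inference.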
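As a rough illustration of the dependency point above, here is a minimal sketch that checks whether the two packages are importable before attempting to load the Int4 checkpoint. The helper names `missing_gptq_deps` and `load_int4_model` are hypothetical, and the loader uses the stock transformers `AutoModelForCausalLM` path rather than the model class from `qwen_example.py`, so treat it as an assumption about a typical setup, not as the framework's own loading code.

```python
import importlib.util

# Import names of the packages mentioned above (pip: auto-gptq, optimum).
REQUIRED_MODULES = ("auto_gptq", "optimum")

# Hugging Face model id from this thread; a local directory should also work.
MODEL_PATH = "Qwen/Qwen-7B-Chat-Int4"


def missing_gptq_deps(required=REQUIRED_MODULES):
    """Return the names in `required` that cannot be found for import."""
    return [name for name in required if importlib.util.find_spec(name) is None]


def load_int4_model(model_path=MODEL_PATH):
    """Hypothetical helper: load the GPTQ checkpoint with stock transformers.

    Assumes a CUDA GPU and network access; not called at import time.
    """
    missing = missing_gptq_deps()
    if missing:
        raise RuntimeError(f"Install these packages first: {', '.join(missing)}")

    # Imported lazily so the dependency check works without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_path, device_map="auto", trust_remote_code=True
    ).eval()
    return tokenizer, model
```

Calling `missing_gptq_deps()` first gives a clear message about what to install instead of a failure partway through model loading.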

Our framework is fairly compatible with transformers, so in theory transformers features can be adapted with only minor changes.

alipay / PainlessInferenceAcceleration

Are there any plans to support quantized versions of the Qwen architecture? #1