OpenCSGs / llm-inference

llm-inference is a platform for publishing and managing LLM inference, providing a wide range of out-of-the-box features for model deployment, such as a UI, RESTful API, auto-scaling, computing resource management, monitoring, and more.
Apache License 2.0

Support loading Qwen1.5-72B-Chat-GPTQ-Int4 via auto_gptq #68

Open SeanHH86 opened 5 months ago

SeanHH86 commented 5 months ago

Running Qwen1.5-72B-Chat-GPTQ-Int4 through the transformers package is much slower than running Qwen1.5-72B-Chat. The quantized model needs to be loaded with auto_gptq.

https://github.com/QwenLM/Qwen/blob/main/README_CN.md#%E6%8E%A8%E7%90%86%E6%80%A7%E8%83%BD
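For reference, a minimal sketch of loading the GPTQ-quantized model with auto_gptq instead of plain transformers, assuming a recent auto-gptq version that supports the Qwen2 architecture. The model id and generation parameters here are illustrative assumptions, not code from this repository:

```python
# Sketch: load a GPTQ-quantized Qwen model via auto_gptq rather than
# transformers' AutoModelForCausalLM. Paths/parameters are assumptions.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "Qwen/Qwen1.5-72B-Chat-GPTQ-Int4"  # assumed Hugging Face repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device_map="auto",      # spread the 72B model across available GPUs
    use_safetensors=True,   # GPTQ weights are usually shipped as safetensors
)

inputs = tokenizer("Hello, who are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```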

depenglee1707 commented 5 months ago

Please try the llama.cpp integration; see this example: https://github.com/OpenCSGs/llm-inference/blob/main/models/text-generation--Qwen1.5-7B-Chat-GGUF.yaml
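For context, the llama.cpp route runs a GGUF build of the model. A minimal sketch of what that looks like with llama-cpp-python, which llama.cpp-based integrations typically build on; the file name and parameters here are hypothetical, and the actual settings for this repo live in the linked YAML config:

```python
# Sketch: run a GGUF build of Qwen1.5 with llama-cpp-python.
# The model_path is a hypothetical local file, not from this repo.
from llama_cpp import Llama

llm = Llama(
    model_path="./qwen1_5-7b-chat-q4_k_m.gguf",  # hypothetical GGUF file
    n_gpu_layers=-1,  # offload all layers to GPU if one is available
    n_ctx=4096,       # context window size
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello, who are you?"}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```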