SJTU-IPADS / PowerInfer

High-speed Large Language Model Serving on PCs with Consumer-grade GPUs

Quantized INT4/Q4 model? #18

Closed · JianbangZ closed this issue 11 months ago

JianbangZ commented 11 months ago

How are the GGUF weights quantized to INT4? Is there a script, similar to llama.cpp's, to convert the fp16 weights to q4_0? Please share more details about the INT4 model.

YixinSong-e commented 11 months ago

Our quantization script is aligned with llama.cpp's; you can use it directly:

```bash
./build/bin/quantize --pure $PATH_TO_ORIGIN_MODEL $Q4_MODEL_NAME Q4_0
```
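For context on what `Q4_0` produces: llama.cpp-family tools quantize weights in blocks of 32 values, storing one fp16 scale plus 32 four-bit codes per block (in llama.cpp, the `--pure` flag quantizes all tensors to the same type rather than the default mixed layout). Below is a minimal NumPy sketch of that block scheme for illustration only; it is not PowerInfer's code, and rounding details differ slightly from the C reference.

```python
import numpy as np

QK4_0 = 32  # Q4_0 block size used by llama.cpp-family code


def quantize_q4_0_block(x: np.ndarray):
    """Quantize one block of 32 floats to 4-bit codes plus one fp16 scale.

    Sketch of the Q4_0 idea: the signed element with the largest
    magnitude maps to -8, the most negative 4-bit level.
    """
    assert x.size == QK4_0
    m = x[np.argmax(np.abs(x))]          # max-magnitude element, sign kept
    d = m / -8.0                         # per-block scale
    inv_d = 1.0 / d if d != 0.0 else 0.0
    # Shift to unsigned 0..15 and clamp; on disk these are packed
    # as two nibbles per byte.
    q = np.clip(np.round(x * inv_d + 8.0), 0, 15).astype(np.uint8)
    return np.float16(d), q


def dequantize_q4_0_block(d, q):
    """Reconstruct approximate floats: x ≈ (q - 8) * d."""
    return (q.astype(np.float32) - 8.0) * np.float32(d)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=QK4_0).astype(np.float32)
    d, q = quantize_q4_0_block(x)
    x_hat = dequantize_q4_0_block(d, q)
    print("max abs error:", np.max(np.abs(x - x_hat)))
```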

hodlen commented 11 months ago

Thanks for your feedback! We have added a model quantization section to the README: https://github.com/SJTU-IPADS/PowerInfer#quantization.