casper-hansen / AutoAWQ

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
https://casper-hansen.github.io/AutoAWQ/
MIT License

Can AutoAWQ support int2, int3, int8 quantization? #435

Open ArlanCooper opened 2 months ago

ArlanCooper commented 2 months ago

Can AutoAWQ support int2, int3, or int8 quantization? I see it only supports int4 quantization right now.
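
For reference, the bit width is the `w_bit` field of the `quant_config` dict passed to `model.quantize()`. The sketch below follows the usual AutoAWQ quantization flow (the model path and output directory are placeholders); as the question notes, only 4-bit is expected to work with the current kernels.

```python
# Standard AutoAWQ quantization flow; w_bit is the knob this issue is about.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder model
quant_path = "mistral-7b-instruct-awq"             # placeholder output dir

# Only w_bit=4 is expected to work with the shipped kernels.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```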

suparious commented 2 months ago

AWQ takes a different approach: it leaves part of the model un-quantized and quantizes the rest to 4-bit. From reading the whitepaper, there is no longer a benefit once you go above or below 4 bits.
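
As a rough illustration of that idea (not AutoAWQ's actual internals), the paper's trick is to scale up the weight channels that see large activations before round-to-nearest INT4 quantization, then fold the scale back out; the helper names and the simple per-row grouping below are made up for the sketch.

```python
import torch

def int4_rtn(w: torch.Tensor) -> torch.Tensor:
    # Symmetric round-to-nearest INT4 per output row (illustrative only;
    # the real method uses group-wise scales and zero points).
    scale = w.abs().amax(dim=1, keepdim=True) / 7  # INT4 range is [-8, 7]
    q = torch.clamp(torch.round(w / scale), -8, 7)
    return q * scale  # return dequantized weights for the demo

def awq_style_quant(w: torch.Tensor, act_scale: torch.Tensor, alpha: float = 0.5):
    # Input channels with large average activations get scaled up, so the
    # INT4 rounding error on them is smaller; the scale is folded back out.
    s = act_scale.pow(alpha).clamp(min=1e-4)  # per input-channel scale
    return int4_rtn(w * s) / s
```

If I read the paper right, at runtime the inverse scale is fused into the preceding operation's output rather than applied to the weights, so protecting the salient channels costs essentially nothing.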

Other formats, like EXL2 and GGUF, are oriented toward making a model fit into whatever VRAM/RAM you have available, which is great for testing, research, and experimentation. AWQ takes a more standardized approach that may be even more oriented towards production inference. In that case, you design your hardware for the model, rather than the other way around.

For example, I can plan for a 7B AWQ model to fit onto a 12GB GPU and run with nearly full context.
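The back-of-the-envelope math behind that sizing (my own rough numbers, not a measurement):

```python
# Rough VRAM estimate for a 7B model; the split between weights, KV cache
# and overhead is approximate.
params = 7e9
int4_weights_gb = params * 4 / 8 / 1e9    # ~3.5 GB at 4 bits per weight
fp16_weights_gb = params * 16 / 8 / 1e9   # ~14 GB unquantized, for comparison
print(f"INT4 weights ~{int4_weights_gb:.1f} GB vs FP16 ~{fp16_weights_gb:.1f} GB")
# On a 12 GB card that leaves roughly 8 GB for activations, the KV cache and
# runtime overhead, which is why nearly the full context window still fits.
```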

ArlanCooper commented 2 months ago

Thanks very much, but I still have a question: which paper does the whitepaper refer to? Activation-aware Weight Quantization? And why is there no longer a benefit when you go above or below 4 bits?

suparious commented 2 months ago

Yes, that paper. If I understand it correctly, although the weights are stored as INT4, the inference is still done in FP16.
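
A minimal sketch of what that W4A16 pattern means (assumed shapes and names, not the real fused kernel): the INT4 weights are dequantized with their FP16 scales and the matrix multiply itself runs in FP16.

```python
import torch

def w4a16_linear(x_fp16, q_weight, scales, zeros):
    # q_weight: unpacked INT4 values stored as int8, shape [out, in]
    # scales / zeros: FP16 quantization parameters, here simplified to [out, 1]
    # Real kernels keep the weights packed and fuse dequantization into the
    # GEMM; this only shows where the FP16 compute happens.
    w_fp16 = (q_weight.to(torch.float16) - zeros) * scales
    return x_fp16 @ w_fp16.t()  # activations and accumulation in FP16
```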

If you don't care as much about inference quality and just need to manage your memory requirements, I think the ExLlamaV2 quantization is intended for that.