casper-hansen / AutoAWQ

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
https://casper-hansen.github.io/AutoAWQ/
MIT License

Support 3-bit and 2-bit quantization with the FLUTE kernel. #564

Open · radi-cho opened this issue 2 months ago

radi-cho commented 2 months ago

Hi,

I would like to propose adding the FLUTE kernel as a backend for fast 3-bit and 2-bit quantization. I think we can use a FluteLinear module, with its corresponding 3-bit and 2-bit packing, as a new linear implementation that could then be substituted into the rest of the AutoAWQ codebase. If the maintainers believe this would be a valuable addition, I volunteer to open a pull request.
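
To make the idea concrete, here is a rough sketch of how a FluteLinear module could mirror the `from_linear()` interface that AutoAWQ's existing linear layers expose, so the quantizer could swap it in. The names, the naive per-group packing, and the dequantize-then-matmul forward pass are placeholders for illustration only, not the actual FLUTE API or layout:

```python
# Illustrative sketch only: the real FLUTE kernel packs weights into its own tiled
# format and runs a fused CUDA GEMM; here we use a naive per-group affine quantizer
# and a reference dequantize + matmul so the module interface is clear.
import torch
import torch.nn as nn


class FluteLinear(nn.Module):
    def __init__(self, w_bit, group_size, in_features, out_features, bias, dev):
        super().__init__()
        if w_bit not in (2, 3):
            raise NotImplementedError("This sketch only covers 2-bit and 3-bit weights.")
        self.w_bit = w_bit
        self.group_size = group_size if group_size != -1 else in_features
        self.in_features = in_features
        self.out_features = out_features
        n_groups = in_features // self.group_size
        # Unpacked integer storage for clarity; FLUTE would use its own packed layout.
        self.register_buffer("qweight", torch.zeros(out_features, in_features,
                                                    dtype=torch.uint8, device=dev))
        self.register_buffer("scales", torch.zeros(out_features, n_groups,
                                                   dtype=torch.float16, device=dev))
        self.register_buffer("zeros", torch.zeros(out_features, n_groups,
                                                  dtype=torch.float16, device=dev))
        self.register_buffer("bias", torch.zeros(out_features, dtype=torch.float16,
                                                 device=dev) if bias else None)

    @classmethod
    def from_linear(cls, linear, w_bit, group_size, init_only=False):
        q = cls(w_bit, group_size, linear.in_features, linear.out_features,
                linear.bias is not None, linear.weight.device)
        if init_only:
            return q
        # Naive per-group min/max quantization, purely to show the data flow.
        w = linear.weight.data.float().view(linear.out_features, -1, q.group_size)
        w_min, w_max = w.amin(dim=-1), w.amax(dim=-1)
        scale = (w_max - w_min).clamp(min=1e-5) / (2 ** w_bit - 1)
        zero = (-w_min / scale).round()
        q.scales.copy_(scale)
        q.zeros.copy_(zero)
        q_int = (w / scale.unsqueeze(-1) + zero.unsqueeze(-1)).round().clamp(0, 2 ** w_bit - 1)
        q.qweight.copy_(q_int.view(linear.out_features, -1).to(torch.uint8))
        if q.bias is not None:
            q.bias.copy_(linear.bias.data.half())
        return q

    def forward(self, x):
        # Reference path: dequantize and matmul. The real module would call the
        # fused FLUTE kernel here instead of materializing the full-precision weight.
        w = (self.qweight.float().view(self.out_features, -1, self.group_size)
             - self.zeros.float().unsqueeze(-1)) * self.scales.float().unsqueeze(-1)
        w = w.view(self.out_features, self.in_features).to(x.dtype)
        bias = self.bias.to(x.dtype) if self.bias is not None else None
        return torch.nn.functional.linear(x, w, bias)
```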

cc @HanGuo97

casper-hansen commented 2 months ago

Hi @radi-cho, I do find it interesting to add support for lower-bit quantization. The only caveat, especially for 2-bit, is that extremely low-bit quantized models may need more extensive methods for preserving quality.

HanGuo97 commented 2 months ago

Thanks for bringing that up @radi-cho and @casper-hansen!

I agree that 2-bit is a bit too "aggressive" to be useful in practice. That being said, much of the ongoing research seems to be looking at sub-4-bit quantization. In that sense, this could be useful purely as a research prototype.

NamburiSrinath commented 1 week ago

Hi @radi-cho, @HanGuo97 and @casper-hansen

Thanks for this amazing initiative. I am wondering if there's any place to look up the work being done on other bit widths (2, 3, and 8 bits, as GPTQ supports).

When I tried anything other than 4-bit, I got this error (which I believe you might already be aware of):

```
Traceback (most recent call last):
  File "/home/ubuntu/Compress_Align/compress_models_awq.py", line 18, in <module>
    model.quantize(tokenizer, quant_config=quant_config)
  File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/awq/models/base.py", line 231, in quantize
    self.quantizer.quantize()
  File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/awq/quantize/quantizer.py", line 187, in quantize
    self._apply_quant(self.modules[i], named_linears)
  File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/awq/quantize/quantizer.py", line 227, in _apply_quant
    q_linear = q_linear_module.from_linear(
  File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/awq/modules/linear/gemm.py", line 145, in from_linear
    awq_linear = cls(
  File "/home/ubuntu/anaconda3/envs/compress_align/lib/python3.9/site-packages/awq/modules/linear/gemm.py", line 93, in __init__
    raise NotImplementedError("Only 4-bit are supported for now.")
NotImplementedError: Only 4-bit are supported for now.
```
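
For reference, this is roughly the call pattern that triggers it; the model path and config values below are just an example:

```python
# Minimal reproduction sketch: AutoAWQ's GEMM linear currently rejects any w_bit
# other than 4 at construction time. Model path and config values are examples.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "mistralai/Mistral-7B-v0.1"  # example; any supported model shows the same error
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 3, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Raises NotImplementedError("Only 4-bit are supported for now.") from
# awq/modules/linear/gemm.py when the quantizer tries to build the 3-bit layer.
model.quantize(tokenizer, quant_config=quant_config)
```
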
HanGuo97 commented 1 week ago

Just want to make sure I understand the question. Are you talking about algorithms for, say, 3-bit quantization, or fused implementations of it?