IST-DASLab / marlin

FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups at medium batch sizes of up to 16-32 tokens.
Apache License 2.0

Does Marlin support zero-point quantization? #5

Open casper-hansen opened 8 months ago

casper-hansen commented 8 months ago

Dear creators of Marlin,

What a huge performance boost these kernels can bring! I'm super excited about this, as the open-source community has been lacking kernels that scale.

On to my question: does Marlin support zero-point quantization like we normally get from AutoGPTQ or AutoAWQ?

Best wishes, Casper

RonanKMcGovern commented 8 months ago

+1 - it would be great to have Marlin speed with AWQ perplexity

fergusbarratt commented 7 months ago

+1 - Marlin's great, it would be amazing to have AWQ support

dalistarh commented 7 months ago

I am a bit confused by this issue. Have you compared the PPL of Marlin models to that of AWQ? Looking at the AWQ paper, I see a Wiki2 PPL of 5.60 for LLaMA2-7B AWQ g128. The LLaMA2-7B GPTQ model released by Elias, which was produced under roughly the same parameters, has a PPL of 5.27 (against a base PPL of 5.12).

casper-hansen commented 7 months ago

Marlin used a different method for measuring perplexity, so unfortunately the two can't be compared directly.

dalistarh commented 7 months ago

Well, my point is that the post above seems to assume that the AWQ PPL is better than that of the GPTQ version used by Marlin. This might not be the case.

efrantar commented 7 months ago

Hi,

In general, my experience is that when GPTQ is tuned and configured properly (e.g., when it also uses grid-clipping), the results are extremely similar to AWQ. That being said, Marlin is a general fp16xint4 matmul kernel; at the moment it supports symmetric linear quantization, either column-wise or at groupsize 128, with fp16 scales. It does not matter how the quantized weights are produced; they could come from GPTQ, AWQ, ZeroQuant, or any other quantization method, as long as they follow Marlin's format. I think fixing the zero-point to 8 should cause AWQ to produce Marlin-compatible weights?
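
For concreteness, a minimal sketch of that format (illustrative reference code only, with a hypothetical function name; not Marlin's actual packing code, which interleaves weights for the kernel) could look like this:

```python
import torch

def quantize_marlin_style(w: torch.Tensor, groupsize: int = 128):
    """Symmetric 4-bit group quantization with the zero-point fixed at 8.

    Stored values live in [0, 15] and dequantize as (q - 8) * s, i.e.,
    symmetric linear quantization with one fp16 scale per group.
    """
    out_features, in_features = w.shape
    # Reshape into (rows, num_groups, groupsize) so each group gets one scale.
    g = w.reshape(out_features, in_features // groupsize, groupsize)
    # Symmetric fp16 scale: each group's absmax maps onto the int4 grid.
    s = g.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / 7
    # Fixed zero-point of 8, so the grid is centered and no zeros are stored.
    q = torch.clamp(torch.round(g / s) + 8, 0, 15)
    w_hat = ((q - 8) * s).reshape(out_features, in_features)  # reconstruction
    return q.reshape(out_features, in_features).to(torch.uint8), s.squeeze(-1).half(), w_hat
```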

Currently, Marlin does not support zero points. With moderately sized groups and grid-clipping (as used by AWQ or our improved GPTQ implementation), the difference between symmetric and asymmetric quantization seemed very small in my tests, maybe <= 0.01 PPL. Zero points stored in fp16 should not be too hard to support, but they are probably not worth it from an accuracy standpoint (one could use a smaller symmetric groupsize instead). Quantized zero points may bring marginal gains in some cases, but they are likely a bit tricky to support without any efficiency drop (the current version already requires quite some care to avoid unfavorable instruction ordering by the compiler in the main loop when using groups).
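
To make the symmetric-vs-asymmetric comparison concrete, here is a toy sketch (hypothetical helper, plain round-to-nearest on random Gaussian weights; the tuned GPTQ/AWQ pipelines discussed above additionally clip the grid, so real gaps can be even smaller):

```python
import torch

def group_quant_mse(w, bits=4, groupsize=128, asymmetric=False):
    """Reconstruction MSE of group quantization, with either a per-group
    min/max zero-point (asymmetric) or a fixed centered one (symmetric)."""
    qmax = 2**bits - 1
    g = w.reshape(-1, groupsize)
    if asymmetric:
        lo, hi = g.amin(-1, keepdim=True), g.amax(-1, keepdim=True)
        s = ((hi - lo) / qmax).clamp(min=1e-8)
        z = torch.round(-lo / s)                    # per-group zero-point
    else:
        s = (g.abs().amax(-1, keepdim=True) / (qmax / 2)).clamp(min=1e-8)
        z = torch.full_like(s, (qmax + 1) / 2)      # fixed zero-point (8 for 4-bit)
    q = torch.clamp(torch.round(g / s + z), 0, qmax)
    return ((q - z) * s - g).pow(2).mean().item()

w = torch.randn(4096, 4096)
print("symmetric: ", group_quant_mse(w, bits=4))
print("asymmetric:", group_quant_mse(w, bits=4, asymmetric=True))
```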

mobicham commented 6 months ago

Thanks for the amazing work @efrantar !

Regarding the zero-point: it is actually very important to have it, especially at low bit-widths. In fact, the zero-point is more important than the scaling, which is why methods like HQQ optimize for the zero-point.
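
As a quick illustration of this effect (a toy sketch on random Gaussian weights, reusing the hypothetical `group_quant_mse` helper from the sketch above; this is not the HQQ optimization itself), the symmetric/asymmetric gap widens noticeably as the bit-width drops:

```python
import torch

# At 2 bits there are only 4 grid points, so a per-group zero-point that
# shifts the grid onto each group's actual range matters far more than at 4 bits.
w = torch.randn(4096, 4096)
for bits in (4, 2):
    sym = group_quant_mse(w, bits=bits)
    asym = group_quant_mse(w, bits=bits, asymmetric=True)
    print(f"{bits}-bit MSE  symmetric={sym:.5f}  asymmetric={asym:.5f}")
```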

To give you some perspective on why the zero-point is important, I ran two experiments on wikitext with the Llama2-7B model, 2-bit quantization, and context-size=1024, using HQQ+:

If the group-size for both the scaling and the zero-point is the same, it shouldn't be too difficult to add, I think. Really looking forward to a version that supports lower group-sizes as well!
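
In that case the dequantization would look something like the following sketch (hypothetical reference code with fp16 scales and zero-points stored at the same group granularity, not an actual kernel):

```python
import torch

def dequant_grouped(q, s, z, groupsize=128):
    """Group-wise asymmetric dequantization where the scale s and the
    zero-point z share one group-size, as suggested above.

    q: (out, in) unsigned ints; s, z: (out, in // groupsize) fp16 tensors.
    """
    out_f, in_f = q.shape
    g = q.reshape(out_f, in_f // groupsize, groupsize).float()
    # Broadcast one (scale, zero) pair over each group: w = (q - z) * s.
    w = (g - z.float().unsqueeze(-1)) * s.float().unsqueeze(-1)
    return w.reshape(out_f, in_f).half()
```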