Hi there 👋

I came across a new quantization technique called Half-Quadratic Quantization (HQQ). Blog: https://mobiusml.github.io/hqq_blog/ GitHub: https://github.com/mobiusml/hqq/tree/master

This approach performs quantization to [1|2|3|4|8]-bit precision but doesn't require a calibration process (unlike AutoGPTQ and AWQ). In that sense, it's somewhat similar to Bitsandbytes: one only needs to replace the linear layers with HQQLinear.
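For context, usage looks roughly like this. This is a minimal sketch based on my reading of the hqq README; the exact class and argument names (e.g. `compute_dtype`, `group_size`) may differ between versions, so treat the details as assumptions:

```python
import torch
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

# Stand-in for one of the model's nn.Linear layers
linear = torch.nn.Linear(4096, 4096, bias=False)

# Illustrative settings: 4-bit weights with per-group quantization
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# Swap the fp16/fp32 layer for its HQQ-quantized counterpart;
# no calibration data is needed at any point
hqq_layer = HQQLinear(linear, quant_config, compute_dtype=torch.float16)
```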
Based on the metrics provided in the blog, one can see that at 4-bit precision HQQ outperforms BNB in both perplexity and VRAM usage. Depending on the model size, HQQ ties with AutoGPTQ/AWQ, but does not require any calibration whatsoever.
At 2-bit precision HQQ's perplexity isn't great, but it is much better than GPTQ's, which makes such very low precision somewhat usable.
There is one caveat, though: I haven't found any comparison of generation speed in tokens/sec between BNB, AutoGPTQ/AWQ, and HQQ.
If HQQ is at least on par, does that mean that my PR #924 is doomed?
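If it helps, this is roughly how I would run such a comparison myself. The model id and generation settings below are just placeholders (not numbers from the blog); one would load each quantized variant in turn and compare the printed figures:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "gpt2"  # placeholder: substitute the quantized checkpoint under test
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

inputs = tokenizer("The quick brown fox", return_tensors="pt").to("cuda")

# Warm-up run, then time a fixed number of generated tokens
model.generate(**inputs, max_new_tokens=8)
torch.cuda.synchronize()
start = time.perf_counter()
out = model.generate(**inputs, max_new_tokens=128)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tokens/sec")
```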
One more note: when I tried torch.compile with AutoGPTQ I got an error, and the same happened with BNB.
But, based on the docs, HQQ should be totally fine with it.
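For clarity, the pattern that errored out for me is just the standard one below. The module is a toy stand-in; in the real test the Linear layers would already have been replaced by the quantized ones:

```python
import torch
import torch.nn as nn

# Toy block standing in for a (quantized) transformer layer
class Block(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(256, 256)

    def forward(self, x):
        return torch.relu(self.fc(x))

model = Block().eval()
compiled = torch.compile(model)  # the step that failed with AutoGPTQ/BNB layers

with torch.no_grad():
    y = compiled(torch.randn(1, 256))
print(y.shape)
```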