OpenGVLab / OmniQuant

[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
MIT License

Quick Clarification Question on C4 PPL #32

Closed HanGuo97 closed 9 months ago

HanGuo97 commented 9 months ago

Hi, first of all, thanks for the amazing paper and repo --- I've learned a lot from them. The comprehensive results table has also been useful as a service to the community.

I just want to ask a very quick clarification question. The repo has two versions of the C4 evaluation ("c4" and "c4-new"). For the C4 perplexities in the paper (e.g., Table 10), could you confirm which version the numbers correspond to ("c4" or "c4-new")?

Thanks in advance for your time!

ChenMnZ commented 9 months ago

Thanks for your interest.

In the paper, the reported C4 perplexity is from c4.

HanGuo97 commented 9 months ago

Thanks for the quick confirmation!

HanGuo97 commented 9 months ago

Hi, sorry for following up on this thread with an additional clarification question:

For all W3A16 g128 entries in Table 1 and Table 10, would you mind specifying the number of bits per parameter once overheads are taken into account? Here are some calculations on my end ([1] and [2] below); it'd be great if you could confirm (or correct) them.

Thanks again for your time -- this is super useful!

[1] All of them store one FP16 scale and one 3-bit integer zero-point per group, hence the overhead (with group size 128) is (3 + 16) / 128 ≈ 0.148 bits/param.

[2] OmniQuant stores two FP16 numbers per group, hence the overhead (with group size 128) is 2 × 16 bits / 128 = 0.25 bits/param.
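For concreteness, here is the tiny sanity check behind those two numbers (just the arithmetic above; the script and variable names are mine):

```python
# Per-parameter storage overhead for W3A16 group-wise quantization (group size 128).
GROUP_SIZE = 128
WEIGHT_BITS = 3

# [1] one FP16 scale + one 3-bit integer zero-point per group
overhead_int_zp = (16 + WEIGHT_BITS) / GROUP_SIZE
print(f"[1] {overhead_int_zp:.3f} bits/param")   # -> 0.148

# [2] two FP16 numbers (scale + zero-point) per group
overhead_fp16_zp = (16 + 16) / GROUP_SIZE
print(f"[2] {overhead_fp16_zp:.3f} bits/param")  # -> 0.250
```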

ChenMnZ commented 9 months ago

Sorry for the confusion. Actually, the number of bits for OmniQuant is the same as for RTN/GPTQ/AWQ. For W3A16 g128, OmniQuant stores one FP16 scaling factor and one 3-bit integer zero-point per group.

Specifically, you can refer to https://github.com/OpenGVLab/OmniQuant/blob/834847adcee9575b89cd14ed2a3623c770743b4a/quantize/quantizer.py#L146, which shows that zero-points are converted to integers through rounding.

Additionally, OmniQuant's quantized models can also be deployed with existing AWQ/GPTQ kernels because the quantization format is exactly the same. For example, we also achieve real memory reduction with the GPTQ kernel in our code: https://github.com/OpenGVLab/OmniQuant/blob/834847adcee9575b89cd14ed2a3623c770743b4a/quantize/omniquant.py#L264
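To make the storage format concrete, here is a simplified sketch of asymmetric group-wise quantization with a rounded integer zero-point. This is a toy illustration only, not the actual code in the linked quantizer.py; the function and variable names are made up for this example:

```python
import torch

def fake_quant_per_group(w: torch.Tensor, n_bits: int = 3, group_size: int = 128):
    """Simplified asymmetric group-wise fake quantization (illustration only)."""
    qmax = 2 ** n_bits - 1
    orig_shape = w.shape
    w = w.reshape(-1, group_size)

    wmin = w.amin(dim=1, keepdim=True)
    wmax = w.amax(dim=1, keepdim=True)

    # One FP16 scale per group.
    scale = (wmax - wmin).clamp(min=1e-5) / qmax
    # Zero-point is rounded, so it can be stored as an n_bits integer per group.
    zero_point = torch.round(-wmin / scale).clamp(0, qmax)

    q = torch.clamp(torch.round(w / scale) + zero_point, 0, qmax)
    w_dequant = (q - zero_point) * scale
    return w_dequant.reshape(orig_shape)
```

With n_bits = 3 and group_size = 128, the only per-group storage beyond the 3-bit weights is one FP16 scale and one 3-bit integer zero-point, which is exactly where the (16 + 3) / 128 ≈ 0.148 bits/param overhead discussed above comes from.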

HanGuo97 commented 9 months ago

Ah I see, thanks for the clarifications!

One additional question (and likely a dumb one). I understand that OmniQuant shares the same amount of overhead as RTN/GPTQ/AWQ. But does the overhead calculation look correct to you (i.e., ≈0.148 bits/param)? I noticed that the SqueezeLLM paper [1] lists these methods at 3.24/3.25 bits/param depending on the base model size, which is a bit confusing to me. Did I miss anything?

[1] https://arxiv.org/pdf/2306.07629.pdf

ChenMnZ commented 9 months ago

Yeah, ≈0.148 bits/param is the correct overhead for W3A16 g128 asymmetric uniform quantization.

I have also noticed the bit counts reported in SqueezeLLM. In my view, the SqueezeLLM paper reports the wrong bit counts for GPTQ/AWQ.
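For reference, the effective totals under the two accountings discussed earlier in this thread (just restating the arithmetic, including the 3-bit weights themselves):

```python
# Total effective bits/param for W3A16 g128 under the two accountings above.
GROUP_SIZE = 128
print(3 + (16 + 3) / GROUP_SIZE)   # 3.1484375 (FP16 scale + 3-bit integer zero-point)
print(3 + (16 + 16) / GROUP_SIZE)  # 3.25      (two FP16 values per group)
```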

HanGuo97 commented 9 months ago

Got it; a huge thanks again for taking the time to answer those questions!!