OpenGVLab / OmniQuant

[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
MIT License

Are activations quantized on the fly? #74

Closed XA23i closed 2 months ago

XA23i commented 3 months ago

During inference, are the scales and zero points for activations adjusted dynamically according to the inputs, rather than being fixed ahead of time?

Alvant commented 3 months ago

@XA23i Yes, I believe you are right: activations really are quantized on the fly...
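
For illustration, here is a minimal PyTorch-style sketch of what on-the-fly (dynamic) activation quantization could look like; the function name and the asymmetric min/max scheme are just assumptions for this sketch, not OmniQuant's actual code:

```python
import torch

def dynamic_quantize_activation(x: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """Fake-quantize an activation tensor with a scale/zero-point pair
    computed from the tensor itself at inference time (asymmetric min/max)."""
    qmin, qmax = 0, 2 ** n_bits - 1
    x_min, x_max = x.min(), x.max()              # statistics depend on the current input
    scale = (x_max - x_min).clamp(min=1e-8) / (qmax - qmin)
    zero_point = torch.round(-x_min / scale).clamp(qmin, qmax)
    x_q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return (x_q - zero_point) * scale            # dequantized ("fake-quant") output

# Static quantization would instead reuse a scale/zero_point pair pre-computed
# on calibration data, skipping the min/max pass above at inference time.
x = torch.randn(4, 128)
x_dq = dynamic_quantize_activation(x)
```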

So, personally, I quantized only the weights (the w4a16g128 setting). Also, even if the scales and zero points were fixed, it seems that GPTQ does not support activation quantization at all...
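
For reference, a rough sketch of what the w4a16g128 setting amounts to: 4-bit weights with per-group scales (group size 128) computed once offline, while activations stay in 16-bit. This is only an illustration under those assumptions, not the repo's actual packing or kernel code:

```python
import torch

def quantize_weight_g128(w: torch.Tensor, n_bits: int = 4, group_size: int = 128) -> torch.Tensor:
    """Asymmetric per-group fake-quantization of a weight matrix (out_features, in_features).
    Scales and zero points are computed once, offline -- nothing depends on the inputs."""
    qmax = 2 ** n_bits - 1
    out_f, in_f = w.shape
    assert in_f % group_size == 0
    w_g = w.reshape(out_f, in_f // group_size, group_size)
    w_min = w_g.amin(dim=-1, keepdim=True)
    w_max = w_g.amax(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    zero = torch.round(-w_min / scale)
    w_q = torch.clamp(torch.round(w_g / scale) + zero, 0, qmax)
    return ((w_q - zero) * scale).reshape(out_f, in_f)

w = torch.randn(4096, 4096, dtype=torch.float16)
w_dq = quantize_weight_g128(w.float()).half()
# At inference: y = x_fp16 @ w_dq.T  (activations stay 16-bit, hence the "a16")
```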

XA23i commented 3 months ago

Thanks. By the way, if activations are quantized on the fly, how does the latency compare to quantization with fixed activation scales?

ChenMnZ commented 2 months ago

@XA23i We only explore the deployment of weight-only quantization in this study, because W4A4 and W6A6 quantization lacks out-of-the-box hardware support.

Alvant commented 2 months ago

@XA23i I also suspect that the main reason on-the-fly activation quantization might not be a good idea is hardware specifics. It also definitely introduces more overhead than quantization with fixed scales and zero points, but the exact overhead depends on how the quantization statistics are acquired (for example, it might be a simple min/max computation as in OmniQuant, or one could search for an optimal scale for each activation with a grid search).
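
To make that overhead comparison concrete, here is a hedged sketch of the two ways of obtaining a per-tensor scale at runtime that I mentioned: a cheap min/max reduction versus a grid search over clipping ratios. The grid-search variant is only an illustrative possibility, not something OmniQuant does at inference:

```python
import torch

def minmax_scale(x: torch.Tensor, n_bits: int = 8) -> torch.Tensor:
    """Cheap option: one min/max reduction per tensor."""
    qmax = 2 ** n_bits - 1
    return (x.max() - x.min()).clamp(min=1e-8) / qmax

def grid_search_scale(x: torch.Tensor, n_bits: int = 8, steps: int = 20) -> torch.Tensor:
    """Pricier option: try several clipping ratios and keep the one with the
    lowest reconstruction error -- noticeably more work per activation tensor."""
    qmax = 2 ** n_bits - 1
    best_scale, best_err = None, float("inf")
    for ratio in torch.linspace(0.5, 1.0, steps):
        x_min, x_max = x.min() * ratio, x.max() * ratio
        scale = (x_max - x_min).clamp(min=1e-8) / qmax
        zero = torch.round(-x_min / scale)
        x_q = torch.clamp(torch.round(x / scale) + zero, 0, qmax)
        err = ((x_q - zero) * scale - x).pow(2).mean()
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale
```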

Just in case, there is also an explanation of the difference between on-the-fly (dynamic) quantization and static quantization here: https://huggingface.co/docs/optimum/concept_guides/quantization (search for "Post training dynamic quantization").

XA23i commented 2 months ago

I see, thank you.