Closed XA23i closed 2 months ago
@XA23i Yes, I believe you are right. Activations are really quantized on the fly...
So, personally, I quantized only the weights (the w4a16g128 setting). Also, even if the scales and zero points were fixed, it seems that GPTQ does not support activation quantization at all...
Thanks. By the way, if activations are quantized on the fly, how does the latency compare to fixed (static) activation quantization?
@XA23i We only explore the deployment of weight-only quantization in this study, because W4A4 and W6A6 quantization methods lack out-of-the-box hardware support.
@XA23i I also suspect that the main reason activation quantization on the fly might not be a good idea is hardware specifics. It definitely introduces more overhead than quantization with fixed scales and zero points, but the exact overhead depends on how the quantization statistics are acquired (for example, a simple min/max computation as in OmniQuant, or a grid search for the optimal scale of each activation).
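To make the overhead concrete, here is a minimal NumPy sketch of dynamic (on-the-fly) min/max quantization. The function names are my own, not from GPTQ or OmniQuant; the point is just that the scale and zero point are recomputed from every input tensor, which is exactly the extra work that static quantization avoids:

```python
import numpy as np

def dynamic_quantize(x, n_bits=8):
    """Quantize an activation tensor on the fly.

    The min/max statistics are recomputed for every input tensor,
    so the (scale, zero_point) pair changes from input to input.
    """
    qmax = 2 ** n_bits - 1
    x_min, x_max = x.min(), x.max()            # per-input statistics (the overhead)
    scale = (x_max - x_min) / qmax             # asymmetric scale
    zero = np.round(-x_min / scale)            # zero point
    q = np.clip(np.round(x / scale + zero), 0, qmax).astype(np.uint8)
    return q, scale, zero

def dequantize(q, scale, zero):
    """Map integer codes back to approximate float values."""
    return (q.astype(np.float32) - zero) * scale

np.random.seed(0)
# Two inputs with different ranges produce two different (scale, zero) pairs:
a = np.random.randn(4, 16).astype(np.float32)
b = 5.0 * np.random.randn(4, 16).astype(np.float32)
qa, sa, za = dynamic_quantize(a)
qb, sb, zb = dynamic_quantize(b)
```

With static quantization, `scale` and `zero` would instead be calibrated once offline and reused for all inputs, removing the per-input min/max pass at the cost of possible clipping when an input's range differs from the calibration data.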
Just in case, there is also an explanation of the difference between on-the-fly (dynamic) quantization and static quantization here: https://huggingface.co/docs/optimum/concept_guides/quantization (search for "Post training dynamic quantization").
I see, thank you.
During inference, the activation scales and zero points are adjusted dynamically according to the inputs; they never get fixed. Is that right?