I think the code is right. All of `query_states`, `key_states`, and `value_states` are quantized before matrix multiplication.
You can print their shapes before quantization for a more direct and clear observation.
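
For reference, here is a minimal sketch of what such per-token quantization typically looks like; the name `quantize_per_token` and the symmetric int8 scheme are my assumptions for illustration, not necessarily this repo's exact implementation:

```python
import torch

def quantize_per_token(x: torch.Tensor):
    # One scale per token, computed over the last ([-1]) dimension
    # (symmetric int8 is assumed here purely for illustration).
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    q = torch.round(x / scale).clamp(-128, 127)
    return q, scale

# (batch, heads, seq_len, head_dim), as in a typical attention module
query_states = torch.randn(1, 8, 16, 64)
q_int, q_scale = quantize_per_token(query_states)
print(q_int.shape, q_scale.shape)  # ... (1, 8, 16, 64) and (1, 8, 16, 1)
```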
The matmul between `query_states` and `key_states` here applies a transpose to `key_states`, so in this case the per-token quantization of activations along the `[-1]` dimension fits well (a quick check of this is sketched below).
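
A quick sanity check of why the transpose makes per-token scales compatible (reusing the hypothetical `quantize_per_token` above): the contraction runs over the `[-1]` dimension of both operands, so the scales factor out of the sum as an outer product and a pure integer matmul followed by one rescale per output element is exact:

```python
import torch

def quantize_per_token(x):
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    return torch.round(x / scale).clamp(-128, 127), scale

q = torch.randn(16, 64)  # (seq, head_dim)
k = torch.randn(16, 64)
q_int, q_s = quantize_per_token(q)  # q_s: (16, 1)
k_int, k_s = quantize_per_token(k)  # k_s: (16, 1)

# Integer matmul first, then one rescale per output element:
int_then_scale = (q_int @ k_int.T) * (q_s @ k_s.T)
# Dequantize first, then matmul (the reference result):
dequant_matmul = (q_int * q_s) @ (k_int * k_s).T
print(torch.allclose(int_then_scale, dequant_matmul))  # True
```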
But in the matmul between the attention weights and `value_states` here, there is no transpose on `value_states` after the per-token quantization function. This produces a matmul between tensors of shape `(a, b)` and `(c, d)` where the former is quantized along the `a` dimension but the latter is quantized along the `c` dimension, which seems wrong (I mean the latter should be quantized along the `d` dimension so that the quantized matmul can really be accelerated on hardware).
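
To make the concern concrete, here is a hedged sketch (again reusing the hypothetical `quantize_per_token` above): with a per-token scale on `value_states`, the scale varies along the contraction dimension and sits inside the summation, so an integer matmul followed by a rescale cannot reproduce the dequantized result; quantizing `value_states` with one scale per `d` column instead restores the factorization:

```python
import torch

def quantize_per_token(x):
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    return torch.round(x / scale).clamp(-128, 127), scale

attn = torch.softmax(torch.randn(16, 16), dim=-1)  # shape (a, b)
v = torch.randn(16, 64)                            # shape (c, d), with b == c

a_int, a_s = quantize_per_token(attn)  # a_s: (16, 1), scale per `a` index -> fine
v_int, v_s = quantize_per_token(v)     # v_s: (16, 1), scale per `c` index -> the problem

# Reference: dequantize, then matmul. v_s[k] sits inside the sum over k,
# so no post-hoc rescale of the pure integer matmul recovers it exactly.
reference = (a_int * a_s) @ (v_int * v_s)
int_matmul = a_int @ v_int
print(torch.allclose(int_matmul * a_s * v_s.mean(), reference))  # False in general

# Quantizing v along the `d` dimension (one scale per output column) fixes it:
v_s_col = v.abs().amax(dim=0, keepdim=True).clamp(min=1e-8) / 127.0  # (1, 64)
v_int_col = torch.round(v / v_s_col).clamp(-128, 127)
factored = (a_int @ v_int_col) * (a_s * v_s_col)  # outer product of scales
reference_col = (a_int * a_s) @ (v_int_col * v_s_col)
print(torch.allclose(factored, reference_col))    # True
```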