OpenGVLab / OmniQuant

[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
MIT License
689 stars · 53 forks

Potential bug in the matmul quantization process? #38

Closed: brisker closed this issue 7 months ago

brisker commented 10 months ago

The matmul between query_states and key_states here applies a transpose to key_states:

[screenshot of the quantized query_states @ key_states (transposed) matmul code]

So in this case, per-token quantization of the activations along the [-1] dimension fits well.
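For illustration, here is a minimal PyTorch sketch (not the repository's code; the toy quantizer and shapes are my own assumptions) of why the transpose keeps per-token scales compatible with an integer matmul:

```python
import torch

def per_token_quant(x, n_bits=8):
    # symmetric toy quantizer: one scale per slice along the last dimension
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True) / qmax     # (..., tokens, 1)
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax), scale

bsz, heads, seq_len, head_dim = 1, 4, 16, 64
query_states = torch.randn(bsz, heads, seq_len, head_dim)
key_states = torch.randn(bsz, heads, seq_len, head_dim)

q_int, q_scale = per_token_quant(query_states)   # q_scale: (1, 4, 16, 1)
k_int, k_scale = per_token_quant(key_states)     # k_scale: (1, 4, 16, 1)

# The reduction runs over head_dim. After the transpose, k_scale lands on the
# columns of K^T, so output element (i, j) is scaled by q_scale[i] * k_scale[j]
# and an integer matmul can be corrected afterwards by a rank-1 scale map.
attn_int = q_int @ k_int.transpose(2, 3)                      # (1, 4, 16, 16)
attn_deq = attn_int * q_scale * k_scale.transpose(2, 3)
ref = (q_int * q_scale) @ (k_int * k_scale).transpose(2, 3)   # dequantize-then-matmul
print((attn_deq - ref).abs().max())                           # ~0 (float rounding only)
```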

But in the matmul between the attention weights and value_states here, there is no transpose on value_states after the per-token quantization function:

[screenshot of the quantized attention-weights @ value_states matmul code]

This makes the matmul happen between an (a, b) tensor and a (c, d) tensor where the former has its quantization scales along the a dimension (one scale per row) while the latter also has its scales along the c dimension (one per row), which seems wrong. I mean the latter should have its scales along the d dimension (one per column) so that the scales factor out of the accumulation and the quantized matmul can really be accelerated on hardware.
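To make the concern concrete, here is a second sketch (again not the repository's code, reusing the same assumed toy quantizer and shapes) showing that per-token scales on value_states sit on the reduction dimension of this matmul, whereas per-column scales would factor out:

```python
import torch

def per_token_quant(x, n_bits=8):
    # same toy per-token quantizer as above: one scale per slice along the last dim
    qmax = 2 ** (n_bits - 1) - 1
    scale = x.abs().amax(dim=-1, keepdim=True) / qmax
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax), scale

bsz, heads, seq_len, head_dim = 1, 4, 16, 64
attn_weights = torch.softmax(torch.randn(bsz, heads, seq_len, seq_len), dim=-1)
value_states = torch.randn(bsz, heads, seq_len, head_dim)

p_int, p_scale = per_token_quant(attn_weights)   # p_scale: (1, 4, 16, 1) -> one per output row
v_int, v_scale = per_token_quant(value_states)   # v_scale: (1, 4, 16, 1) -> one per reduction index k

# The exact dequantized product keeps v_scale inside the sum over the key index k:
#   out[i, d] = p_scale[i] * sum_k p_int[i, k] * v_int[k, d] * v_scale[k]
# so it cannot be written as (p_int @ v_int) followed by a per-row/per-column rescale.
ref = (p_int * p_scale) @ (v_int * v_scale)

# If value_states instead carried one scale per head_dim column, the scales
# would factor out of the sum, exactly as in the Q @ K^T case:
v_scale_col = value_states.abs().amax(dim=-2, keepdim=True) / 127            # (1, 4, 1, 64)
v_int_col = torch.clamp(torch.round(value_states / v_scale_col), -128, 127)
out = (p_int @ v_int_col) * p_scale * v_scale_col                            # rank-1 scale correction
ref_col = (p_int * p_scale) @ (v_int_col * v_scale_col)
print((out - ref_col).abs().max())                                           # ~0 (float rounding only)
print(ref.shape, out.shape)                                                  # both (1, 4, 16, 64)
```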

ChenMnZ commented 10 months ago

I think the code is right. All of query_states, key_states, and value_states are quantized before matrix multiplication.

You can print their shapes before quantization to get a more direct and clearer picture.
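For example, with made-up LLaMA-style dimensions the check could look like this (the real shapes depend on the model, batch size, and sequence length):

```python
import torch

# Illustrative shapes only: bsz=1, num_heads=32, seq_len=16, head_dim=128
query_states = torch.randn(1, 32, 16, 128)
key_states = torch.randn(1, 32, 16, 128)
value_states = torch.randn(1, 32, 16, 128)
attn_weights = torch.softmax(query_states @ key_states.transpose(2, 3) / 128 ** 0.5, dim=-1)

print(query_states.shape)   # torch.Size([1, 32, 16, 128]) -> per-token quant groups over head_dim
print(key_states.shape)     # torch.Size([1, 32, 16, 128]) -> per-token quant groups over head_dim
print(attn_weights.shape)   # torch.Size([1, 32, 16, 16])  -> per-token quant groups over seq_len
print(value_states.shape)   # torch.Size([1, 32, 16, 128]) -> per-token quant groups over head_dim
```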