Hi @muzi0111,
Thanks for your interest in our project.
About the assertion error, I'm assuming you are referring to L235 in quant.py. The quantization method applied to the KV-Cache is group quantization with per-head granularity, so this assertion ensures that the last dimension (which becomes the reduction dimension in group quantization) equals head_dim. The value 128 is the head_dim widely used in newly released models for efficiency reasons.
To resolve this error, I think replacing the hard-coded 128 with head_dim would be a good choice.
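
For illustration, here is a minimal sketch of the kind of per-head group quantization described above, with the assertion written against head_dim rather than a hard-coded 128. The function name and signature are hypothetical, not the actual code in quant.py:

```python
import torch

def quantize_kv_per_head(kv: torch.Tensor, head_dim: int, n_bits: int = 4):
    """Sketch of per-head group quantization: each head's feature
    dimension forms one quantization group (the reduction dim)."""
    # The assertion from the issue: the last dimension must equal
    # head_dim, since it is the reduction dimension of the group.
    # Using head_dim instead of a hard-coded 128 lifts the restriction
    # to models whose head dimension is not 128.
    assert kv.shape[-1] == head_dim, \
        f"expected last dim {head_dim}, got {kv.shape[-1]}"

    qmax = 2 ** (n_bits - 1) - 1
    # One scale per (head, token) group, reduced over head_dim.
    scale = kv.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(kv / scale), -qmax - 1, qmax)
    return q, scale
```

With head_dim taken from the model config, the same check works unchanged across models with different head dimensions.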
I attempted the W4A4 operation on the OPT-350M model and was able to obtain the corresponding results. However, after switching to the 2.7B model, I encountered a mismatch error at line 238 in quant.py. Upon printing, I found the size to be [32, 2048, 160], whereas for the 350M model it was [16, 2048, 128]. How should I resolve this error?
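
For context, here is a hypothetical reproduction of that kind of shape mismatch, assuming the failing check compares the KV tensor's last dimension against an expected head size (the actual condition at line 238 of quant.py is an assumption here):

```python
import torch

# The 350M tensors pass a check written for a last dimension of 128,
# while the 2.7B tensors arrive with a last dimension of 160 and fail.
kv_350m = torch.randn(16, 2048, 128)
kv_2_7b = torch.randn(32, 2048, 160)

for name, kv in [("350M", kv_350m), ("2.7B", kv_2_7b)]:
    try:
        assert kv.shape[-1] == 128, f"size mismatch: got {tuple(kv.shape)}"
        print(f"{name}: check passed")
    except AssertionError as e:
        print(f"{name}: {e}")  # 2.7B prints: size mismatch: got (32, 2048, 160)
```

If line 238 still assumes a size of 128 (or a group size that does not divide 160), deriving the expected size from the model config, as suggested above for L235, would be the analogous fix.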