Thank you for your nice work and for sharing the code. Grouped-query attention is used in the Mistral and Mixtral models. However, I found that the implementation in snapkv_utils.py is for multi-head attention; it may not be correct for grouped-query attention.
Thank you for your comment! Since I refactored the codebase, you can refer to the monkey patch. We changed the order of repeat_kv, so it should be correct. You can run the notebook example and test it with Mistral.
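For context on why the ordering matters: in grouped-query attention the KV cache holds only `num_kv_heads` heads, and a `repeat_kv`-style helper expands them to the full `num_attention_heads` before the attention product. Any cache compression therefore has to operate on the un-repeated KV tensor, then repeat afterwards. Below is a minimal NumPy sketch of the repeat step (the real implementation in transformers/this repo is in PyTorch; the function name and shapes here follow the common convention but are illustrative, not copied from snapkv_utils.py):

```python
import numpy as np

def repeat_kv(kv: np.ndarray, n_rep: int) -> np.ndarray:
    """Repeat each KV head n_rep times along the head axis so that
    grouped-query attention can be computed like multi-head attention.

    kv: (batch, num_kv_heads, seq_len, head_dim)
    returns: (batch, num_kv_heads * n_rep, seq_len, head_dim)
    """
    if n_rep == 1:
        return kv
    batch, num_kv_heads, seq_len, head_dim = kv.shape
    # Insert a repetition axis next to the head axis, broadcast, then
    # merge the two axes: head h of the output maps to KV head h // n_rep.
    expanded = np.broadcast_to(
        kv[:, :, None, :, :],
        (batch, num_kv_heads, n_rep, seq_len, head_dim),
    )
    return expanded.reshape(batch, num_kv_heads * n_rep, seq_len, head_dim)

# Example: Mistral-7B-like ratio, 2 KV heads serving 8 query heads (n_rep = 4).
kv = np.arange(2 * 2 * 3 * 4, dtype=np.float32).reshape(2, 2, 3, 4)
full = repeat_kv(kv, n_rep=4)
print(full.shape)  # (2, 8, 3, 4): output heads 0-3 share KV head 0, heads 4-7 share KV head 1
```

The key point for GQA correctness is that selection/eviction of cached entries is applied per KV head (axis of size `num_kv_heads`) before this expansion, not per query head after it.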