FasterDecoding / SnapKV

Grouped query attention implementation #4

Closed guozhiyu closed 7 months ago

guozhiyu commented 7 months ago

Thank you for your nice work and for sharing the code. Grouped-query attention is used in the Mistral and Mixtral models. However, I found that the implementation in snapkv_utils.py is written for multi-head attention, so it may not be correct for grouped-query attention.
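For context, the mismatch can be sketched as follows. In grouped-query attention the number of key/value heads is smaller than the number of query heads (e.g. Mistral-7B uses 32 query heads sharing 8 KV heads), so per-query-head importance scores must be aggregated within each group before they can index the KV cache. The shapes and mean-pooling aggregation below are illustrative assumptions, not SnapKV's actual code.

```python
import numpy as np

# Illustrative GQA shapes (Mistral-7B-like): 32 query heads share 8 KV heads.
num_heads, num_kv_heads, seq_len = 32, 8, 16
n_rep = num_heads // num_kv_heads  # 4 query heads per KV head

# Per-query-head importance scores for each cached position
# (e.g. pooled attention weights), shape (num_heads, seq_len).
rng = np.random.default_rng(0)
scores = rng.random((num_heads, seq_len))

# MHA-style code would pick top-k indices per query head, giving
# (num_heads, k) indices -- but the GQA KV cache only has num_kv_heads
# entries, so those indices cannot gather from it directly.
# For GQA, scores must first be aggregated within each group:
grouped = scores.reshape(num_kv_heads, n_rep, seq_len).mean(axis=1)

# Now top-k selection per KV head lines up with the cache layout.
k = 4
topk = np.argsort(grouped, axis=-1)[:, -k:]
print(grouped.shape, topk.shape)  # (8, 16) (8, 4)
```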

leeyeehoo commented 7 months ago

Thank you for your comment! Since I refactored the codebase, please refer to the monkey patch. We changed the order of repeat_kv, so it should now be correct. You can run the notebook example and test it with Mistral.
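For readers unfamiliar with the detail being discussed: repeat_kv is the helper (as in Hugging Face Transformers) that expands the num_kv_heads-sized KV cache to num_heads copies before the attention product. The sketch below, with illustrative shapes and a placeholder position selection, shows why ordering matters for cache compression: selecting positions before repeat_kv keeps one copy per KV head, whereas selecting after would operate on the already-repeated tensor.

```python
import numpy as np

def repeat_kv(x: np.ndarray, n_rep: int) -> np.ndarray:
    """Expand (batch, num_kv_heads, seq, dim) to
    (batch, num_kv_heads * n_rep, seq, dim), mirroring the
    repeat_kv helper used for grouped-query attention."""
    if n_rep == 1:
        return x
    b, h_kv, s, d = x.shape
    x = np.broadcast_to(x[:, :, None, :, :], (b, h_kv, n_rep, s, d))
    return x.reshape(b, h_kv * n_rep, s, d)

batch, num_kv_heads, n_rep, seq_len, head_dim = 1, 8, 4, 16, 64
keys = np.random.default_rng(1).random((batch, num_kv_heads, seq_len, head_dim))

# Compress first (keep k positions per KV head), then repeat.
# The selection here is a placeholder (last k positions); SnapKV
# selects positions by pooled attention scores instead.
k = 4
idx = np.arange(seq_len - k, seq_len)
compressed = keys[:, :, idx, :]            # (1, 8, 4, 64)
expanded = repeat_kv(compressed, n_rep)    # (1, 32, 4, 64)
print(compressed.shape, expanded.shape)
```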