FasterDecoding / SnapKV


why not use the last token for kv cache compression #25

Open Arist12 opened 2 days ago

Arist12 commented 2 days ago

Thanks for this interesting work. I have the following question after reading it:

In the observation experiment, the hit rate is computed by:

"the overlap rates between important attention features of input sequence (those with high average attention weights) identified by each window and the actual ones used by generation."

Here, I believe "the actual ones used by generation" are derived from the attention of the last input token, which itself lies in the last window. So one straightforward question is: why not use the attention scores of the last token directly for KV cache compression?
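To make the question concrete, here is a minimal NumPy sketch of the two selection rules being compared: averaging attention over the last `window` queries versus using only the last query. Everything in it is an assumption for illustration (the function names, the random single-head attention matrix, and the plain top-k selection); it is not the repository's implementation and says nothing about which rule works better on real models.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(q, k):
    """Causal attention weights for one head; q and k have shape (seq_len, dim)."""
    seq_len, dim = q.shape
    scores = q @ k.T / np.sqrt(dim)
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1)

def select_by_window(attn, window, top_k):
    """Top-k prefix positions by attention averaged over the last `window` queries."""
    prefix_len = attn.shape[0] - window
    votes = attn[-window:, :prefix_len].mean(axis=0)
    return np.sort(np.argsort(votes)[-top_k:])

def select_by_last_token(attn, window, top_k):
    """Top-k prefix positions by the attention of the last query only."""
    prefix_len = attn.shape[0] - window
    votes = attn[-1, :prefix_len]
    return np.sort(np.argsort(votes)[-top_k:])

def overlap_rate(a, b):
    """Fraction of positions in `a` that also appear in `b`."""
    return len(set(a.tolist()) & set(b.tolist())) / len(a)

# Hypothetical toy setup: random queries and keys, not real model activations.
rng = np.random.default_rng(0)
seq_len, dim, window, top_k = 64, 16, 8, 12
q = rng.standard_normal((seq_len, dim))
keys = rng.standard_normal((seq_len, dim))
attn = causal_attention(q, keys)

win_idx = select_by_window(attn, window, top_k)
last_idx = select_by_last_token(attn, window, top_k)
print("overlap between the two selections:", overlap_rate(win_idx, last_idx))
```

With `window=1` the two rules coincide by construction, so the question amounts to asking what the extra queries in the window contribute beyond the last one.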