Thanks for this interesting work. I have the following question after reading it:
In the observation experiment, the hit rate is defined as:
"the overlap rates between important attention features of input sequence (those with high average attention weights) identified by each window and the actual ones used by generation."
Here, I believe the actual features used by generation are derived from the last input token, which is itself part of the last window. So one straightforward thought is: why not directly use the attention scores of the last token for KV cache compression?
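To make the comparison concrete, here is a minimal sketch of the two selection rules I am contrasting: picking the top-k prefix positions from attention averaged over the last window, versus picking them from the last token's attention alone. Everything in it is an assumption for illustration: the function names (`top_k_keys`, `window_scores`, `hit_rate`) are hypothetical, the attention values are invented toy numbers, and `generation_attn` is only a placeholder for "the actual ones used by generation", since how that reference is derived is exactly what I am asking about. It is not the paper's code and the printed numbers say nothing about real models.

```python
def top_k_keys(scores, k):
    """Return the set of indices of the k largest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:k])


def window_scores(attn_rows):
    """Average attention over the query rows of a window (one row per query token)."""
    n = len(attn_rows)
    return [sum(col) / n for col in zip(*attn_rows)]


def hit_rate(selected, reference):
    """Fraction of the reference positions that also appear in the selected set."""
    return len(selected & reference) / len(reference)


# Toy attention over 6 prefix positions; rows are the last 3 prompt tokens (the window).
# All values are made up for illustration.
window_attn = [
    [0.05, 0.40, 0.05, 0.30, 0.10, 0.10],
    [0.10, 0.35, 0.05, 0.35, 0.05, 0.10],
    [0.32, 0.28, 0.05, 0.05, 0.20, 0.10],  # last prompt token
]

# Placeholder for the attention actually used during generation (assumed, not measured).
generation_attn = [0.05, 0.40, 0.05, 0.35, 0.05, 0.10]

k = 2
reference = top_k_keys(generation_attn, k)
window_pick = top_k_keys(window_scores(window_attn), k)  # averaged over the window
last_pick = top_k_keys(window_attn[-1], k)               # last token only

print("window hit rate:    ", hit_rate(window_pick, reference))
print("last-token hit rate:", hit_rate(last_pick, reference))
```

If the reference set were instead computed from the last input token itself, `last_pick` would match it trivially, which is why I would like to understand how the reference is obtained in the experiment.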