AnswerDotAI / cold-compress

Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of GPT-Fast, a simple, PyTorch-native generation codebase.
https://www.answer.ai/posts/2024-08-01-cold-compress.html
BSD 3-Clause "New" or "Revised" License

SnapKV #42

Open SimJeg opened 1 month ago

SimJeg commented 1 month ago

Hello,

I see SnapKV is used for the Heavy Hitter Prompt Compression strategy. As far as I understand (correct me if I'm wrong), it is also used in the benchmarks reported in the blog post for the Heavy Hitter results.

If that's the case, I think the comparison with L2 norm, recent global, etc. is unfair. In SnapKV you use the latest tokens to filter out the KV pairs that won't be needed for the next tokens. However, this process cannot be re-used if you ask a second question to your LLM.

For instance, if you have a document and ask a first question related to the beginning of the document, SnapKV will retrieve the KV pairs at the beginning of the document. If you then ask a second question, say about the end of the document, you would have to re-run SnapKV to retrieve the KV pairs at the end of the document. Hence SnapKV does not really compress the KV cache, as opposed to, for instance, L2 norm, which permanently deletes the KV pairs. SnapKV is great for accelerating generation, not for reducing the KV cache size in memory (except if the use case is a single interaction with the LLM).
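To make this concrete, here is a rough sketch of the kind of observation-window selection I mean (the names and shapes are mine, not the Cold Compress code, and the real SnapKV also pools/clusters the scores, which I omit):

```python
import torch

def snapkv_style_selection(attn_weights, window_size, budget):
    """Pick which prompt KV pairs to keep, in the spirit of SnapKV.

    attn_weights: [num_heads, seq_len, seq_len] causal self-attention over the prompt.
    The last `window_size` query positions (the "observation window") vote on
    which earlier positions matter; the window itself is always kept.
    Assumes budget > window_size.
    """
    seq_len = attn_weights.shape[-1]
    # Attention that the observation-window queries pay to earlier positions.
    obs = attn_weights[:, -window_size:, : seq_len - window_size]  # [H, W, S-W]
    scores = obs.sum(dim=(0, 1))                                   # [S-W], aggregated over heads and window
    keep_early = scores.topk(budget - window_size).indices         # heavy hitters w.r.t. THIS window
    keep_window = torch.arange(seq_len - window_size, seq_len)
    return torch.cat([keep_early, keep_window])
```

Everything in `scores` is computed from the observation window, so whatever sits at the end of the prompt decides which earlier KV pairs survive.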

I would make this point clearer in the blog post, or use another heavy hitter strategy for prompt compression (H2O, Scissorhands, etc.).

Maybe it would be worth distinguishing the context and the task in the evals? E.g. you compress the context and then ask the question (this only works if the question comes after the context).

griff4692 commented 1 month ago

Thanks for sharing your thoughts!

We use SnapKV as our "Heavy Hitter" prompt compressor because it relies on attention for prompt evictions.

The inputs and outputs for all prompt compressors are the same.

I'm not sure I understand the difference in using SnapKV versus L2 Norm. Both filter the prompt at a single point in time.

SnapKV uses the end of the prompt (the observation window) to select the KV vectors from the beginning of the prompt, and the KV pairs at the end (the observation window itself) are also kept.

Can you clarify what you mean by having to re-run SnapKV?

I think what you are describing is a limitation of all these token dropping methods which lead to irrevocable information loss and have potentially limited utility in multi-turn long dialogues without major modifications.

SimJeg commented 1 month ago

> I think what you are describing is a limitation of all these token dropping methods

Yes and no. All pruning methods are doomed at some point because you lose information. But what makes SnapKV different is the use of the "observation" window.

If your prompt is structured as context / question (e.g. the text of a book followed by a question like "what happened in chapter 4?"), the observation window will correspond to the question, hence the attention pattern will retrieve the right KV pairs. But if you change the question (e.g. "what happened in chapter 5?"), the compression will be very different. In contrast, the pruned tokens will be almost the same for L2 norm, H2O, etc.
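For contrast, a rough sketch of a query-independent score in the spirit of the L2 norm strategy (the actual implementation may aggregate heads or pick thresholds differently):

```python
import torch

def l2_norm_selection(keys, budget):
    """Query-independent pruning: keep the `budget` tokens whose keys have the
    lowest L2 norm (low key norm tends to correlate with high attention).

    keys: [num_heads, seq_len, head_dim]. Nothing here looks at the question,
    so the kept set is the same whatever you ask afterwards.
    """
    scores = keys.norm(dim=-1).mean(dim=0)               # [seq_len], averaged over heads
    return scores.topk(budget, largest=False).indices
```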

Another way to say it is that if you apply the compression to the context only and then ask your question, I'm quite confident SnapKV will be much worse than the other eviction methods.

So to me, the only way to use SnapKV beyond benchmarks (e.g. for a chat use case, or a QA system over documents) is to re-run it for each new prompt to retrieve the relevant tokens. Generation will be faster because you don't attend to all KV pairs, but you still need to maintain all of them in memory.

For L2 norm or H2O this is not the case: you can compress only once. That's also why these techniques are likely worse... because pruning is more limited than retrieval.
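A back-of-the-envelope comparison with made-up, Llama-7B-ish numbers, just to show where the memory goes in each regime:

```python
# Toy numbers only: 32 heads, head_dim 128, 32 layers, fp16 K and V.
H, D, L = 32, 128, 32
S, budget = 32_000, 2_048                        # prompt length, kept tokens
bytes_per_token = 2 * 2 * H * D * L              # K + V, 2 bytes per element, all layers

print(f"compress once (L2/H2O):  {budget * bytes_per_token / 1e9:.1f} GB resident")
print(f"re-run SnapKV per query: {S * bytes_per_token / 1e9:.1f} GB resident "
      f"(only ~{budget * bytes_per_token / 1e9:.1f} GB attended during decoding)")
```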