AnswerDotAI / cold-compress

Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of GPT-Fast, a simple, PyTorch-native generation codebase.
https://www.answer.ai/posts/2024-08-01-cold-compress.html
BSD 3-Clause "New" or "Revised" License

Compute Heavy Hitters KV-Cache Eviction Policy #3

Closed griff4692 closed 3 months ago

griff4692 commented 5 months ago

See this section of the writeup.

Most of the work involves figuring out how best to keep track of cumulative attention scores and how to use them to compute "heavy hitters" in the KV-Cache.
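To make the bookkeeping concrete, here is a minimal sketch of one way it could work, loosely in the spirit of H2O: keep a running sum of the attention each cached token receives, and once the cache is over budget, evict the lowest-scoring token outside a protected window of recent tokens. It is a toy, not the Cold Compress or GPT-Fast implementation. It handles a single head, stores token positions in plain Python lists instead of key/value tensors, and every name in it (`HeavyHitterCache`, `max_size`, `recent_window`, `update`, `heavy_hitters`) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class HeavyHitterCache:
    """Toy single-head tracker: stores token positions, not key/value tensors."""

    max_size: int       # hypothetical budget: max number of cached tokens
    recent_window: int  # hypothetical: newest tokens that are never evicted
    positions: list = field(default_factory=list)   # cached token positions
    cum_scores: list = field(default_factory=list)  # cumulative attention per cached token

    def __post_init__(self):
        if not 0 <= self.recent_window < self.max_size:
            raise ValueError("recent_window must be in [0, max_size)")

    def update(self, position, attn_probs):
        """Cache a new token, then add the attention its query paid to each cached token.

        attn_probs has one probability per cached token, with the new token last.
        """
        self.positions.append(position)
        self.cum_scores.append(0.0)
        if len(attn_probs) != len(self.positions):
            raise ValueError("need one attention probability per cached token")
        for i, p in enumerate(attn_probs):
            self.cum_scores[i] += p
        if len(self.positions) > self.max_size:
            self._evict_one()

    def _evict_one(self):
        # Only tokens older than the recent window are eviction candidates.
        num_candidates = len(self.positions) - self.recent_window
        victim = min(range(num_candidates), key=lambda i: self.cum_scores[i])
        del self.positions[victim]
        del self.cum_scores[victim]

    def heavy_hitters(self, k):
        """Return the positions of the k cached tokens with the most cumulative attention."""
        ranked = sorted(
            range(len(self.positions)),
            key=lambda i: self.cum_scores[i],
            reverse=True,
        )
        return sorted(self.positions[i] for i in ranked[:k])


# Usage: four decoding steps with a budget of three tokens.
cache = HeavyHitterCache(max_size=3, recent_window=1)
cache.update(0, [1.0])
cache.update(1, [0.9, 0.1])
cache.update(2, [0.7, 0.1, 0.2])
cache.update(3, [0.5, 0.1, 0.3, 0.1])  # over budget: the weakest older token is evicted
print(cache.positions)
print(cache.heavy_hitters(2))
```

In a real implementation the scores would presumably live in a tensor alongside the cache, be tracked per head and per layer, and need some normalization, since older tokens have had more steps to accumulate attention. Those choices are exactly what this ticket needs to explore.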

As much as possible, let's try each algorithm described in: Scissorhands, H2O, FlexGen, and SnapKV.

By the end of this ticket, these methods should be implemented, with preliminary results comparing them.

Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers requires training, so it is out of scope for now unless you are very interested in it.