Closed griff4692 closed 3 months ago
See this section of the writeup.
Most of the work involves figuring out how best to keep track of cumulative attention scores and using them to compute "heavy hitters" in the KV-Cache.
As much as possible, let's try each algorithm described in: Scissorhands, H2O, FlexGen, and SnapKV.
This ticket should allow for these methods to be implemented and have preliminary results comparing them.
Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers requires training, so it is out of scope for now, unless you are very interested in it.
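To make the core idea concrete, here is a minimal sketch of H2O-style heavy-hitter tracking: each cached position accumulates the attention mass it receives from newly generated tokens, and when the cache exceeds a budget, the position with the smallest cumulative score is evicted. The class name, `budget` parameter, and eviction policy details are illustrative assumptions, not the project's actual API.

```python
import numpy as np

class HeavyHitterCache:
    """Hypothetical sketch: keep at most `budget` KV positions,
    ranked by cumulative attention received (H2O-style)."""

    def __init__(self, budget: int):
        self.budget = budget
        self.positions: list[int] = []   # token indices still in the cache
        self.scores: list[float] = []    # cumulative attention per cached position

    def step(self, new_pos: int, attn_weights: np.ndarray) -> None:
        """attn_weights[i] is the attention the new token pays to positions[i]."""
        # Accumulate attention mass received by each cached position.
        for i, w in enumerate(attn_weights):
            self.scores[i] += float(w)
        # The new token enters the cache with no accumulated score yet.
        self.positions.append(new_pos)
        self.scores.append(0.0)
        # If over budget, evict the position with the smallest cumulative
        # score, excluding the newest token (which has had no chance to score).
        if len(self.positions) > self.budget:
            victim = int(np.argmin(self.scores[:-1]))
            del self.positions[victim]
            del self.scores[victim]
```

The variants differ mainly in how the score is maintained: Scissorhands uses a window of recent steps, H2O accumulates over the full history, and SnapKV scores against an observation window of prompt tokens, so this single bookkeeping loop should be adaptable to each.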