bd-iaas-us / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Decouple positional encoding & persisted KV cache #30

Open JackChuang opened 2 months ago

JackChuang commented 2 months ago

Check the AttentionStore paper and evaluate whether its approach would perform well enough for our use case.

JackChuang commented 2 months ago

AttentionStore - decouple positional encoding analysis https://bytedance.larkoffice.com/wiki/DrKGwjYZ8icduzkyxyIcQ1imnXe

JackChuang commented 1 month ago

Understood how decoupling positional encoding works.
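
For reference, here is a minimal sketch of what "decoupling" means here, assuming a simplified GPT-NeoX-style RoPE helper (the `rope`, `persisted_k`, and `new_offset` names are illustrative, not vLLM APIs): the persisted cache stores raw, position-free keys, and RoPE is applied only at reuse time, with whatever positions the restored prefix lands at.

```python
import torch

def rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Simplified GPT-NeoX-style rotary embedding for x of shape [seq, heads, head_dim]."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = positions.float()[:, None] * freqs[None, :]            # [seq, half]
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]   # [seq, 1, half]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Conventional cache: keys are stored *after* RoPE, so they are tied to the
# absolute positions they were computed at and cannot be reused at a new offset.
#
# Decoupled cache (AttentionStore-style): persist the raw, position-free keys
# and apply RoPE lazily with whatever positions the reused prefix lands at.
persisted_k = torch.randn(8, 4, 64)              # pre-RoPE keys for a cached prefix
new_offset = 3                                   # prefix is restored at position 3
positions = torch.arange(new_offset, new_offset + persisted_k.shape[0])
k_for_attention = rope(persisted_k, positions)   # RoPE applied only at reuse time
```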

JackChuang commented 1 month ago

Goal: implement a working version to get numbers (quick > perfect).

JackChuang commented 2 weeks ago

Tony and I went through the experiment and the implementation together. The fastest way to get end-to-end results right now is the version that keeps both the pre- and post-RoPE KV cache, which we previously considered for a patent, so we started with that approach; the code is in this branch: https://github.com/bd-iaas-us/vllm/tree/horenc/as-patent-double-kvbuffer. My current results show that duplicating the KV cache adds about a 5% slowdown, but it avoids recomputing RoPE on every reuse. Compared to the 30-40% overhead we measured for RoPE recomputation in earlier microbenchmarks, that is a significant improvement. The other implementations are also more complex than initially anticipated and will need more in-depth work to modify:
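
For reference, a minimal sketch of the double-buffer idea, under the same simplified assumptions as the sketch in the earlier comment (the `DoubleKVBuffer` class and `restore_at` method are illustrative names, not the actual code in the branch): keep a position-free pre-RoPE copy of the keys for persistence and a post-RoPE copy that attention reads directly, so RoPE is re-applied only once when a restored prefix changes offset rather than on every step. The memory cost is the duplicated key buffer, which is the source of the ~5% slowdown mentioned above.

```python
import torch
from dataclasses import dataclass

def rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Same simplified GPT-NeoX-style RoPE helper as in the sketch above."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = positions.float()[:, None] * freqs[None, :]
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

@dataclass
class DoubleKVBuffer:
    """Keys are stored twice: a position-free copy for persistence and a RoPE'd
    copy that attention reads directly. Values carry no position information,
    so a single copy is enough."""
    k_pre_rope: torch.Tensor   # [seq, heads, head_dim], safe to persist/offload
    k_post_rope: torch.Tensor  # [seq, heads, head_dim], RoPE applied at current offset
    v: torch.Tensor            # [seq, heads, head_dim]
    offset: int                # absolute position of the first cached token

    def restore_at(self, new_offset: int) -> None:
        """Re-rotate the pre-RoPE copy once if the prefix moves to a new offset;
        otherwise the post-RoPE copy is reused as-is, with no RoPE recompute."""
        if new_offset != self.offset:
            seq = self.k_pre_rope.shape[0]
            positions = torch.arange(new_offset, new_offset + seq)
            self.k_post_rope = rope(self.k_pre_rope, positions)
            self.offset = new_offset

# Usage: build the buffer once at prefill; every later reuse reads k_post_rope
# directly, and rope() runs again only when the prefix offset changes.
seq, heads, dim = 8, 4, 64
k_raw = torch.randn(seq, heads, dim)
buf = DoubleKVBuffer(k_raw, rope(k_raw, torch.arange(seq)), torch.randn(seq, heads, dim), offset=0)
buf.restore_at(5)  # prefix re-inserted at position 5: one re-rotation, not per-step
```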