JackChuang opened 2 months ago
AttentionStore - decouple positional encoding analysis https://bytedance.larkoffice.com/wiki/DrKGwjYZ8icduzkyxyIcQ1imnXe
Understood how decoupled positional encoding works.
Goal: implement a working version to get numbers (quick > perfect).
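For context, a minimal sketch of the decoupling idea (illustrative only; `rope_rotate` is a generic interleaved-RoPE helper I wrote for this note, not code from our branch). RoPE rotates each key by an angle proportional to its position, so a key cached before rotation (pre-RoPE) can be rotated to any target position in one step; and since rotations compose, shifting an already-rotated key is just one extra delta rotation:

```python
import torch

def rope_rotate(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Interleaved RoPE: rotate x of shape (seq, dim) by per-token positions pos (seq,)."""
    dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = pos[:, None].float() * inv_freq[None, :]   # (seq, dim/2) rotation angles
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                 # the two halves of each rotated pair
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Rotations compose: moving a cached span from position 0 to 7 via one delta
# rotation matches rotating the clean (pre-RoPE) keys directly to position 7.
k = torch.randn(4, 64)
direct = rope_rotate(k, torch.arange(7, 11))
shifted = rope_rotate(rope_rotate(k, torch.arange(0, 4)), torch.full((4,), 7))
assert torch.allclose(direct, shifted, atol=1e-5)
```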
Tony and I went through the experiment and implementation together. Currently, the fastest way to get end-to-end results is the version that preserves both the pre- and post-RoPE KV cache, which we previously considered for a patent. We started with this approach; the code is in this branch: https://github.com/bd-iaas-us/vllm/tree/horenc/as-patent-double-kvbuffer. My current results show that duplicating the KV cache adds about a 5% slowdown, but it avoids recalculating RoPE on every reuse. Compared with the 30-40% overhead we observed for RoPE calculations in earlier microbenchmarks, this is a significant improvement.
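A minimal sketch of the double-KV-buffer idea as I understand it (names like `DoubleKVEntry` and `keys_for_position` are hypothetical, not the actual structures in the branch; it reuses `rope_rotate` from the sketch above): keep the post-RoPE keys for the common case where a cached span is reused at the same position, and keep the pre-RoPE keys so a hit at a different offset costs one rotation instead of a full RoPE recompute.

```python
import torch
from dataclasses import dataclass

@dataclass
class DoubleKVEntry:                  # hypothetical cache entry, for illustration
    k_pre: torch.Tensor               # keys before RoPE, (seq, dim)
    k_post: torch.Tensor              # keys after RoPE at cached_pos
    v: torch.Tensor                   # values need no positional fix-up
    cached_pos: int                   # start position k_post was built for

def make_entry(k: torch.Tensor, v: torch.Tensor, start: int) -> DoubleKVEntry:
    pos = torch.arange(start, start + k.shape[0])
    return DoubleKVEntry(k_pre=k, k_post=rope_rotate(k, pos), v=v, cached_pos=start)

def keys_for_position(entry: DoubleKVEntry, target: int) -> torch.Tensor:
    """Return post-RoPE keys as if the cached span started at position target."""
    if target == entry.cached_pos:
        return entry.k_post           # fast path: no RoPE work at all
    seq = entry.k_pre.shape[0]
    pos = torch.arange(target, target + seq)
    return rope_rotate(entry.k_pre, pos)  # one rotation from the clean copy
```

The trade-off matches the numbers above: the extra `k_pre` buffer costs memory and copy bandwidth (the ~5% slowdown), while the fast path removes the 30-40% RoPE recomputation.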
Additionally, the other implementations are more complex than initially anticipated and will require more in-depth work to modify.
Next step: check the AttentionStore paper and see whether its performance would be good enough.