bd-iaas-us / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Decouple positional encoding & persisted KV cache #30

Open JackChuang opened 2 months ago

JackChuang commented 2 months ago

Check the AttentionStore paper and evaluate whether its approach would perform well enough for our use case.

JackChuang commented 2 months ago

AttentionStore - decouple positional encoding analysis https://bytedance.larkoffice.com/wiki/DrKGwjYZ8icduzkyxyIcQ1imnXe

JackChuang commented 1 month ago

Understood how decoupling positional encoding works.
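
For reference, here is a minimal sketch of what "decoupling" means here, assuming a simplified GPT-NeoX-style RoPE helper (the `rope`, `persisted_k`, and `new_offset` names are illustrative, not vLLM APIs): the persisted cache stores raw, position-free keys, and RoPE is applied only at reuse time, with whatever positions the restored prefix lands at.

```python
import torch

def rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Simplified GPT-NeoX-style rotary embedding for x of shape [seq, heads, head_dim]."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = positions.float()[:, None] * freqs[None, :]            # [seq, half]
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]   # [seq, 1, half]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Conventional cache: keys are stored *after* RoPE, so they are tied to the
# absolute positions they were computed at and cannot be reused at a new offset.
#
# Decoupled cache (AttentionStore-style): persist the raw, position-free keys
# and apply RoPE lazily with whatever positions the reused prefix lands at.
persisted_k = torch.randn(8, 4, 64)              # pre-RoPE keys for a cached prefix
new_offset = 3                                   # prefix is restored at position 3
positions = torch.arange(new_offset, new_offset + persisted_k.shape[0])
k_for_attention = rope(persisted_k, positions)   # RoPE applied only at reuse time
```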

JackChuang commented 1 month ago

Goal: implement a working version to get numbers (quick > perfect).

JackChuang commented 2 weeks ago

Tony and I went through the experiment and the implementation together. The fastest way to get end-to-end results right now is the version that keeps both the pre- and post-RoPE KV cache, which we previously considered for a patent, so we started with that approach; the code is in this branch: https://github.com/bd-iaas-us/vllm/tree/horenc/as-patent-double-kvbuffer. My current results show that duplicating the KV cache adds about a 5% slowdown, but it avoids recomputing RoPE on every reuse. Compared to the 30-40% overhead we measured for RoPE recomputation in earlier microbenchmarks, that is a significant improvement. The other implementations are also more complex than initially anticipated and will need more in-depth work to modify:
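
For reference, a minimal sketch of the double-buffer idea, under the same simplified assumptions as the sketch in the earlier comment (the `DoubleKVBuffer` class and `restore_at` method are illustrative names, not the actual code in the branch): keep a position-free pre-RoPE copy of the keys for persistence and a post-RoPE copy that attention reads directly, so RoPE is re-applied only once when a restored prefix changes offset rather than on every step. The memory cost is the duplicated key buffer, which is the source of the ~5% slowdown mentioned above.

```python
import torch
from dataclasses import dataclass

def rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Same simplified GPT-NeoX-style RoPE helper as in the sketch above."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = positions.float()[:, None] * freqs[None, :]
    cos, sin = angles.cos()[:, None, :], angles.sin()[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

@dataclass
class DoubleKVBuffer:
    """Keys are stored twice: a position-free copy for persistence and a RoPE'd
    copy that attention reads directly. Values carry no position information,
    so a single copy is enough."""
    k_pre_rope: torch.Tensor   # [seq, heads, head_dim], safe to persist/offload
    k_post_rope: torch.Tensor  # [seq, heads, head_dim], RoPE applied at current offset
    v: torch.Tensor            # [seq, heads, head_dim]
    offset: int                # absolute position of the first cached token

    def restore_at(self, new_offset: int) -> None:
        """Re-rotate the pre-RoPE copy once if the prefix moves to a new offset;
        otherwise the post-RoPE copy is reused as-is, with no RoPE recompute."""
        if new_offset != self.offset:
            seq = self.k_pre_rope.shape[0]
            positions = torch.arange(new_offset, new_offset + seq)
            self.k_post_rope = rope(self.k_pre_rope, positions)
            self.offset = new_offset

# Usage: build the buffer once at prefill; every later reuse reads k_post_rope
# directly, and rope() runs again only when the prefix offset changes.
seq, heads, dim = 8, 4, 64
k_raw = torch.randn(seq, heads, dim)
buf = DoubleKVBuffer(k_raw, rope(k_raw, torch.arange(seq)), torch.randn(seq, heads, dim), offset=0)
buf.restore_at(5)  # prefix re-inserted at position 5: one re-rotation, not per-step
```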