PatchouliTIS opened this issue 1 week ago
Thank you for your support!
We are currently working on that. A vLLM implementation of KV cache compression will take time, as it is challenging.
I've also been thinking lately about how to implement this in vLLM.
In general, vLLM updates the KV Cache using memory-friendly functions in torch._C:
https://github.com/vllm-project/vllm/blob/717f4bcea036a049e86802b3a05dd6f7cd17efc8/vllm/attention/backends/flash_attn.py#L188
If we take a simpler perspective, wouldn't it be possible to implement the compression of the kv_cache directly in FlashAttentionImpl, or in a higher-level class, before vLLM writes the kv_cache, and leave the PagedAttention slicing to vLLM as usual? I would also like to hear your thoughts on this, if possible.
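To make the idea concrete, here is a rough sketch of the kind of hook I have in mind (the names `write_kv_with_compression`, `cache_write_fn`, and `compress_fn` are hypothetical stand-ins, not vLLM's actual API): the new tokens' K/V would be pruned before the backend writes them into the paged cache, so PagedAttention itself stays untouched.

```python
# Hypothetical placement sketch, not vLLM's real interface: compress the
# incoming key/value tensors *before* they are written into the paged KV
# cache, then let vLLM's normal PagedAttention machinery run unchanged.
import torch


def write_kv_with_compression(key, value, slot_mapping, cache_write_fn,
                              compress_fn=None):
    """key/value: the new tokens' K/V as produced inside the attention backend.

    cache_write_fn stands in for vLLM's cache-update op (the cache write that
    happens in FlashAttentionImpl.forward); compress_fn is an update_kv-style
    selector that drops unimportant tokens and reports which positions survive.
    """
    if compress_fn is not None:
        # keep_mask is a boolean tensor over token positions marking survivors.
        key, value, keep_mask = compress_fn(key, value)
        slot_mapping = slot_mapping[keep_mask]
    # Hand the (possibly smaller) K/V to vLLM's own cache writer so that
    # PagedAttention later only ever sees the retained slots.
    cache_write_fn(key, value, slot_mapping)
    return key, value
```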
I believe it is possible: as long as we build something like update_kv from this repo and integrate it into FlashAttentionImpl, it may work.
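For illustration, here is a simplified, standalone sketch of what such an update_kv-style step could look like (this is not the repo's actual implementation; `window_size` and `max_capacity` are assumed knobs): score the historical keys by the attention they receive from a recent observation window of queries, then keep only the top-scoring tokens plus the window itself.

```python
# Minimal sketch of a SnapKV/PyramidKV-style compression step (illustrative
# only, not the repo's update_kv): rank past keys by the attention mass they
# get from the last `window_size` queries and retain the top tokens.
import math
import torch


def compress_kv(query, key, value, window_size=32, max_capacity=256):
    """query/key/value: [batch, num_heads, seq_len, head_dim], pre-cache-write.

    Returns key/value truncated to at most max_capacity positions.
    """
    bsz, num_heads, seq_len, head_dim = query.shape
    if seq_len <= max_capacity:
        return key, value  # nothing to compress yet

    # Attention of the observation-window queries over all keys.
    obs_q = query[:, :, -window_size:, :]
    scores = torch.matmul(obs_q, key.transpose(-2, -1)) / math.sqrt(head_dim)
    scores = torch.softmax(scores, dim=-1)                  # [b, h, window, seq]
    token_scores = scores[..., :-window_size].sum(dim=-2)   # [b, h, seq - window]

    # Keep the top-(max_capacity - window_size) historical tokens per head.
    top_k = max_capacity - window_size
    top_idx = token_scores.topk(top_k, dim=-1).indices.sort(dim=-1).values
    top_idx = top_idx.unsqueeze(-1).expand(-1, -1, -1, head_dim)

    k_hist = key[..., :-window_size, :].gather(dim=2, index=top_idx)
    v_hist = value[..., :-window_size, :].gather(dim=2, index=top_idx)

    # The recent window is always retained uncompressed.
    k_new = torch.cat([k_hist, key[..., -window_size:, :]], dim=2)
    v_new = torch.cat([v_hist, value[..., -window_size:, :]], dim=2)
    return k_new, v_new
```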
Previously I have seen some people working on sparse KV cache compression in vLLM, for example: https://github.com/vllm-project/vllm/issues/5751 It seems they are working at a much deeper layer of the stack.
I believe the implementation you mentioned is far more elegant.
BTW, would you be interested in becoming a contributor to this project and implementing PyramidKV in vLLM? We would be more than happy if you could help.
Solid idea and ingenious code implementation, great work! Have you considered implementing KV cache compression operations in the vLLM framework?