Zefan-Cai / PyramidKV


Merge into vLLM, is it possible? #14

Open · PatchouliTIS opened this issue 1 week ago

PatchouliTIS commented 1 week ago

Solid idea and an ingenious implementation, great work! Have you considered implementing KV cache compression within the vLLM framework?

Zefan-Cai commented 5 days ago

Thank you for your support!

We are currently working on that. A vLLM implementation of KV cache compression will take some time, as it is challenging.

PatchouliTIS commented 2 days ago

I've also been thinking lately about how to implement this in vLLM.

In general, vLLM updates the KV cache using memory-friendly ops from torch._C: https://github.com/vllm-project/vllm/blob/717f4bcea036a049e86802b3a05dd6f7cd17efc8/vllm/attention/backends/flash_attn.py#L188
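For context, here is a conceptual, pure-PyTorch stand-in for what that cache-update step does (the real op is a fused CUDA kernel behind torch._C, e.g. a reshape_and_cache-style op; the block layout and names below are simplified assumptions, not vLLM's exact tensor layout):

```python
import torch

def reshape_and_cache_sketch(key: torch.Tensor,          # [num_tokens, num_kv_heads, head_dim]
                             value: torch.Tensor,        # same shape as key
                             key_cache: torch.Tensor,    # [num_blocks, block_size, num_kv_heads, head_dim]
                             value_cache: torch.Tensor,  # same layout as key_cache
                             slot_mapping: torch.Tensor  # [num_tokens], flat slot index per new token
                             ) -> None:
    """Conceptual stand-in for the fused cache-write kernel: scatter the new
    key/value vectors into their pre-assigned slots of the paged cache."""
    block_size = key_cache.shape[1]
    block_idx = slot_mapping // block_size   # which block each token lands in
    block_off = slot_mapping % block_size    # position inside that block
    key_cache[block_idx, block_off] = key
    value_cache[block_idx, block_off] = value
```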

Taking a simpler perspective, wouldn't it be possible to implement kv_cache compression directly in FlashAttentionImpl, or in a higher-level class, before vLLM updates the kv_cache, and leave the PagedAttention slicing to vLLM as usual? I would appreciate your insight on this if possible.
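To make that placement concrete, a minimal sketch assuming a hypothetical compress_kv selector and write_to_cache op (the class and method names are illustrative, not vLLM's actual API, and the block-manager bookkeeping for freed slots is glossed over):

```python
class CompressingAttentionImpl:
    """Illustrative wrapper showing where a compression hook could sit
    relative to the paged-cache write; not a drop-in vLLM backend."""

    def __init__(self, compress_kv, write_to_cache):
        # compress_kv: hypothetical selector, e.g. an update_kv-style hook
        # write_to_cache: a reshape_and_cache-style op as sketched above
        self.compress_kv = compress_kv
        self.write_to_cache = write_to_cache

    def forward(self, query, key, value, key_cache, value_cache,
                slot_mapping, attn_scores, budget):
        # 1) Decide which new tokens survive BEFORE they reach the paged cache.
        key, value, kept = self.compress_kv(key, value, attn_scores, budget)
        slot_mapping = slot_mapping[kept]

        # 2) Hand only the surviving tokens to the normal cache-update path;
        #    PagedAttention then runs over the (smaller) cache as usual.
        self.write_to_cache(key, value, key_cache, value_cache, slot_mapping)

        # 3) The attention kernel call itself is unchanged and omitted here.
```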

Zefan-Cai commented 2 days ago

I believe it is possible: if we build something like the update_kv logic in this repo and integrate it into FlashAttentionImpl, it may work.
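As a rough illustration of what such an update_kv-style step computes (names, shapes, and the fixed budget are assumptions for the sketch, not the repo's exact API):

```python
import torch

def update_kv_sketch(key_states: torch.Tensor,    # [bsz, num_heads, seq_len, head_dim]
                     value_states: torch.Tensor,  # [bsz, num_heads, seq_len, head_dim]
                     attn_weights: torch.Tensor,  # [bsz, num_heads, window_size, seq_len]
                     window_size: int = 8,
                     max_capacity: int = 256):
    """Illustrative score-based selection: always keep the most recent
    `window_size` tokens, and fill the remaining budget with the past tokens
    that received the most attention from that recent window. (In PyramidKV
    the budget would additionally shrink with layer depth; a fixed budget is
    used here to keep the sketch short.)"""
    bsz, num_heads, seq_len, head_dim = key_states.shape
    if seq_len <= max_capacity:
        return key_states, value_states

    past_len = seq_len - window_size
    # Attention mass each past token receives from the recent window.
    scores = attn_weights[..., :past_len].sum(dim=-2)                   # [bsz, heads, past_len]
    top_idx = scores.topk(max_capacity - window_size, dim=-1).indices   # [bsz, heads, budget]
    top_idx = top_idx.unsqueeze(-1).expand(-1, -1, -1, head_dim)

    k_past = key_states[:, :, :past_len].gather(2, top_idx)
    v_past = value_states[:, :, :past_len].gather(2, top_idx)
    k_keep = torch.cat([k_past, key_states[:, :, past_len:]], dim=2)
    v_keep = torch.cat([v_past, value_states[:, :, past_len:]], dim=2)
    return k_keep, v_keep
```

One caveat for the vLLM side: the selection above is per-head, while a paged-cache slot holds all heads of a token, so reconciling per-head budgets with block-level allocation (and actually freeing the blocks of evicted tokens) is probably where most of the integration effort lies.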

I have previously seen people working on sparse KV cache compression in vLLM, e.g. https://github.com/vllm-project/vllm/issues/5751. It seems they are hooking in at a much deeper level of the stack.

I believe the implementation you mentioned is much cleaner.

BTW, would you be interested in becoming a contributor to this project and implementing PyramidKV in vLLM? We would be more than happy to have your help.