FFY0 / AdaKV

The Official Implementation of Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
https://arxiv.org/abs/2407.11550

Hope for the Ada-PyramidKV-GQA result #4

Closed FdyCN closed 4 days ago

FdyCN commented 1 week ago

Thank you for sharing these valuable experiments. I am now evaluating the accuracy of SnapKV, PyramidKV, and your methods. Basically, PyramidKV is a little better than SnapKV, so I think Ada-PyramidKV-GQA may be better than Ada-SnapKV-GQA.

So if you are working on it, I'm glad to wait for your test results for Ada-PyramidKV-GQA.

I think a dynamic sparse KV cache has lower cost than PTQ/QAT, dynamic quantization, or other training-aware compression, so AdaKV is a higher-priority method for me to implement.

Thank you again!

FFY0 commented 6 days ago

Thanks for your interest in our work! We plan to conduct a comprehensive evaluation of cache eviction methods under GQA support. The results will be included in our paper upon completion.

Preliminary results on Mistral-7B-v0.2 indicate that PyramidKV outperforms SnapKV under smaller budgets, regardless of GQA support. Similarly, Ada-PyramidKV also outperforms Ada-SnapKV under smaller budgets. We therefore recommend choosing the method according to your actual budget compression ratio.

| Mistral-7B-v0.2 (LongBench Avg. Score) | B = 128 | B = 1024 |
| --- | --- | --- |
| SnapKV | 34.55 | 41.35 |
| PyramidKV | 34.69 | 41.01 |
| Ada-SnapKV | 35.48 | 41.61 |
| Ada-PyramidKV | 35.83 | 41.22 |
FdyCN commented 5 days ago


@FFY0 Thank you for the reply.

BTW, I think I found a bug in SnapKV-GQA.

First, repeat_kv is applied before kv_cluster.update_kv here: https://github.com/FFY0/AdaKV/blob/8c5a31f085f953bc706006dc9270087698872562/adaptive_snapkv/monkeypatch/fixed_llama_hijack.py#L211

Then, repeat_kv is applied again inside kv_cluster.update_kv here: https://github.com/FFY0/AdaKV/blob/8c5a31f085f953bc706006dc9270087698872562/adaptive_snapkv/monkeypatch/snapkv_utils.py#L220

So the KV states get repeated twice, and the shape inference goes wrong.
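Here is a minimal sketch of what I mean, using the standard HF-style repeat_kv helper rather than the repo's exact code: repeating an already-repeated cache inflates the head dimension a second time, so downstream shape checks fail.

```python
# Minimal sketch of the double-repeat issue, assuming the standard HF-style repeat_kv.
import torch

def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    # (batch, num_kv_heads, seq_len, head_dim) -> (batch, num_kv_heads * n_rep, seq_len, head_dim)
    batch, num_kv_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_kv_heads, n_rep, slen, head_dim)
    return hidden_states.reshape(batch, num_kv_heads * n_rep, slen, head_dim)

num_q_heads, num_kv_heads, seq_len, head_dim = 32, 8, 128, 64   # GQA: 4 query heads share each KV head
n_rep = num_q_heads // num_kv_heads
key_states = torch.randn(1, num_kv_heads, seq_len, head_dim)

once = repeat_kv(key_states, n_rep)    # (1, 32, 128, 64) -- what attention expects
twice = repeat_kv(once, n_rep)         # (1, 128, 128, 64) -- repeated again inside update_kv
print(once.shape, twice.shape)
```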

Please double-check it. Thanks.

FFY0 commented 5 days ago

Sorry for the confusion. I noticed that you are running cache eviction on the Llama model. Currently, we only support GQA for the Mistral-7B-v0.2 model. The LWM-Text-Chat-1M model based on fixed_llama_hijack mentioned in our paper is MHA-based, so it does not natively support GQA.

However, with appropriate modifications to fixed_llama_hijack or adaptive_llama_hijack, GQA support can also be enabled. We will release GQA support for Llama-based models in the near future. For now, you can test the GQA functionality with the Mistral-7B-v0.2 model.
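For illustration, one possible shape of such a modification (a rough sketch with a hypothetical helper, not the repo's released code) is to aggregate the observation-window attention scores over the query heads that share each KV head and then select per KV head, so the cache is never repeated before compression; the fixed per-head budget below is just for illustration and is orthogonal to adaptive budget allocation.

```python
# Rough sketch (hypothetical helper, not the repo's code) of GQA-aware selection:
# average the observation-window scores over the query heads in each KV group,
# then keep the top-k positions per KV head on the un-repeated cache.
import torch

def gqa_importance_scores(attn_weights: torch.Tensor, num_kv_heads: int) -> torch.Tensor:
    """attn_weights: (batch, num_q_heads, window_len, kv_len) from the observation window.
    Returns per-KV-head scores of shape (batch, num_kv_heads, kv_len)."""
    bsz, num_q_heads, _, kv_len = attn_weights.shape
    n_rep = num_q_heads // num_kv_heads
    scores = attn_weights.sum(dim=2)                                    # (bsz, num_q_heads, kv_len)
    return scores.view(bsz, num_kv_heads, n_rep, kv_len).mean(dim=2)    # (bsz, num_kv_heads, kv_len)

# Toy usage: keep 1024 entries per KV head from a GQA cache without repeat_kv.
bsz, q_heads, kv_heads, window, kv_len, head_dim = 1, 32, 8, 32, 4096, 128
attn = torch.rand(bsz, q_heads, window, kv_len)
keys = torch.randn(bsz, kv_heads, kv_len, head_dim)
idx = gqa_importance_scores(attn, kv_heads).topk(1024, dim=-1).indices  # (bsz, kv_heads, 1024)
kept_keys = keys.gather(2, idx.unsqueeze(-1).expand(-1, -1, -1, head_dim))
print(kept_keys.shape)  # torch.Size([1, 8, 1024, 128])
```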

FdyCN commented 5 days ago


@FFY0 Thank you for the reply. I modified the related code in fixed_llama_hijack, and the results look OK.