Thanks for your interest in our work! We plan to conduct a comprehensive evaluation of cache eviction methods under GQA support, and the results will be included in our paper upon completion.
Preliminary results on Mistral-7B-v0.2 indicate that PyramidKV outperforms SnapKV under smaller budgets, regardless of GQA support. Similarly, Ada-PyramidKV outperforms Ada-SnapKV under smaller budgets. We therefore recommend choosing the method according to your actual budget / compression-ratio requirements.
Mistral-7B-v0.2 (LongBench Ave. Score) | B = 128 | B = 1024
---|---|---
SnapKV | 34.55 | 41.35
PyramidKV | 34.69 | 41.01
Ada-SnapKV | 35.48 | 41.61
Ada-PyramidKV | 35.83 | 41.22
@FFY0 thank you for the reply.
BTW, I think I found a bug in SnapKV-GQA.
First, the KV is repeated before `kv_cluster.update_kv` here: https://github.com/FFY0/AdaKV/blob/8c5a31f085f953bc706006dc9270087698872562/adaptive_snapkv/monkeypatch/fixed_llama_hijack.py#L211
Then, `repeat_kv` is applied again inside `kv_cluster.update_kv` here: https://github.com/FFY0/AdaKV/blob/8c5a31f085f953bc706006dc9270087698872562/adaptive_snapkv/monkeypatch/snapkv_utils.py#L220
This causes a double repeat, so the inferred shapes will be wrong.
Please double-check it. Thanks.
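To illustrate the symptom (not the repo's exact code path): below is a minimal sketch that copies the `repeat_kv` helper from Hugging Face transformers and uses hypothetical Mistral-like shapes (32 query heads, 8 KV heads). Applying the repeat twice inflates the head dimension by `n_rep` a second time, which is the shape mismatch described above.

```python
import torch

def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    # Mirrors the transformers helper:
    # (batch, num_kv_heads, seq_len, head_dim) -> (batch, num_kv_heads * n_rep, seq_len, head_dim)
    batch, num_kv_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_kv_heads, n_rep, slen, head_dim)
    return hidden_states.reshape(batch, num_kv_heads * n_rep, slen, head_dim)

# Hypothetical GQA shapes for illustration: 8 KV heads, n_rep = 4 (32 query heads).
key_states = torch.randn(1, 8, 1024, 128)

once = repeat_kv(key_states, 4)   # (1, 32, 1024, 128)  -- what attention expects
twice = repeat_kv(once, 4)        # (1, 128, 1024, 128) -- double repeat, wrong head count
print(once.shape, twice.shape)
```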
Sorry for the confusion. I noticed that you are running the cache eviction for the Llama model. Currently, we only support GQA for the Mistral-7B-v0.2 model. The LWM-Text-Chat-1M model based on `fixed_llama_hijack` mentioned in our paper is MHA-based, so it does not natively support GQA.
However, with appropriate modifications to `fixed_llama_hijack` or `adaptive_llama_hijack`, GQA support can also be enabled. We will release GQA support for Llama-based models in the near future. For now, you can test GQA functionality using the Mistral-7B-v0.2 model.
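For reference, a quick way to check whether a checkpoint is GQA or MHA is to compare `num_attention_heads` with `num_key_value_heads` in its config. This is a generic sketch using the transformers `AutoConfig` API; the model names are illustrative (not pinned to this repo's setup), and fetching the configs requires network access or a local cache.

```python
from transformers import AutoConfig

# Illustrative model names; any Llama/Mistral-style checkpoint works the same way.
for name in ["mistralai/Mistral-7B-Instruct-v0.2", "LargeWorldModel/LWM-Text-Chat-1M"]:
    cfg = AutoConfig.from_pretrained(name)
    n_rep = cfg.num_attention_heads // cfg.num_key_value_heads
    kind = "GQA" if n_rep > 1 else "MHA"
    print(f"{name}: {cfg.num_attention_heads} query heads / "
          f"{cfg.num_key_value_heads} KV heads -> {kind} (n_rep={n_rep})")
```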
@FFY0 Thank you for the reply. I modified some related code in `fixed_llama_hijack`, and the results seem OK.
Thank you for sharing these valuable experiments. I am now evaluating the accuracy of SnapKV, PyramidKV, and your methods. Basically, PyramidKV is a little better than SnapKV, so I think Ada-PyramidKV-GQA may be better than Ada-SnapKV-GQA.
So if you are working on it, I'm glad to wait for your test results on Ada-PyramidKV-GQA.
I think a dynamic sparse KV cache has lower cost than PTQ/QAT, dynamic quantization, or other training-aware compression, so AdaKV is a higher-priority method for me to implement.
Thank you again!