Zefan-Cai / PyramidKV

The Official Implementation of PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling
https://arxiv.org/pdf/2406.02069

Settings and implementations of baselines #16

Open · bingps opened this issue 2 months ago

bingps commented 2 months ago

Hi, thanks for your insightful work! However, it seems that some settings and implementations of the baselines are not entirely proper or fair:

  1. Window size. In run_longbench.py (line 221), PyramidKV uses a window size of 8 while the others use 32. Maybe PyramidKV and SnapKV should use the same window size? (A rough sketch of how the window size enters the token selection is included after this list.)

What's more, in their original experimental settings, SnapKV uses a window of 16 out of 1K tokens, while StreamingLLM uses only 4 initial tokens, with the remaining 1K or 2K as local tokens. Maybe the relative size of the local window versus the other tokens should be considered. (In your KV size = 128 experiment, it seems that StreamingLLM uses 96 initial tokens ...)

  2. As for H2O, the current implementation looks like just SnapKV without pooling; it is not the exact implementation from the original paper. Actually, H2O seems to evict/keep the same tokens across all layers? The results could be even worse than you reported 🤣.
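
To clarify what I mean by the window: in SnapKV-style selection, roughly sketched below (my own simplification with illustrative names, not the repo's code), the last `window_size` queries of the prompt score the earlier positions, the scores are pooled, and only `max_capacity_prompt - window_size` slots remain for the rest of the context, so the window size directly changes how the budget is spent.

    import torch
    import torch.nn.functional as F

    def select_prefix_kv(attn, window_size, max_capacity_prompt, kernel_size=7):
        # attn: [num_heads, q_len, k_len] prompt attention weights (q_len == k_len)
        # score each prefix position by the attention it receives from the
        # last `window_size` queries (the observation window)
        scores = attn[:, -window_size:, :-window_size].sum(dim=1)  # [num_heads, prefix_len]
        # smooth the scores with 1D pooling (SnapKV-style)
        scores = F.avg_pool1d(scores.unsqueeze(0), kernel_size=kernel_size,
                              padding=kernel_size // 2, stride=1).squeeze(0)
        # the observation window itself is always kept, so only
        # (max_capacity_prompt - window_size) slots are left for prefix tokens
        keep = scores.topk(max_capacity_prompt - window_size, dim=-1).indices
        return keep  # per-head indices of prefix tokens to keep
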
bingps commented 2 months ago

BTW, there is some confusing redundant code. For example, in run_longbench.py, "DynamicKV" on line 209 is never set. And on lines 231-236, the lists look unnecessary.

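            # broadcast scalar settings into per-layer lists, then write the window size,
            # prompt KV budget, and pooling kernel size into each layer's attention config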
            if not isinstance(window_sizes, list):
                window_sizes = [window_sizes] * layers
            if not isinstance(max_capacity_prompts, list):
                max_capacity_prompts = [max_capacity_prompts] * layers
            if not isinstance(kernel_sizes, list):
                kernel_sizes = [kernel_sizes] * layers
            for i in range(layers):
                model.model.layers[i].self_attn.config.window_size = window_sizes[i]
                model.model.layers[i].self_attn.config.max_capacity_prompt = max_capacity_prompts[i]
                model.model.layers[i].self_attn.config.kernel_size = kernel_sizes[i]
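
If these settings are indeed always passed as scalars in run_longbench.py, the list broadcasting could be dropped; a minimal equivalent sketch (variable names such as `window_size` are illustrative, not the script's actual ones):

    for layer in model.model.layers:
        layer.self_attn.config.window_size = window_size
        layer.self_attn.config.max_capacity_prompt = max_capacity_prompt
        layer.self_attn.config.kernel_size = kernel_size
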
Zefan-Cai commented 2 months ago

Hi, thank you for your careful review of this codebase! It seems like you already know it inside out!

For SnapKV, it makes sense to switch the window size from 32 to 8. I implemented it with a window size of 32 following their original setting at https://github.com/FasterDecoding/SnapKV/tree/main/experiments/LongBench/config. They appear to use 32 as the window size in their experiments, not 16 as you mentioned, at least as shown in their codebase. But using the same window size as PyramidKV (8) makes sense. I just tried it and found that for SnapKV on mistral-7B-instruct-v0.2 with a 128 KV cache size, the performance increases from 34.96 (window size = 32) to 35.15 (window size = 8). I will update the results later.

For StreamingLLM, it makes sense to keep the initial 4 tokens and leave all of the remaining budget as the local window. I will update the implementation and results later.
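
For reference, that allocation would look roughly like this (a minimal sketch with illustrative names, assuming a total prompt budget `max_capacity_prompt` and 4 attention-sink tokens):

    import torch

    def streamingllm_keep_indices(seq_len, max_capacity_prompt, num_sink_tokens=4):
        # keep the first `num_sink_tokens` positions (the attention sinks) and
        # spend the entire remaining budget on the most recent tokens
        num_local = max_capacity_prompt - num_sink_tokens
        sink = torch.arange(num_sink_tokens)
        local = torch.arange(seq_len - num_local, seq_len)
        return torch.cat([sink, local])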

For H2O, I do notice the global-statistics-guided compression in their paper, but I didn't see the corresponding code in their implementation at https://github.com/FMInference/H2O/blob/main/h2o_hf/utils_lm_eval/modify_llama.py. Not sure if I missed something. My current implementation is generally aligned with theirs.
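
To spell out what I mean by global statistics, here is a rough sketch of the paper's heavy-hitter eviction (illustrative names, not the actual H2O code): each cached key accumulates the attention it receives across decoding steps, and once the cache exceeds the budget, the lowest-scoring key outside the recent window is evicted, per layer and per head.

    import torch

    def h2o_step(acc_scores, attn_step, budget, recent_size):
        # acc_scores: [num_heads, k_len] attention accumulated by each cached key so far
        # attn_step:  [num_heads, k_len] attention weights of the current decoding step
        acc_scores = acc_scores + attn_step
        k_len = acc_scores.shape[-1]
        if k_len <= budget:
            return acc_scores, None
        # only keys outside the recent window are eviction candidates
        candidates = acc_scores[:, : k_len - recent_size]
        evict = candidates.argmin(dim=-1)  # one eviction index per head
        return acc_scores, evict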

bingps commented 2 months ago

Thanks for your response! Sorry for my mistake about H2O. I checked its code and found a significant discrepancy between the implementation and the paper, and it cannot run due to various bugs. Since it is also layer-wise, I agree with you that your implementation is generally aligned with it ✅.

I have noticed your latest updates; the current settings (4 initial tokens for StreamingLLM and a unified window_size = 8 for the others) seem much more reasonable! Looking forward to the updated experiment results! 👏