Open bingps opened 2 months ago
BTW, there is some confusingly redundant code. For example, in run_longbench.py, "DynamicKV" on line 209 is never set, and the lists in lines 231-236 look unnecessary:
```python
if not isinstance(window_sizes, list):
    window_sizes = [window_sizes] * layers
if not isinstance(max_capacity_prompts, list):
    max_capacity_prompts = [max_capacity_prompts] * layers
if not isinstance(kernel_sizes, list):
    kernel_sizes = [kernel_sizes] * layers
for i in range(layers):
    model.model.layers[i].self_attn.config.window_size = window_sizes[i]
    model.model.layers[i].self_attn.config.max_capacity_prompt = max_capacity_prompts[i]
    model.model.layers[i].self_attn.config.kernel_size = kernel_sizes[i]
```
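If the intent is only to accept either a scalar or a per-layer list for each setting, the three near-identical branches could be collapsed into a single helper. A minimal sketch (the helper name is my own, not from the repo):

```python
def broadcast_per_layer(value, num_layers):
    """Expand a scalar setting into a per-layer list; pass lists through unchanged."""
    return value if isinstance(value, list) else [value] * num_layers

# Scalar settings get repeated for every layer; explicit per-layer lists are kept.
assert broadcast_per_layer(32, 4) == [32, 32, 32, 32]
assert broadcast_per_layer([8, 8, 16, 32], 4) == [8, 8, 16, 32]
```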
Hi, thank you for your careful review of this codebase! It seems you already know it inside and out!
For SnapKV, switching the window size from 32 to 8 makes sense. I implemented it with window size 32 following the original setting at https://github.com/FasterDecoding/SnapKV/tree/main/experiments/LongBench/config: they use 32 as the window size in their experiments, not 16 as you mentioned, at least as shown in their codebase. But matching PyramidKV's window size of 8 makes sense. I gave it a try: for SnapKV on mistral-7B-instruct-v0.2 with a 128 KV cache size, performance increases from 34.96 (window size = 32) to 35.15 (window size = 8). I will update the results later.
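For context on what the window size controls, here is a rough sketch of SnapKV-style selection: the last `window_size` prompt queries vote (via their attention weights) on which earlier KV entries to keep. This is a simplified illustration, not the repo's actual code; the function name and array shapes are my own assumptions:

```python
import numpy as np

def snapkv_keep_indices(attn, window_size=8, max_capacity_prompt=128, kernel_size=7):
    """attn: [num_queries, num_keys] prompt attention weights for one head."""
    seq_len = attn.shape[1]
    if seq_len <= max_capacity_prompt:
        return np.arange(seq_len)  # everything fits in the budget
    # Aggregate the observation window's attention over the earlier keys.
    scores = attn[-window_size:, : seq_len - window_size].sum(axis=0)
    # 1-D average pooling to smooth scores (SnapKV's clustering step).
    pad = kernel_size // 2
    pooled = np.convolve(np.pad(scores, pad, mode="edge"),
                         np.ones(kernel_size) / kernel_size, mode="valid")
    # Keep the top-(capacity - window) earlier tokens plus the window itself.
    top_k = max_capacity_prompt - window_size
    keep = np.sort(np.argpartition(pooled, -top_k)[-top_k:])
    return np.concatenate([keep, np.arange(seq_len - window_size, seq_len)])
```

A smaller window means fewer budget slots are spent on the observation window itself, which is why changing it from 32 to 8 can shift results at tight budgets like 128.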
For StreamingLLM, it makes sense to keep the initial 4 tokens and spend the entire remaining budget on the local window. I will update the implementation and results later.
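The intended budget split can be sketched as follows (illustrative only; the function and parameter names are my own): keep a few initial "attention sink" tokens and give every remaining slot to the most recent tokens.

```python
def streaming_llm_keep_indices(seq_len, budget=128, num_sink=4):
    """Indices kept under a StreamingLLM-style policy: sinks + recent window."""
    if seq_len <= budget:
        return list(range(seq_len))  # everything fits
    local = budget - num_sink  # e.g. 128 - 4 = 124 local tokens
    return list(range(num_sink)) + list(range(seq_len - local, seq_len))
```

With a 128-token budget this keeps tokens 0-3 plus the last 124 tokens, rather than 96 initial tokens.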
For H2O, I do notice the global-statistics-guided compression described in their paper, but I didn't find the corresponding code in their implementation at https://github.com/FMInference/H2O/blob/main/h2o_hf/utils_lm_eval/modify_llama.py. Not sure if I missed something. My current implementation is generally aligned with theirs.
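For reference, the layer-wise H2O-style eviction being discussed can be sketched roughly as follows (a simplified illustration, not their actual code; names, shapes, and the recent-window parameter are my own assumptions): accumulate each key's attention mass and keep the heavy hitters plus the most recent tokens.

```python
import numpy as np

def h2o_keep_indices(attn, budget=128, recent=64):
    """attn: [num_steps, num_keys] attention weights observed so far."""
    seq_len = attn.shape[1]
    if seq_len <= budget:
        return np.arange(seq_len)
    acc = attn.sum(axis=0)  # accumulated attention mass per key ("heavy hitter" score)
    # Always keep the most recent tokens; fill the rest with heavy hitters.
    k = budget - recent
    older = acc[: seq_len - recent]
    heavy = np.sort(np.argpartition(older, -k)[-k:])
    return np.concatenate([heavy, np.arange(seq_len - recent, seq_len)])
```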
Thanks for your response! Sorry for my mistake about H2O. I checked its code and found a significant discrepancy between the implementation and the paper, and the released code cannot run due to various bugs. Since it is also layer-wise, I agree that your implementation is generally aligned with it ✅.
I have noticed your latest updates; the current settings (4 initial tokens for StreamingLLM and a unified window_size=8 for the others) seem much more reasonable! Looking forward to the updated experiment results! 👏
Hi, thanks for your insightful work! But it seems that some baseline settings and implementations may not be entirely proper or fair.

In run_longbench.py: 221, PyramidKV uses a window size of 8 while the others use 32. Maybe PyramidKV and SnapKV should use the same window size? What's more, in their original experimental settings, SnapKV uses 16 out of 1K tokens as the window, while StreamingLLM only uses 4 initial tokens with the remaining 1K or 2K as local tokens. Maybe the relative size of the local window versus the other tokens should be considered. (In your KV size=128 experiment, it seems that StreamingLLM uses 96 initial tokens ...)