Open PengWenChen opened 1 month ago

Hi, thanks for your great work!

It's impressive to compress the long prompt KVs into a constant length. I'm wondering whether the scenario here also considers the case where the generated response exceeds the maximum capacity?

The code always takes the ln127 branch during the prefill stage, and during the generation stage it always takes the ln131 branch. Is my understanding correct? https://github.com/FasterDecoding/SnapKV/blob/main/snapkv/monkeypatch/mistral_hijack_4_37.py#L127-L133

Reply: Thanks for the question. Our method mainly focuses on long-context scenarios where the input is usually much longer than the output, which benefits generation speed. We didn't consider compression during the generation stage. I believe other work, such as H2O, also compresses during generation.
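For readers skimming the thread, the two code paths being discussed (compress the prompt KVs once at prefill, then plainly append during decode) can be sketched roughly as below. This is an illustrative toy, not SnapKV's actual implementation: the function name, the scoring input, and the `max_capacity`/`window` parameters are all assumptions made for the sketch.

```python
def update_kv_cache(cache, new_kv, scores=None, max_capacity=8, window=4):
    """Toy sketch of prefill-vs-decode cache handling (hypothetical, not SnapKV's code).

    cache:   list of already-kept KV entries (empty before prefill)
    new_kv:  KV entries for the newly processed token(s)
    scores:  per-position importance scores for new_kv (e.g. pooled attention
             from an observation window); only used on the prefill path
    """
    if not cache and len(new_kv) > window:
        # Prefill path (analogous to the ln127 branch): compress the prompt.
        # Keep the last `window` positions verbatim, plus the
        # (max_capacity - window) earlier positions with the highest scores.
        body, obs = new_kv[:-window], new_kv[-window:]
        body_scores = scores[: len(body)] if scores else [0.0] * len(body)
        keep = sorted(range(len(body)),
                      key=lambda i: body_scores[i],
                      reverse=True)[: max_capacity - window]
        cache.extend(body[i] for i in sorted(keep))  # preserve original order
        cache.extend(obs)
    else:
        # Decode path (analogous to the ln131 branch): append unconditionally,
        # so the cache keeps growing with each generated token -- the behavior
        # the question asks about.
        cache.extend(new_kv)
    return cache


# Prefill: a 20-token prompt is compressed down to max_capacity = 8 entries.
prompt = [f"kv{i}" for i in range(20)]
cache = update_kv_cache([], prompt, scores=list(range(20)))
print(len(cache))   # 8 entries survive prefill
print(cache[0])     # highest-scoring early position kept: kv12

# Decode: each generated token is appended, so the cache exceeds max_capacity.
update_kv_cache(cache, ["kv20"])
update_kv_cache(cache, ["kv21"])
print(len(cache))   # 10 -- grows past the cap during generation
```

This matches the reply above: compression happens once over the prompt, and nothing bounds the cache during generation.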