FasterDecoding / SnapKV


What happens when the total KV length exceeds the max-capacity length during response generation? #23

Open PengWenChen opened 1 month ago

PengWenChen commented 1 month ago

Hi, thanks for your great work!

It's impressive to compress the long prompt KVs into a constant length. I'm wondering whether the scenario here also considers the case where the generated response is longer than the maximum capacity?

As far as I can tell, the code takes the branch at line 127 only during the prefilling stage, and during the generation stage it always takes the branch at line 131. Is my understanding correct? https://github.com/FasterDecoding/SnapKV/blob/main/snapkv/monkeypatch/mistral_hijack_4_37.py#L127-L133

WendyH1108 commented 1 month ago

Thanks for the question. Our method mainly focuses on long-context scenarios where the input is usually much longer than the output, and where it benefits generation speed. We didn't consider compression during the generation stage. I believe other work such as H2O also compresses during generation.
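
To make the behavior discussed in this thread concrete, here is a minimal toy sketch in plain Python. It is not the SnapKV code: `ToyCache`, `compress`, and `MAX_CAPACITY` are hypothetical names, and the "keep the last entries" rule is only a stand-in for the actual selection logic. It assumes, as described above, that compression runs once on the prompt and that generated tokens are appended without further compression.

```python
# Toy model of the behavior described in this thread; NOT SnapKV's actual code.
# All names here (ToyCache, compress, MAX_CAPACITY) are hypothetical.

MAX_CAPACITY = 8  # assumed budget for the compressed prompt KV entries


def compress(kv, capacity):
    """Stand-in for prompt compression: keep only the last `capacity` entries."""
    return kv[-capacity:]


class ToyCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.kv = []

    def step(self, new_kv):
        if not self.kv:
            # Prefill: the whole prompt arrives at once and is compressed.
            self.kv = compress(list(new_kv), self.capacity)
        else:
            # Generation: new entries are appended with no compression,
            # so the cache can grow beyond the capacity.
            self.kv.extend(new_kv)
        return len(self.kv)


cache = ToyCache(MAX_CAPACITY)
prefill_len = cache.step(range(100))                  # 100-token prompt
lengths = [cache.step([100 + i]) for i in range(5)]   # 5 generated tokens

print(prefill_len)  # cache length right after prefill
print(lengths)      # cache length after each generated token
```

In this toy model the cache length equals the capacity right after prefill and then grows by one per generated token, so a long enough response will exceed the capacity.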