dilab-zju / self-speculative-decoding

Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding**
Apache License 2.0

KV cache footprint #4

Closed JYYHH closed 9 months ago

JYYHH commented 9 months ago

For convenience, let's assume the LLM first generates a token "a", so the token sequence is [a]. The draft model then generates several tokens sequentially and the sequence becomes [a, b, c, d, e]. The LLM verifies these draft tokens in parallel; let's assume "b" and "c" are accepted, but the LLM generates "f" after "c" (meaning "d" is rejected). So before the second round of drafting, the token sequence becomes [a, b, c, f].

I just want to make sure in this case:

  1. When the LLM verifies [a], [a, b], [a, b, c], [a, b, c, d], and [a, b, c, d, e], it cannot and will not reuse the KV cache produced by the draft model's internal hidden layers.
  2. But once the LLM has confirmed [a, b, c, f], the KV cache for [a, b, c, f] can be shared between the draft model and the LLM itself in the following rounds (a rough sketch of the cache rollback is given after this list).
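To make the rollback concrete, here is a minimal sketch (not the repository's actual code) of cutting the full model's cache back to the verified prefix before the next draft round. It assumes the usual Hugging Face layout of per-layer (key, value) tensors shaped [batch, heads, seq_len, head_dim]; `truncate_kv_cache` is a hypothetical helper:

```python
import torch

def truncate_kv_cache(past_key_values, keep_len):
    """Keep only the first `keep_len` positions of every layer's (key, value) pair."""
    return tuple(
        (k[:, :, :keep_len, :], v[:, :, :keep_len, :])
        for k, v in past_key_values
    )

# Dummy cache for 2 layers covering the 5 tokens [a, b, c, d, e]:
# shape = [batch=1, heads=4, seq_len=5, head_dim=8].
cache = tuple((torch.randn(1, 4, 5, 8), torch.randn(1, 4, 5, 8)) for _ in range(2))

# Verification accepted only [a, b, c]; "f" has not been fed to the model yet,
# so its key/value entries are produced by the next forward pass that consumes it.
cache = truncate_kv_cache(cache, keep_len=3)
assert cache[0][0].shape[2] == 3
```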
junzhang-zj commented 9 months ago

Thanks to Haoyu for organizing the question. Your understanding is correct. For each prompt, the first token is generated by the verification (full) model, which guarantees that the key-value cache is complete and correct. All subsequent drafting and verification steps build on the 'past_key_values' obtained when generating that first token, and in later rounds the draft model reuses the key-value cache that the full model has already verified.
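As a rough illustration of that flow, here is a minimal sketch under assumed interfaces (not the repository's actual code): `model` stands in for a decoder that returns `.logits` and `.past_key_values`, and `draft_mode=True` is a hypothetical flag for the layer-skipping self-draft pass.

```python
import torch

def prefill_with_full_model(model, prompt_ids):
    """The full model generates the first token, so past_key_values is
    complete and correct for every layer before any drafting happens."""
    out = model(prompt_ids)
    first_token = out.logits[:, -1:].argmax(-1)
    return first_token, out.past_key_values

def one_draft_verify_round(model, last_token, past_key_values, num_draft=4):
    # Draft: the layer-skipping pass (hypothetical `draft_mode` flag) starts
    # from the verified cache and extends it one token at a time.
    draft_tokens, t, cache = [], last_token, past_key_values
    for _ in range(num_draft):
        out = model(t, past_key_values=cache, draft_mode=True)
        t = out.logits[:, -1:].argmax(-1)
        draft_tokens.append(t)
        cache = out.past_key_values

    # Verify: the full model scores the drafted tokens in a single pass,
    # again starting from the verified past_key_values rather than from
    # whatever the draft pass wrote for the layers it ran.
    verify_input = torch.cat([last_token] + draft_tokens, dim=-1)
    out = model(verify_input, past_key_values=past_key_values)
    # The caller keeps the longest matching prefix of draft_tokens, takes the
    # full model's next token as the bonus token, and truncates the returned
    # cache back to the verified length (as in the snippet earlier in the thread).
    return draft_tokens, out
```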