dilab-zju / self-speculative-decoding

Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding**
Apache License 2.0

KV cache footprint #4

Closed JYYHH closed 9 months ago

JYYHH commented 9 months ago

For convenience, let's assume the LLM first generates a token "a", so the token sequence is [a]. The draft model then generates several tokens sequentially and the sequence becomes [a, b, c, d, e]. The LLM verifies these draft tokens in parallel; let's assume "b" and "c" are accepted, but the LLM generates "f" after "c" (meaning "d" is rejected). So before the second round of drafting, the token sequence becomes [a, b, c, f].

I just want to make sure in this case:

  1. When the LLM verifies [a], [a, b], [a, b, c], [a, b, c, d], and [a, b, c, d, e], it cannot and will not reuse the KV cache produced by the draft model's internal hidden layers.
  2. But once the LLM has confirmed [a, b, c, f], the KV cache for [a, b, c, f] can be shared between the draft model and the LLM itself in the following rounds (a rough sketch of the cache rollback is given after this list).
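To make the rollback concrete, here is a minimal sketch (not the repository's actual code) of cutting the full model's cache back to the verified prefix before the next draft round. It assumes the usual Hugging Face layout of per-layer (key, value) tensors shaped [batch, heads, seq_len, head_dim]; `truncate_kv_cache` is a hypothetical helper:

```python
import torch

def truncate_kv_cache(past_key_values, keep_len):
    """Keep only the first `keep_len` positions of every layer's (key, value) pair."""
    return tuple(
        (k[:, :, :keep_len, :], v[:, :, :keep_len, :])
        for k, v in past_key_values
    )

# Dummy cache for 2 layers covering the 5 tokens [a, b, c, d, e]:
# shape = [batch=1, heads=4, seq_len=5, head_dim=8].
cache = tuple((torch.randn(1, 4, 5, 8), torch.randn(1, 4, 5, 8)) for _ in range(2))

# Verification accepted only [a, b, c]; "f" has not been fed to the model yet,
# so its key/value entries are produced by the next forward pass that consumes it.
cache = truncate_kv_cache(cache, keep_len=3)
assert cache[0][0].shape[2] == 3
```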
junzhang-zj commented 9 months ago

Thanks to Haoyu for organizing the question. Your understanding is correct. For each prompt, the first token is generated by the verification (full) model, which guarantees that the key-value cache is complete and correct. All subsequent drafting and verification steps build on the 'past_key_values' obtained when generating that first token, and in later rounds the draft model reuses the key-value cache that the full model has already verified.
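As a rough illustration of that flow, here is a minimal sketch under assumed interfaces (not the repository's actual code): `model` stands in for a decoder that returns `.logits` and `.past_key_values`, and `draft_mode=True` is a hypothetical flag for the layer-skipping self-draft pass.

```python
import torch

def prefill_with_full_model(model, prompt_ids):
    """The full model generates the first token, so past_key_values is
    complete and correct for every layer before any drafting happens."""
    out = model(prompt_ids)
    first_token = out.logits[:, -1:].argmax(-1)
    return first_token, out.past_key_values

def one_draft_verify_round(model, last_token, past_key_values, num_draft=4):
    # Draft: the layer-skipping pass (hypothetical `draft_mode` flag) starts
    # from the verified cache and extends it one token at a time.
    draft_tokens, t, cache = [], last_token, past_key_values
    for _ in range(num_draft):
        out = model(t, past_key_values=cache, draft_mode=True)
        t = out.logits[:, -1:].argmax(-1)
        draft_tokens.append(t)
        cache = out.past_key_values

    # Verify: the full model scores the drafted tokens in a single pass,
    # again starting from the verified past_key_values rather than from
    # whatever the draft pass wrote for the layers it ran.
    verify_input = torch.cat([last_token] + draft_tokens, dim=-1)
    out = model(verify_input, past_key_values=past_key_values)
    # The caller keeps the longest matching prefix of draft_tokens, takes the
    # full model's next token as the bonus token, and truncates the returned
    # cache back to the verified length (as in the snippet earlier in the thread).
    return draft_tokens, out
```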