Closed JYYHH closed 9 months ago
Thanks to Haoyu for organizing the question. Your understanding is correct. For each prompt, the first token is generated by the verification (target) model, which ensures the key-value cache is complete and correct. The subsequent draft and verification steps then build on the `past_key_values` obtained while generating that first token, and in later rounds the full model verifies the drafted tokens on top of this correct key-value cache.
For convenience, let's assume the LLM first generates a token "a", so the token sequence is [a]. The draft model then generates several tokens sequentially, and the sequence becomes [a, b, c, d, e]. The LLM verifies these in parallel: suppose "b" and "c" are correct, but the LLM generates "f" after "c" (meaning "d" is incorrect). So before the draft model's second round, the token sequence becomes [a, b, c, f].
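As a minimal sketch of the acceptance logic described above: tokens are plain strings, and `draft_tokens` / `target_tokens` are hypothetical stand-ins for real model outputs (in practice these would come from a draft model's sequential sampling and the target model's single parallel forward pass). This is only an illustration of the accept/rollback rule, not the repository's actual implementation.

```python
def verify(accepted, draft_tokens, target_tokens):
    """One verification round of speculative decoding (toy version).

    target_tokens[i] is what the target model predicts after seeing
    accepted + draft_tokens[:i]. Draft tokens are accepted while they
    match; at the first mismatch, the target's own token replaces the
    rejected draft token and the rest of the draft is discarded.
    """
    for i, d in enumerate(draft_tokens):
        if d == target_tokens[i]:
            accepted.append(d)          # draft token confirmed by target
        else:
            accepted.append(target_tokens[i])  # target's correction, e.g. "f"
            break                       # drop remaining draft tokens
    else:
        # Every draft token matched; the target's parallel pass also
        # yields one extra "bonus" token at the end.
        accepted.append(target_tokens[len(draft_tokens)])
    return accepted

# The example from this thread: sequence starts as ["a"]; the draft
# proposes b, c, d, e; the target agrees on b, c but emits "f" after c.
seq = verify(["a"], ["b", "c", "d", "e"], ["b", "c", "f", "x", "y"])
print(seq)  # → ['a', 'b', 'c', 'f']
```

The "f", "x", "y" entries in `target_tokens` are made-up values for the example; only "f" matters, since verification stops at the first mismatch.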
I just want to make sure that in this case: