Question about graph verification

diaoyingyu commented 4 months ago

Hi, thanks for the great work! I'm trying to understand the TriForce method, but I'm confused about the middle speculation.

Does the target model with the retrieval-based KV cache need to verify after each draft inference? Can we verify these tokens in parallel? https://github.com/Infini-AI-Lab/TriForce/blob/193811b9e90a60d7d6c6834f978d0ad4a5a77537/utils/decoding.py#L182C1-L190C100
What's the acceptance rate of the middle speculation?

Thanks

preminstrel commented 4 months ago

Hello, thanks for your interest in our work!

Yes, you can verify them in parallel. In our provided implementation, we set $\gamma_1 = 1$ because we observed that performance is nearly the same for $\gamma_1 = 2$ and decreases for larger values of $\gamma_1$. This is due to the low acceptance rate of Llama-68M. To keep things simple, our open-source code uses $\gamma_1 = 1$. If you’d like to try better draft models with higher acceptance rates, you can directly modify the function linked below. You only need to add an extra inner loop over $\gamma_1$.
As for the acceptance rate of the middle speculation: as we expected, it is low, only about 0.35, because the draft model is quite small (68M) and sees only limited local information (StreamingLLM).
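
To illustrate why a larger $\gamma_1$ stops paying off at this acceptance rate, here is a rough back-of-the-envelope sketch. It is not taken from the TriForce code. It assumes each drafted token is accepted independently with the same probability $\alpha$, and uses the standard speculative-decoding estimate $(1 - \alpha^{\gamma_1 + 1}) / (1 - \alpha)$ for the expected number of tokens produced per verification step. The function name `expected_tokens_per_verification` is hypothetical.

```python
def expected_tokens_per_verification(alpha: float, gamma: int) -> float:
    """Expected tokens produced per verification step.

    Assumption (not from the TriForce code): each drafted token is accepted
    independently with probability `alpha`, so the expected count is the
    geometric sum 1 + alpha + ... + alpha**gamma.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    alpha = 0.35  # acceptance rate reported in this thread
    for gamma in (1, 2, 3, 4):
        tokens = expected_tokens_per_verification(alpha, gamma)
        print(f"gamma_1={gamma}: ~{tokens:.2f} tokens per verification")
```

Under these assumptions, going from $\gamma_1 = 1$ to $\gamma_1 = 2$ raises the estimate only from about 1.35 to about 1.47 tokens per verification, while every extra draft step adds draft-model latency. That is consistent with the observation above that $\gamma_1 = 2$ performs about the same and larger values perform worse.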

If you have any further questions, feel free to ask.

diaoyingyu commented 4 months ago

Got it! Thanks for your reply :)
