Infini-AI-Lab / TriForce

[COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
https://infini-ai-lab.github.io/TriForce/

Question about graph verification #8

Closed diaoyingyu closed 3 months ago

diaoyingyu commented 3 months ago

Hi, thanks for the great work! I'm trying to understand the TriForce method, but I'm confused about the middle speculation.

  1. Does the target model with retrieval-based KV cache need to verify after each draft inference? Can we verify these tokens in parallel? https://github.com/Infini-AI-Lab/TriForce/blob/193811b9e90a60d7d6c6834f978d0ad4a5a77537/utils/decoding.py#L182C1-L190C100
  2. What's the acceptance rate of the middle speculation?

Thanks

preminstrel commented 3 months ago

Hello, thanks for your interest in our work!

  1. Yes, you can verify several draft tokens in parallel. In our provided implementation, we set $\gamma_1 = 1$ because we observed that performance is nearly the same for $\gamma_1 = 2$ and decreases for larger values of $\gamma_1$. This is due to the low acceptance rate of Llama-68M. To keep things simple, our open-source code uses $\gamma_1 = 1$. If you'd like to try better draft models with higher acceptance rates, you can directly modify the function linked below. You only need to add an extra inner loop for $\gamma_1$.

  2. As we expected, the acceptance rate is low, only about 0.35, because the draft model is quite small (68M) and sees only limited local information (StreamingLLM).
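
To make the two points above concrete, here is a minimal, self-contained sketch. It is not the TriForce code. The names `expected_tokens_per_round` and `draft_then_verify` are hypothetical, the "models" are toy next-token functions, and verification is greedy (exact match) rather than speculative sampling. The first function uses the standard speculative-decoding estimate $(1 - \alpha^{\gamma+1}) / (1 - \alpha)$, which assumes each draft token is accepted independently with probability $\alpha$. The second shows where an inner loop over $\gamma_1$ would sit before a single verification step.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # toy stand-in for a model forward pass


def expected_tokens_per_round(alpha: float, gamma: int) -> float:
    """Expected tokens produced per verification round.

    Assumes (for illustration only) that each draft token is accepted
    independently with probability alpha.
    """
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def draft_then_verify(prefix: List[Token], draft_next: NextToken,
                      target_next: NextToken, gamma: int) -> List[Token]:
    """One round: draft `gamma` tokens, then verify them greedily."""
    # Inner loop over gamma: the draft model proposes tokens one at a time.
    drafted: List[Token] = []
    for _ in range(gamma):
        drafted.append(draft_next(prefix + drafted))

    # Verification. A real verifier would score all drafted positions in one
    # forward pass; this toy version queries the verifier position by position.
    accepted: List[Token] = []
    for tok in drafted:
        verified = target_next(prefix + accepted)
        if verified != tok:
            accepted.append(verified)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)

    # Every draft token matched, so the verifier contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


if __name__ == "__main__":
    for gamma in (1, 2, 3):
        print(gamma, round(expected_tokens_per_round(0.35, gamma), 4))
```

Under the independence assumption, an acceptance rate of 0.35 gives roughly 1.35, 1.47, and 1.52 expected tokens per round for $\gamma_1 = 1, 2, 3$. Each extra draft step adds very little, which is consistent with the observation above that larger $\gamma_1$ does not help with a weak draft model.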

If you have any further questions, feel free to ask.

diaoyingyu commented 3 months ago

Got it! Thanks for your reply :)