dilab-zju / self-speculative-decoding

Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding**
Apache License 2.0

Rejection Sampling & Adaptive Draft-Exiting Mechanism #19

Closed wutong4012 closed 1 month ago

wutong4012 commented 1 month ago

Very interesting work, thank you for your contribution to the open source community.

  1. In the first two papers, verification is done by probability-based rejection sampling, but the code implementation seems to directly compare the generated ids. Are these two equivalent? (A minimal sketch of the two verification rules is included after this list.) https://github.com/dilab-zju/self-speculative-decoding/blob/6c719da65ada6a4cd99ea32e84030caa9aae22c0/decoding.py#L137

  2. I saw the design of the Adaptive Draft-Exiting Mechanism. What is the reason for this design? I did not find any relevant theoretical proof or intuitive explanation.
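
For concreteness on point 1, below is a minimal, self-contained sketch (illustrative code, not the repo's implementation) of the two verification rules being compared: the exact id comparison used at decoding.py#L137, and the standard speculative/rejection-sampling accept-or-resample step. Here `p`, `q`, and `draft_id` stand for the target distribution, the draft distribution, and the drafted token at one position.

```python
# Illustrative sketch only, not the repo's code: the two verification rules
# for a single drafted position. `p` and `q` are 1-D probability tensors
# (target and draft next-token distributions); `draft_id` is the drafted token.
import torch


def verify_greedy(p: torch.Tensor, draft_id: int) -> bool:
    # Exact id comparison: accept iff the draft token equals the
    # target model's argmax token.
    return int(torch.argmax(p)) == draft_id


def verify_rejection_sampling(p: torch.Tensor, q: torch.Tensor, draft_id: int):
    # Speculative-sampling rule: accept x ~ q with probability min(1, p(x)/q(x));
    # on rejection, resample from the residual distribution norm(max(p - q, 0)).
    accept_prob = torch.clamp(p[draft_id] / q[draft_id], max=1.0)
    if torch.rand(()) < accept_prob:
        return True, draft_id
    residual = torch.clamp(p - q, min=0.0)
    residual = residual / residual.sum()
    return False, int(torch.multinomial(residual, 1))
```

Under greedy decoding, both the draft proposal and the target sample collapse to argmaxes, so the rejection-sampling test accepts exactly when `argmax(p) == draft_id`, which is the id comparison in the code.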

Please correct me if I'm wrong, thanks.

junzhang-zj commented 1 month ago

@wutong4012 In the greedy setting the two are equivalent; in the sampling setting you can call sss: https://github.com/dilab-zju/self-speculative-decoding/blob/6c719da65ada6a4cd99ea32e84030caa9aae22c0/decoding.py#L190

As for the adaptive exit mechanism, I think Section 3.4 of the paper already explains it intuitively. It mainly uses the acceptance rate as an anchor to judge how difficult a token is, preventing the draft model from drafting difficult tokens that are destined to fail verification, while letting simple tokens be generated by the draft model as much as possible.