wutong4012 closed this issue 4 months ago.
@wutong4012 In the greedy setting, the two are equivalent. In the sampling setting, you can call sss, https://github.com/dilab-zju/self-speculative-decoding/blob/6c719da65ada6a4cd99ea32e84030caa9aae22c0/decoding.py#L190. As for the adaptive exit mechanism, I think Section 3.4 of the paper already explains the intuition. It uses the acceptance rate as an anchor to judge how difficult a token is. This keeps the draft model from predicting difficult tokens that are bound to fail verification, and it lets the draft model generate as many of the easy tokens as possible.
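The acceptance-rate anchor described in the reply can be pictured as a small feedback controller: drafting stops as soon as the draft model's top-1 probability drops below a confidence threshold, and that threshold is nudged up when too many drafted tokens are rejected and down when most are accepted. The sketch below is only an illustration under those assumptions. The class `AdaptiveExitThreshold`, its fields and the default values are hypothetical and are not taken from the repository or the exact formulas in Section 3.4.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveExitThreshold:
    """Hypothetical sketch of an acceptance-rate-driven draft-exit rule.

    All names and default values are illustrative assumptions,
    not the repository's actual implementation.
    """
    threshold: float = 0.6          # stop drafting when the draft's top-1 prob < threshold
    target_acceptance: float = 0.9  # desired fraction of drafted tokens that pass verification
    step: float = 0.01              # how far the threshold moves per update
    beta: float = 0.5               # smoothing factor for the running acceptance rate
    acceptance: float = 1.0         # smoothed acceptance rate observed so far

    def should_stop_drafting(self, draft_top1_prob: float) -> bool:
        # A low-confidence draft token is treated as "hard" and left to the full model.
        return draft_top1_prob < self.threshold

    def update(self, num_accepted: int, num_drafted: int) -> None:
        if num_drafted == 0:
            return
        rate = num_accepted / num_drafted
        # Exponential moving average of the per-round acceptance rate.
        self.acceptance = self.beta * self.acceptance + (1 - self.beta) * rate
        if self.acceptance < self.target_acceptance:
            # Too many rejections: require more confidence before drafting further.
            self.threshold = min(1.0, self.threshold + self.step)
        else:
            # Drafts are mostly accepted: let the draft model run longer.
            self.threshold = max(0.0, self.threshold - self.step)


ctrl = AdaptiveExitThreshold()
ctrl.update(num_accepted=2, num_drafted=5)  # smoothed rate 0.7 < 0.9, so the threshold rises
print(round(ctrl.threshold, 2), ctrl.should_stop_drafting(0.5))
```

Raising the threshold after rejections is what "prevents the draft model from predicting difficult tokens"; lowering it after acceptances is what lets easy tokens stay with the draft model.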
Very interesting work, thank you for your contribution to the open source community.
In the first two papers, verification is done by probabilistic rejection sampling, but the code implementation directly compares the generated token ids. Are the two equivalent? https://github.com/dilab-zju/self-speculative-decoding/blob/6c719da65ada6a4cd99ea32e84030caa9aae22c0/decoding.py#L137
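Why the two coincide under greedy decoding: standard speculative-sampling verification accepts a draft token x with probability min(1, p(x)/q(x)), where p is the target distribution and q the draft distribution, and on rejection resamples from the normalized residual max(0, p − q). With greedy decoding both p and q are one-hot, so the acceptance probability is exactly 1 when the draft token equals the target's argmax and exactly 0 otherwise, and the residual is one-hot on the target's argmax. The sketch below checks this numerically. The helper names `rejection_sample_step` and `greedy_id_compare_step` are hypothetical and do not come from the repository.

```python
import numpy as np


def one_hot(index: int, vocab_size: int) -> np.ndarray:
    v = np.zeros(vocab_size)
    v[index] = 1.0
    return v


def rejection_sample_step(p, q, draft_token, rng):
    """One step of speculative-sampling verification (hypothetical helper).

    p: target-model distribution, q: draft-model distribution over the vocabulary.
    Returns (accepted, output_token).
    """
    accept_prob = min(1.0, p[draft_token] / q[draft_token])
    if rng.random() < accept_prob:
        return True, draft_token
    # On rejection, resample from the normalized positive part of p - q.
    residual = np.maximum(p - q, 0.0)
    residual /= residual.sum()
    return False, int(rng.choice(len(p), p=residual))


def greedy_id_compare_step(target_logits, draft_token):
    """Id comparison: accept iff the draft token equals the target's argmax."""
    target_token = int(np.argmax(target_logits))
    return draft_token == target_token, target_token


rng = np.random.default_rng(0)
target_logits = np.array([0.1, 2.0, 0.3, -1.0, 0.0, 0.5, 1.2, -0.4])
vocab_size = len(target_logits)
# Greedy decoding: the target "distribution" is one-hot on its argmax.
p = one_hot(int(np.argmax(target_logits)), vocab_size)
for draft_token in range(vocab_size):
    q = one_hot(draft_token, vocab_size)  # greedy draft is one-hot on its own choice
    assert rejection_sample_step(p, q, draft_token, rng) == greedy_id_compare_step(target_logits, draft_token)
print("rejection sampling == id comparison for every greedy draft token")
```

With non-degenerate (sampled) distributions the acceptance probability is no longer 0 or 1, so plain id comparison is no longer equivalent and the sampling-based verification path is needed.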
I also noticed the Adaptive Draft-Exiting Mechanism. What motivates this design? I could not find a theoretical justification or an intuitive explanation for it.
Please correct me if I'm wrong, thanks.