Closed — jivanph closed this issue 9 months ago
Thank you for your interest in our work.
As mentioned in Section 4.1 of our paper, we accept a draft token only if it exactly matches (i.e., is identical to) the "true" token sampled from the LLM. This ensures that REST's outputs are identical to those produced by standard autoregressive generation (see L254 or L268 in utils.py for the code implementation).
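For anyone skimming the thread, the acceptance rule can be sketched as a simple prefix match: keep draft tokens only up to the first position where they disagree with the tokens the LLM itself would emit. This is an illustrative standalone helper (the function name and inputs are hypothetical, not the actual code in utils.py):

```python
def accept_draft_tokens(draft_tokens, target_tokens):
    """Return the longest prefix of draft_tokens that exactly matches
    the "true" tokens sampled from the LLM.

    Hypothetical sketch of exact-match draft acceptance; the real
    implementation in utils.py operates on model tensors, not lists.
    """
    accepted = []
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            break  # first mismatch: stop, later drafts are discarded
        accepted.append(draft)
    return accepted


# The first two draft tokens match the LLM's tokens, the third does not,
# so only the first two are accepted.
print(accept_draft_tokens([5, 17, 3, 8], [5, 17, 9, 8]))  # -> [5, 17]
```

Because every accepted token is identical to the one the LLM would have produced anyway, the final sequence is guaranteed to match standard autoregressive decoding.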
We are sorry for the confusion and will revise the phrasing in the next version of the paper.
Thank you so much for your response. It clarified things for me.
Feel free to reopen this issue or open another one if you have any further questions.
Thank you so much for your contribution to the literature on decoding strategies.
After reading your paper with great attention, I noticed that in the 'Draft acceptance of REST' subsection of the paper you mention that you "check the correctness of the draft token" because you "adopt a similar acceptance strategy compared to the original speculative decoding".
But, from my understanding, the acceptance procedure in the original speculative decoding depends on computing a probability ratio between the large and small models' predictions. Since there is no small model in the methodology you propose, I wanted to ask how you decide whether to accept a proposed token. Also, is there a guarantee that the accepted tokens follow the same distribution as the original (large) model?