SafeAILab / EAGLE

Official Implementation of EAGLE-1 and EAGLE-2
https://arxiv.org/pdf/2406.16858
Apache License 2.0
737 stars · 74 forks

error 'accept_length' in Eagle1 or 2? #95

Closed haiduo closed 4 weeks ago

haiduo commented 1 month ago

https://github.com/SafeAILab/EAGLE/blob/35854779f60db028ec646d945cacbdebdce490bb/eagle/evaluation/gen_ea_alpha_vicuna.py#L93

It looks like a bug: the calculation of accept_length only considers the last step of the last conversation instead of averaging over all steps. The results I reproduced are basically accept_length = 2. Experimental settings: LLM = Vicuna-7B, benchmark: MT-Bench.
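For reference, a minimal sketch of the suspected fix (variable names are mine, not the repo's): accumulate the accepted length at every decoding step of every conversation and average over all steps, rather than keeping only the last step's value:

```python
# Hypothetical sketch: average accept_length over all decoding steps
# of all conversations, instead of reporting only the final step.
def average_accept_length(conversations):
    """conversations: list of lists; each inner list holds the number
    of draft tokens accepted at each decoding step of one conversation."""
    total_accepted = 0
    total_steps = 0
    for steps in conversations:
        total_accepted += sum(steps)
        total_steps += len(steps)
    return total_accepted / total_steps if total_steps else 0.0

# Example: two conversations with per-step accepted lengths.
print(average_accept_length([[3, 2, 4], [1, 2]]))  # (3+2+4+1+2)/5 = 2.4
```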

Lucas-TY commented 1 month ago

They didn't release a script for the average accept length; you can simply dump the per-step values into a JSONL file and calculate the average yourself.
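A rough sketch of that dump-and-average workflow (the record format and function names are my assumptions, not from the repo): append one JSON record per decoding step while generating, then compute the mean offline:

```python
import json

# Hypothetical helper: append one record per decoding step to a JSONL file.
def dump_step(path, accept_length):
    with open(path, "a") as f:
        f.write(json.dumps({"accept_length": accept_length}) + "\n")

# Offline pass: read every record back and compute the mean accept length.
def mean_accept_length(path):
    lengths = []
    with open(path) as f:
        for line in f:
            lengths.append(json.loads(line)["accept_length"])
    return sum(lengths) / len(lengths) if lengths else 0.0
```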

haiduo commented 1 month ago

> They didn't release the average accept length script, you can simply dump them into jsonl file and calculate them yourself.

Yes, I used a modified script to calculate the average accept length, but the result was 2.0, while the paper reports a different value:

[image: accept length table from the paper]

I also suspect that the calculation of n-α has similar problems. These are my reproduced results:

[image: reproduced n-α results]

yanjunplay commented 1 month ago

Hi @haiduo, curious, did you have any update on this? :-)

haiduo commented 1 month ago

> Hi @haiduo, curious, did you have any update on this? :-)

Hello. I tried other devices (different GPUs) and the conclusions are basically the same as mine. I also looked at the implementations of other open-source frameworks, and they match mine. So I have reason to believe that the authors either didn't open-source the complete evaluation code or the paper's results are problematic. In addition, the baseline and EAGLE results in the table above are reversed, and the unit is tokens/s.

yanjunplay commented 1 month ago

Thanks @haiduo for replying. I just checked the Spec-Bench leaderboard https://github.com/hemingkx/Spec-Bench/blob/main/Leaderboard.md from the link you shared. Their "Accepted Tokens" numbers for EAGLE are all 3+. Do you mean the logic there is correct? Then I am a bit confused how they get 3+ while we can only get ~2 here for EAGLE-2. Have you tried the Spec-Bench scripts? Spec-Bench benchmarked EAGLE-1, but I would be surprised if EAGLE-2 were so much worse than EAGLE-1. I would also like to debug this together.

haiduo commented 1 month ago

Thank you for your comment @yanjunplay. I haven't reproduced the Spec-Bench results yet (but I plan to); I have only read through that code's experiments. In addition, the results above are problematic when using the EAGLE-2 code to measure the accept length directly, so I used the EAGLE-1 code (following the author's earlier reply in another issue). If there is a problem with my reproduction, it may be that I didn't use chain speculation (as in the paper) but only EAGLE-1's "gen_ea_alpha_vicuna.py". In fact, that test uses a tree with 26 nodes, i.e., tree speculation. But even so, why are the acceptance-rate values (0-α, 1-α, 2-α) correct? I am confused. BTW, the author's open-source EAGLE-2 currently only supports training and testing the speedup, so my reproduced results are based on EAGLE-1 without any changes. Finally, about the question I raised before: the author implements sampling differently in EAGLE-2 and EAGLE-1, and I don't know whether q(x)=1 still satisfies the distribution-preservation guarantee of speculative sampling.
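For what it's worth, the per-position acceptance rates (0-α, 1-α, 2-α) can be estimated from the same per-step logs; here is a hedged sketch under the chain-speculation view, where n-α is the fraction of steps that reached draft position n and also accepted the token there (the record format is my assumption):

```python
def alpha_rates(step_accept_lengths, max_depth=3):
    """Estimate n-alpha from the accepted length of each decoding step:
    among steps where all draft positions < n were accepted, the fraction
    where the token at position n was also accepted."""
    rates = []
    for n in range(max_depth):
        # Steps that reached position n (accepted at least n tokens).
        reached = [l for l in step_accept_lengths if l >= n]
        # Among those, steps that also accepted the token at position n.
        accepted = [l for l in reached if l >= n + 1]
        rates.append(len(accepted) / len(reached) if reached else 0.0)
    return rates

# Example: four steps accepting 3, 1, 2, and 0 draft tokens.
print(alpha_rates([3, 1, 2, 0]))  # [0.75, 0.666..., 0.5]
```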

yanjunplay commented 1 month ago

@haiduo do you use wechat? Maybe we can quickly discuss a bit.

haiduo commented 1 month ago

> @haiduo do you use wechat? Maybe we can quickly discuss a bit.

Sounds good! My WeChat account: macazi

qwedaq commented 1 month ago

> Thank you for your comment @yanjunplay. I don't reproduce the results of Spec Bench in time (but I am going to do so), I just look at the experiment of that code. In addition, the above results are problematic when using the code of eagle2 to test the accept length directly. I use the code of eagle1 (according to the author's previous reply in the issue). [...] Finally, I need to ask you about the question I raised before. The author implements it differently in eagle2 and eagle1. I don't know if q(x)=1 still meets the same distribution assumption of speculative sampling.

Hi @haiduo, were you able to get any answer as to why q(x)=1.0 in EAGLE2?

haiduo commented 1 month ago

Hi @qwedaq, although we didn't receive a reply from the author, we later worked it out ourselves: in the non-repeat (without-replacement) sampling mode of EAGLE-2, q(x)=1.0 is a special case of speculative decoding that only applies to EAGLE-2; it is not valid for EAGLE-1. So in theory EAGLE-2 should be fine here, but I haven't had time to check the actual generation quality. You could try other benchmarks to measure its score.
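For context, the standard speculative-sampling acceptance rule, and the q(x)=1 special case being discussed, can be sketched as follows (a textbook-style sketch, not EAGLE's actual code):

```python
import random

def accept_draft_token(p_x, q_x):
    """Standard speculative sampling: a draft token x proposed with draft
    probability q(x) is accepted with probability min(1, p(x)/q(x)), where
    p is the target model's probability; this keeps the output distribution
    identical to sampling from p alone.

    If the draft assigns q(x) = 1 (as in the EAGLE-2 mode discussed above),
    the acceptance probability reduces to p(x) itself."""
    return random.random() < min(1.0, p_x / q_x)
```

With q(x)=1 the rule degenerates to "accept with probability p(x)", which is why it only makes sense under EAGLE-2's particular draft construction.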

qwedaq commented 1 month ago

> Hi @qwedaq, although we don't receive a reply from the author, we deduce it later and find that in the non-repeat sampling mode of Eagle2, q(x)=1.0 is a special case of speculative decoding, which only applies to Eagle2. It is not reasonable for Eagle1. Therefore, Eagle2 should have no problem doing this in theory, but I haven't had time to try it out to see how the actual generated quality is. You can try to use other benchmarks to measure its score.

Thank you for your quick response. I am a bit new to speculative decoding; can you please elaborate on what you mean by the non-repeat sampling mode of EAGLE-2?

haiduo commented 1 month ago

> Thank you for your quick response. I am a bit new to speculative decoding can you please elaborate on what you mean by non-repeat sampling model of EAGLE2?

Firstly, you may want to read two papers: "Fast Inference from Transformers via Speculative Decoding" and "Accelerating Large Language Model Decoding with Speculative Sampling". Secondly, my understanding of "sampling without replacement" (non-repeat) is the same as in probability and statistics: each time a sample is drawn, whether accepted or not, it is excluded from the pool before the next round of sampling. Hope this helps.
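A small illustration of that idea (my own sketch, not EAGLE's code): after each draw, the drawn token is removed from the pool and the remaining probabilities are renormalized before the next draw:

```python
import random

def sample_without_replacement(probs, k, rng=random):
    """Draw k distinct token ids from a categorical distribution given as
    a list of probabilities, removing each drawn id and renormalizing the
    remainder before the next draw."""
    pool = dict(enumerate(probs))
    drawn = []
    for _ in range(min(k, len(pool))):
        total = sum(pool.values())          # renormalization constant
        r = rng.random() * total            # inverse-CDF draw over the pool
        chosen = None
        for tok, p in pool.items():
            r -= p
            chosen = tok
            if r <= 0:
                break
        drawn.append(chosen)
        del pool[chosen]                    # exclude from subsequent draws
    return drawn
```

Drawing all tokens this way always yields each id exactly once, which is the "whether accepted or not, it is excluded" behavior described above.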

qwedaq commented 1 month ago

Got it. Will read the papers you mentioned. Thank you again :)