SafeAILab / EAGLE

Official Implementation of EAGLE-1 (ICML'24) and EAGLE-2 (EMNLP'24)
https://arxiv.org/pdf/2406.16858
Apache License 2.0

Can I get a test setting?? #5

Closed je1lee closed 9 months ago

je1lee commented 9 months ago

I tried to reproduce the results on 2x A100 80GB with LLaMA2-chat 70B, but the speedup was only ~1.4x. Could this be because my base_model generation was already fast?

Can I get the average number of tokens generated per base_model (LLaMA2-chat 70B in this case) forward pass, or any other metric used to validate the speed?

Liyuhui-12 commented 9 months ago

The speedup of 2.72x for LLaMA2-chat 70B previously reported in the README was measured on 8x RTX 3090 GPUs at FP16 precision. We also conducted tests on 4x A100 40GB GPUs, achieving a speedup of 3.01x. These tests were carried out on MT-bench (the same setting as Medusa) by running gen_baseline_answer_llama2chat.py and gen_ea_answer_llama2chat.py to generate two jsonl files recording the outputs and wall times; the speedup was then calculated with speed.py. We have uploaded the two output files, which you can find in ./outs. We hope you can provide your testing code and output files to help identify any issues. Below, we list some potential problems.
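
For reference, the reported ratio is simply the tokens-per-second of the EAGLE run divided by the tokens-per-second of the baseline run. The following is a minimal sketch of that computation, assuming each jsonl record stores per-turn `new_tokens` and `wall_time` lists under `choices`; the exact schema is determined by the generation scripts, and the repository's speed.py is the authoritative implementation. The filenames are placeholders.

```python
# Hedged sketch: compute tokens/second for the EAGLE and baseline runs
# from their jsonl outputs, then take the ratio. Field names are assumed.
import json

def tokens_per_second(jsonl_path: str) -> float:
    total_tokens, total_time = 0.0, 0.0
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            for choice in record["choices"]:              # assumed field
                total_tokens += sum(choice["new_tokens"])  # assumed field
                total_time += sum(choice["wall_time"])     # assumed field
    return total_tokens / total_time

speed_ea = tokens_per_second("ea_answer_llama2chat.jsonl")        # placeholder filename
speed_base = tokens_per_second("baseline_answer_llama2chat.jsonl")  # placeholder filename
print(f"speed {speed_ea} speed0 {speed_base} ratio {speed_ea / speed_base}")
```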

je1lee commented 9 months ago

Thanks for the reply!

speed 19.349681035315008 speed0 8.240575095688778 ratio 2.3480983803470443

This is the result I got from running gen_baseline_answer_llama2chat.py and gen_ea_answer_llama2chat.py. I didn't change any settings except the directories of the base model and the EAGLE model.

The result is pretty good, but not as fast as 2.72x.

I actually ran on 2x A100 80GB GPUs. Could that be the reason for the lower acceleration rate?

je1lee commented 9 months ago

Also, is the current training code in the repository usable for the 70B model?

Liyuhui-12 commented 9 months ago

speed 22.47142512710068 speed0 7.470797443156302 ratio 3.0079018067456578

This is our result on 4*A100 80G GPUs. If your machine is faster, both your speed and speed0 should be quicker than ours, but this is not the case. Therefore, I don't think the difference in machines is causing the discrepancy in the speedup ratio. A possible reason could be that other processes were initiated during the run of EAGLE. Testing on a node without any other tasks might yield consistent results. If the problem persists, please feel free to contact us, and we hope you can provide the output files.

I noticed a significant difference in the speedup ratio under your test settings. What caused this issue?

The training code can be found in the ./train directory.

je1lee commented 9 months ago

Thanks for the answer!! As you said, there might have been another process running on the GPU I used. Without any disturbance I got: speed 25.494217695700268 speed0 8.976360899293706 ratio 2.8401507004588296.

Additionally, regarding the training code, I guess it is hard-coded with the LLaMA 13B head. Should it be changed to the 70B lm_head, and should the config also be changed to train the 70B-chat model?

And have you checked the MT-bench score? I know speculative sampling should maintain the base model's distribution, but I'd like to be sure.

Liyuhui-12 commented 9 months ago

I'm glad to hear that your issue has been resolved!

Thank you for the reminder. Indeed, the training code was hard-coded. The training-related code (./train/* and ./model/cnet.py) has now been updated: it automatically loads lm_head from basepath. The basepath parameter is the path of the original LLM, such as meta-llama/Llama-2-70b-chat-hf, and the configpath parameter is the path of the configuration file. We have provided configuration files for both LLaMA2-chat and Vicuna; you can find them in ./train by name. You can adjust some parameters, such as the batch size, to suit your training environment and improve training efficiency.
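
As an illustration, here is a minimal sketch of how a training script can pull the frozen lm_head from the base LLM given basepath and read the draft model's architecture from configpath. This is an assumption about the approach, not the repository's exact code; the config filename is hypothetical, and loading the full 70B model just to extract the head is memory-heavy and only meant to show the idea.

```python
# Hedged sketch (not the repository's exact code): obtain the base model's
# lm_head for draft-model training, and load the draft config from a JSON file.
import torch
from transformers import AutoConfig, AutoModelForCausalLM

def load_frozen_lm_head(basepath: str) -> torch.nn.Linear:
    # Load the base LLM in fp16 and copy out its lm_head weights.
    base = AutoModelForCausalLM.from_pretrained(basepath, torch_dtype=torch.float16)
    head = torch.nn.Linear(base.config.hidden_size, base.config.vocab_size, bias=False)
    head.weight.data.copy_(base.lm_head.weight.data)
    head.requires_grad_(False)   # the head stays frozen while the draft model trains
    del base                     # free the full model; only the small head is kept
    return head

lm_head = load_frozen_lm_head("meta-llama/Llama-2-70b-chat-hf")
draft_config = AutoConfig.from_pretrained("./train/llama_2_chat_70B_config.json")  # hypothetical filename
```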

We did not check the MT-bench score because it can be theoretically proven that EAGLE does not alter the distribution of the LLM, so there is no need to check the MT-bench score.
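
For context, that distribution-preservation argument rests on the standard speculative-sampling acceptance rule. The sketch below is the generic textbook form of that rule, not the repository's implementation: a drafted token x is accepted with probability min(1, p(x)/q(x)), and on rejection a replacement token is sampled from the normalized residual max(p - q, 0), which makes the final sample distributed exactly as p.

```python
# Generic speculative-sampling acceptance step (textbook form, for illustration).
# p: target-model probabilities, q: draft-model probabilities, both of shape [vocab].
import torch

def accept_or_resample(p: torch.Tensor, q: torch.Tensor, draft_token: int) -> int:
    # Accept the drafted token with probability min(1, p[x] / q[x]).
    if torch.rand(()) < torch.clamp(p[draft_token] / q[draft_token], max=1.0):
        return draft_token
    # On rejection, resample from the normalized residual max(p - q, 0);
    # combined with the acceptance rule above, the output is distributed as p.
    residual = torch.clamp(p - q, min=0.0)
    return torch.multinomial(residual / residual.sum(), 1).item()
```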

haiduo commented 2 months ago

> And have you checked the MT-bench score? I know speculative sampling should maintain the base model's distribution, but I'd like to be sure.

Personally, I think that although the MT-bench score should be approximately the same in theory, there still seems to be a certain gap in practice; after all, it is rejection sampling. Have you actually tested the MT-bench score?

je1lee commented 2 months ago

> Personally, I think that although the MT-bench score should be approximately the same in theory, there still seems to be a certain gap in practice; after all, it is rejection sampling. Have you actually tested the MT-bench score?

It has been a long time since I tested it, but as far as I remember, the score was reproduced.