The previous report in the README for LLaMA2-chat 70B showed a speedup of 2.72x in an experimental setup with 8x RTX 3090 GPUs at FP16 precision. We also ran tests on 4x A100 40GB GPUs and achieved a speedup of 3.01x. These tests were carried out on MT-bench (the same benchmark used by Medusa) by running gen_baseline_answer_llama2chat.py and gen_ea_answer_llama2chat.py to generate two jsonl files recording the outputs and wall times; the speedup was then calculated with speed.py. We have uploaded the two output files, which you can find in ./outs. We hope you can provide your testing code and output files to help identify any issues. Below we list some potential problems.
1. Other processes were running during the test.
2. The correct chat template was not used. In the previous version of the code, gen_ea_answer_llama2chat.py selected the template based on the model-id parameter, which had to contain 'llama-2' (not 'llama_2' or 'llama2'). If your model-id did not include 'llama-2', the wrong chat template could produce abnormal model outputs and hurt EAGLE's acceleration. (We have since modified the code, so you can now specify the model-id freely.)
3. The length of the output was not taken into account. Comparing wall times directly is inappropriate; speed should be computed from the number of generated tokens, as in the sketch below.
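For illustration, here is a minimal sketch of that per-token comparison in the spirit of speed.py. The jsonl keys ("new_tokens", "wall_time") and file names here are assumptions for the example, not the actual schema; check the files in ./outs for the real field names.

```python
import json

def tokens_per_second(path):
    # Sum generated tokens and wall time across all MT-bench records,
    # then report throughput in tokens per second.
    total_tokens, total_time = 0, 0.0
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            total_tokens += record["new_tokens"]  # assumed key
            total_time += record["wall_time"]     # assumed key
    return total_tokens / total_time

# Hypothetical file names; the real output files live in ./outs.
speed = tokens_per_second("outs/eagle_llama2chat70b.jsonl")
speed0 = tokens_per_second("outs/baseline_llama2chat70b.jsonl")
print(f"speed {speed} speed0 {speed0} ratio {speed / speed0}")
```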
Thanks for the reply!
speed 19.349681035315008 speed0 8.240575095688778 ratio 2.3480983803470443
This is the result I got from running gen_baseline_answer_llama2chat.py and gen_ea_answer_llama2chat.py. I didn't change any settings except the directories of the base model and the EAGLE model.
The result is pretty good, though not as fast as 2.72x.
I actually ran on 2x A100 80GB GPUs. Could that be the reason for the lower acceleration rate?
Also, is the current training code in the repository usable for the 70B model?
speed 22.47142512710068 speed0 7.470797443156302 ratio 3.0079018067456578
This is our result on 4x A100 80GB GPUs. If your machine were faster, both your speed and speed0 should be higher than ours, but that is not the case, so I don't think the difference in machines explains the discrepancy in the speedup ratio. A possible reason is that other processes were running while EAGLE was being evaluated; testing on a node without any other tasks might yield consistent results. If the problem persists, please feel free to contact us, and we hope you can provide the output files.
I noticed a significant difference in the speedup ratio under your test settings. What caused this issue?
The training code can be found in the ./train directory.
Thanks for the answer! As you said, there might have been another process running on the GPU I used. Without any disturbance I got: speed 25.494217695700268 speed0 8.976360899293706 ratio 2.8401507004588296.
Additionally, for the training code, I guess it is hard-coded with the Llama 13B head. Should it be changed to the 70B lm_head, and should the config also be changed for 70B-chat training?
And have you checked the MT-bench score? I know speculative sampling should maintain the base model's distribution, but I'd like to be sure.
I'm glad to hear that your issue has been resolved!
Thank you for the reminder. Indeed, the training code was hard-coded. The training-related code (./train/* and ./model/cnet.py) has now been updated to automatically load lm_head from basepath. The basepath parameter is the path of the original LLM, such as meta-llama/Llama-2-70b-chat-hf, and the configpath parameter is the path of the configuration file. We have provided configuration files for both LLaMA2-chat and Vicuna; you can find them in ./train by name. You can adjust some parameters, such as the batch size, to suit your training environment and improve training efficiency.
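For readers following along, here is a rough sketch of how lm_head could be pulled from a sharded Hugging Face checkpoint without materializing the full 70B model. It assumes a pytorch_model.bin.index.json shard index (adapt for safetensors checkpoints); it is an illustration, not the repo's exact loading code.

```python
import json
import os
import torch

def load_lm_head(basepath, hidden_size, vocab_size):
    # The shard index maps each parameter name to the shard file holding it,
    # so only the shard containing lm_head.weight needs to be read.
    with open(os.path.join(basepath, "pytorch_model.bin.index.json")) as f:
        index = json.load(f)
    shard_file = index["weight_map"]["lm_head.weight"]
    shard = torch.load(os.path.join(basepath, shard_file), map_location="cpu")
    head = torch.nn.Linear(hidden_size, vocab_size, bias=False)
    head.weight.data = shard["lm_head.weight"]
    return head
```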
We did not check the MT-bench score because it can be theoretically proven that EAGLE does not alter the distribution of the LLM, so checking the MT-bench score is meaningless.
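For completeness, the standard speculative sampling correctness argument (this is the generic result, not anything EAGLE-specific): with target distribution $p$ and draft distribution $q$, a draft token $x \sim q$ is accepted with probability $\min(1, p(x)/q(x))$; on rejection, a token is resampled from $\mathrm{norm}(\max(0,\, p(\cdot)-q(\cdot)))$. The emitted token is then exactly $p$-distributed:

$$
P(x) = q(x)\min\!\Big(1,\tfrac{p(x)}{q(x)}\Big)
+ \Big(1-\sum_{x'}\min\big(p(x'),q(x')\big)\Big)\cdot\frac{\max\big(0,\,p(x)-q(x)\big)}{\sum_{x''}\max\big(0,\,p(x'')-q(x'')\big)}
= \min\big(p(x),q(x)\big) + \max\big(0,\,p(x)-q(x)\big) = p(x),
$$

using the identity $\sum_{x''}\max(0,\,p(x'')-q(x'')) = 1-\sum_{x'}\min(p(x'),q(x'))$.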
Personally, I think that although the MT-bench score should be approximately the same in theory, there still seems to be a certain gap in practice; after all, it is rejection sampling. Have you actually tested the MT-bench score?
It has been a long time since I tested, but as far as I remember the score was reproduced.
I tried to reproduce the result on 2x A100 80GB with Llama2 70B chat, but the speedup was only ~1.4x. Could this be because my base_model generation was already fast?
Could I get the average number of tokens generated per base_model (Llama2 70B chat in this case) forward pass, or any other metric used to validate the speed?
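For reference, here is how I would estimate that metric (average accepted tokens per base-model forward pass) from the EAGLE output jsonl, assuming each record logs the generated token count and the number of draft-verify cycles under hypothetical keys "new_tokens" and "steps"; the real schema may differ.

```python
import json

def avg_tokens_per_forward(path):
    # Average tokens accepted per base-model forward pass ("acceptance
    # length"); a higher value means more draft tokens verified per step.
    total_tokens, total_steps = 0, 0
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            total_tokens += record["new_tokens"]  # assumed key
            total_steps += record["steps"]        # assumed key
    return total_tokens / total_steps

print(avg_tokens_per_forward("outs/eagle_llama2chat70b.jsonl"))
```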