SafeAILab / EAGLE

Official Implementation of EAGLE
https://arxiv.org/pdf/2406.16858
Apache License 2.0

reproducing eagle on mistral-7b-v0.3-instruct #79

Open alcholiclg opened 3 weeks ago

alcholiclg commented 3 weeks ago

Dear Eagle Team:

Hello, and thank you very much for your excellent work for the community. Recently, while attempting to replicate Eagle, I encountered some issues that I have been unable to resolve, and I would greatly appreciate your insights into the possible reasons behind them.

My goal is to replicate the results of EAGLE on mistral-7b-v0.3-instruct.

Here are the settings I used:

  1. For data generation, I employed the ge_data_all_llama2chat.py script, modifying the LLM selection to mistral-7b-v0.3-instruct. Additionally, I altered the conversation template, removing the system_message component (a sketch of the conversation format appears after this list).

  2. During the training phase, I used a small model configuration with a batch size of 12 on 8xH100 GPUs and a learning rate of 18e-5. The training metrics matched the official code, and the training progress is shown below.

  3. In the testing phase, I initially evaluated the consistency on 80 questions from the vicuna_questions.jsonl file in the qlora codebase. Specifically, I compared the token_id outputs between the LLM and Eagle to assess their alignment. Surprisingly, the consistency was less than 10%. As a benchmark, I conducted tests using the officially provided Vicuna and Llama models, which yielded consistency rates of approximately 87% and 96%, respectively. These figures are significantly higher than my own test results.
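For reference, the conversation format I used is roughly the following (a sketch only; the actual modified ge_data script also handles tokenization and loss masks):

```python
# Rough sketch of the no-system-message Mistral-instruct formatting assumed here;
# the BOS token <s> is added by the tokenizer rather than by this string.
def build_mistral_prompt(turns):
    """turns: list of (user, assistant) pairs; assistant is None for the last turn."""
    prompt = ""
    for user, assistant in turns:
        prompt += f"[INST] {user} [/INST]"
        if assistant is not None:
            prompt += f" {assistant}</s>"
    return prompt

# build_mistral_prompt([("Hello", "Hi there!"), ("How are you?", None)])
# -> '[INST] Hello [/INST] Hi there!</s>[INST] How are you? [/INST]'
```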

Given the above, could you please provide me with some suggestions? I would be extremely grateful for any assistance you can offer. Thank you very much.

(training-progress screenshots attached)

ssm_ids= [1, 2, 3, 4, 5, 6], llm_ids=[1, 2, 4, 5, 6, 7], alignment=33.333%
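The alignment figure above is just the fraction of positions at which the two greedy token sequences agree, roughly as in this sketch:

```python
# Position-wise token-id agreement between EAGLE's output (ssm_ids) and the base
# LLM's own greedy output (llm_ids); a sketch of the metric reported above.
def alignment(ssm_ids, llm_ids):
    n = min(len(ssm_ids), len(llm_ids))
    if n == 0:
        return 0.0
    return sum(a == b for a, b in zip(ssm_ids[:n], llm_ids[:n])) / n

print(f"{alignment([1, 2, 3, 4, 5, 6], [1, 2, 4, 5, 6, 7]):.3%}")  # 33.333%
```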

Liyuhui-12 commented 3 weeks ago

It seems that your test accuracy during training is normal, so I suspect that your training (including data generation) and evaluation might be using different base model weights or templates.
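One quick way to check for a template mismatch is to tokenize the same question with the template used for data generation and the one used at evaluation and compare the resulting ids (a sketch; the model path and the two template functions below are placeholders for whatever your scripts actually build):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

def data_gen_prompt(q):   # placeholder: the prompt your modified ge_data script builds
    return f"[INST] {q} [/INST]"

def eval_prompt(q):       # placeholder: the prompt your evaluation harness builds
    return f"[INST] {q} [/INST]"

q = "Compose an engaging travel blog post about a recent trip to Hawaii."
a = tok(data_gen_prompt(q)).input_ids
b = tok(eval_prompt(q)).input_ids
print(a == b)  # False would point to a template mismatch between the two stages
```

The same applies to the base-model weights: both stages should load exactly the same checkpoint.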

alcholiclg commented 2 weeks ago

> It seems that your test accuracy during training is normal, so I suspect that your training (including data generation) and evaluation might be using different base model weights or templates.

Thank you very much for your answer!

  1. After careful investigation, I found that the main problem in my reproduction was that the tree mask was incorrectly constructed in the custom modeling_mistral.py file (copied from transformers and modified following your instructions). After fixing this, the output consistency rate reaches 82% (a sketch of the kind of mask handling involved appears after this list).
  2. Another finding is that the output distribution of eagle-mistral/vicuna/llama-7B-chat does not seem to align exactly with the distribution produced by running mistral/vicuna/llama-7B-chat directly (model.generate() or token-by-token forward). When all models are loaded in fp32 precision, the consistency rate of eagle-mistral/vicuna/llama-7B-chat is around 97%. I am not sure whether this is due to numerical differences between the tree-decoding process and the vanilla autoregressive process.
  3. In addition, the speedup ratio of my reproduced eagle-mistral only reaches 1.93 compared with mistral-7b-v0.3-instruct. Based on the training-process metrics, do you think there may be a consistency problem between the large model and the draft model? The test setting is 8x H100 80G and fp16 precision.
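For reference, the mask handling in point 1 looks roughly like the following (a sketch only, not the exact code in my modeling_mistral.py): the 0/1 tree mask is merged into the additive attention mask so that each draft token can only attend to its ancestors in the draft tree.

```python
import torch

def apply_tree_mask(attn_mask, tree_mask):
    """Sketch: merge a 0/1 tree mask into the additive attention mask.
    attn_mask: [bsz, 1, q_len, kv_len]; 0 where attention is allowed, a large
               negative value where it is not; draft tokens occupy the last positions.
    tree_mask: [tree_len, tree_len]; tree_mask[i, j] == 1 iff draft token i may
               attend to draft token j (j is an ancestor of i in the draft tree)."""
    tree_len = tree_mask.size(-1)
    min_val = torch.finfo(attn_mask.dtype).min
    block = attn_mask[..., -tree_len:, -tree_len:]
    attn_mask[..., -tree_len:, -tree_len:] = block.masked_fill(tree_mask == 0, min_val)
    return attn_mask
```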
ShivangiAg commented 1 week ago

Hi @alcholiclg, I am also working on integrating EAGLE with the Mistral Instruct model. Can you share the code modifications you made to make it compatible with Mistral? Also, is an average of 1.93 tokens per forward pass the best performance you have achieved with EAGLE on Mistral?

Liyuhui-12 commented 3 days ago

@alcholiclg

> Another finding is that the output distribution of eagle-mistral/vicuna/llama-7B-chat does not seem to align exactly with the distribution produced by running mistral/vicuna/llama-7B-chat directly (model.generate() or token-by-token forward). When all models are loaded in fp32 precision, the consistency rate of eagle-mistral/vicuna/llama-7B-chat is around 97%. I am not sure whether this is due to numerical differences between the tree-decoding process and the vanilla autoregressive process.

Floating-point arithmetic is not associative, so in general (a + b) + c != a + (b + c), and the final distribution can be affected by the GPU kernels used and the reduction order. If the probabilities of two tokens are very close, the chosen token may differ. However, in our tests under fp32 precision, vanilla generation and EAGLE generation on MT-bench are completely consistent at the discrete token level, except for differences caused by different truncation strategies and maximum lengths. Is your inconsistency due to this reason?
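A minimal illustration of the associativity point:

```python
# Same three numbers, different summation order, bitwise-different results.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)                  # 0.6000000000000001
print(a + (b + c))                  # 0.6
print((a + b) + c == a + (b + c))   # False
```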

> In addition, the speedup ratio of my reproduced eagle-mistral only reaches 1.93 compared with mistral-7b-v0.3-instruct. Based on the training-process metrics, do you think there may be a consistency problem between the large model and the draft model? The test setting is 8x H100 80G and fp16 precision.

In our experiments, when the draft model (LLaMA structure) is inconsistent with the base model (Mixtral 8x7B, MoE structure), the acceptance rate drops significantly. I believe the reason might be the structural inconsistency between the draft model and the base model.