SafeAILab / EAGLE

Official Implementation of EAGLE-1 (ICML'24) and EAGLE-2 (EMNLP'24)
https://arxiv.org/pdf/2406.16858
Apache License 2.0

reproducing eagle on mistral-7b-v0.3-instruct #79

Open alcholiclg opened 6 months ago

alcholiclg commented 6 months ago

Dear Eagle Team:

Hello, and thank you very much for your excellent work for the community. Recently, while attempting to replicate Eagle, I encountered some issues that I have been unable to resolve, and I would greatly appreciate your insights into the possible reasons behind them.

My goal is to replicate EAGLE's results on mistral-7b-v0.3-instruct.

Here are the settings I used:

  1. For data generation, I employed the ge_data_all_llama2chat.py script, changing the LLM to mistral-7b-v0.3-instruct. I also altered the conversation template, removing the system_message component (see the template sketch after this list).

  2. During the training phase, I used the small model configuration with a batch size (bsz) of 12 on 8x H100 GPUs and a learning rate (lr) of 18e-5. The training metrics matched those of the official code, and the training progress is shown in the screenshots below.

  3. In the testing phase, I first evaluated consistency on 80 questions from the vicuna_questions.jsonl file in the qlora codebase. Specifically, I compared the token_id outputs of the LLM and EAGLE to assess their alignment. Surprisingly, the consistency was below 10%. As a benchmark, I ran the same test with the officially provided Vicuna and Llama models, which yielded consistency rates of approximately 87% and 96%, respectively, significantly higher than my own results.
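For reference, the change in point 1 looks roughly like the following FastChat-based sketch (the template name and the exact fields cleared are illustrative assumptions, not an exact diff):

```python
from fastchat.conversation import get_conv_template

# Illustrative sketch: pick a Mistral-style template and drop the system message.
conv = get_conv_template("mistral")  # template name is an assumption
conv.system_message = ""             # remove the system_message component
conv.append_message(conv.roles[0], "Give me three tips for staying healthy.")
conv.append_message(conv.roles[1], None)
print(conv.get_prompt())
```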

Given the above, could you please provide me with some suggestions? I would be extremely grateful for any assistance you can offer. Thank you very much.

[screenshots of training metrics]

ssm_ids= [1, 2, 3, 4, 5, 6], llm_ids=[1, 2, 4, 5, 6, 7], alignment=33.333%
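A minimal sketch of how this alignment figure can be computed (the helper name is illustrative, not from the EAGLE codebase):

```python
def alignment(ssm_ids, llm_ids):
    # Fraction of positions (over the shorter sequence) where the EAGLE-assisted
    # decode and the vanilla LLM decode emit the same token id.
    n = min(len(ssm_ids), len(llm_ids))
    return sum(s == l for s, l in zip(ssm_ids, llm_ids)) / n if n else 0.0

# The example above: only the first two of six positions match.
print(f"{alignment([1, 2, 3, 4, 5, 6], [1, 2, 4, 5, 6, 7]):.3%}")  # 33.333%
```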

Liyuhui-12 commented 5 months ago

It seems that your test accuracy during training is normal, so I suspect that your training (including data generation) and evaluation might be using different base model weights or templates.

alcholiclg commented 5 months ago

> It seems that your test accuracy during training is normal, so I suspect that your training (including data generation) and evaluation might be using different base model weights or templates.

Thank you very much for your answer!

  1. After careful investigation, I found that the main problem in my reproduction was that the tree mask was incorrectly constructed in my custom modeling_mistral.py file (copied from transformers and modified following your instructions). After fixing this, the output consistency rate reaches 82% (see the masking sketch after this list).
  2. Another discovery is that the output distribution of eagle-mistral/vicuna/llama-7B-chat does not seem to align exactly with the distribution obtained by running mistral/vicuna/llama-7B-chat directly (model.generate() or token-by-token forward). Even when loading all models in fp32 precision, the consistency rate of eagle-mistral/vicuna/llama-7B-chat is around 97%. I am not sure whether this is due to the difference in computation between the tree decoding process and the vanilla autoregressive process.
  3. In addition, the speedup ratio of my reproduced eagle-mistral only reaches 1.93x over mistral-7b-v0.3-instruct. Based on the training-time performance, do you think there may be a problem with the consistency between the large model and the draft model? The test setting is 8x H100 80G with fp16 precision.
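For anyone hitting the same issue, here is a minimal sketch of the masking the fix needs to produce, modeled on the [modified] tree-mask handling in modeling_llama_kv.py (tensor names and shapes are my own illustration): the prefix keeps the usual causal mask, and the trailing tree positions are additionally restricted so each draft token attends only to its ancestors in the tree.

```python
import torch

def apply_tree_mask(attn_mask: torch.Tensor, tree_mask: torch.Tensor) -> torch.Tensor:
    # attn_mask: additive causal mask [bsz, 1, q_len, kv_len]; 0 where attention
    #   is allowed, a large negative value where it is blocked.
    # tree_mask: [bsz, 1, tree_len, tree_len]; 1 where a draft token may attend
    #   to another tree position (one of its ancestors), 0 otherwise.
    tree_len = tree_mask.size(-1)
    min_val = torch.finfo(attn_mask.dtype).min
    # Block the non-ancestor entries in the tree-vs-tree corner of the mask.
    attn_mask[:, :, -tree_len:, -tree_len:] = attn_mask[:, :, -tree_len:, -tree_len:].masked_fill(tree_mask == 0, min_val)
    return attn_mask
```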
ShivangiAg commented 5 months ago

Hi @alcholiclg, I am also working on integrating EAGLE with the Mistral Instruct model. Can you share the code modifications you made to make it compatible with Mistral? Also, is an average of 1.93 tokens per forward pass the best performance you have achieved with EAGLE on Mistral?

Liyuhui-12 commented 5 months ago

@alcholiclg

> Another discovery is that the output distribution of eagle-mistral/vicuna/llama-7B-chat does not seem to align exactly with the distribution obtained by running mistral/vicuna/llama-7B-chat directly (model.generate() or token-by-token forward). Even when loading all models in fp32 precision, the consistency rate of eagle-mistral/vicuna/llama-7B-chat is around 97%. I am not sure whether this is due to the difference in computation between the tree decoding process and the vanilla autoregressive process.

Floating-point arithmetic does not satisfy associativity, so in general (a + b) + c != (a + c) + b. The final distribution can be affected by GPU kernels and reduction order; if the probabilities of two tokens are very close, the chosen token may be inconsistent. However, in our tests under fp32 precision, vanilla generation and EAGLE generation on MT-bench are completely consistent at the discrete token level, except for differences caused by different truncation strategies and maximum lengths. Is your inconsistency due to this reason?
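A quick self-contained illustration of the non-associativity point:

```python
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))  # False: float addition is not associative
print((a + b) + c, a + (b + c))    # 0.6000000000000001 vs 0.6
```

At model scale, the same effect means a different GPU reduction order can flip the argmax when two token probabilities are nearly tied.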

> In addition, the speedup ratio of my reproduced eagle-mistral only reaches 1.93x over mistral-7b-v0.3-instruct. Based on the training-time performance, do you think there may be a problem with the consistency between the large model and the draft model? The test setting is 8x H100 80G with fp16 precision.

In our experiments, when the draft model (LLaMA structure) is inconsistent with the base model (Mixtral 8x7B, MoE structure), the acceptance rate drops significantly. I believe the reason might be the structural inconsistency between the draft model and the base model.

alcholiclg commented 4 months ago

> Hi @alcholiclg, I am also working on integrating EAGLE with the Mistral Instruct model. Can you share the code modifications you made to make it compatible with Mistral? Also, is an average of 1.93 tokens per forward pass the best performance you have achieved with EAGLE on Mistral?

Sorry for not being able to respond sooner, for personal reasons. First, the changes made for Mistral mainly follow the detailed guidelines provided by the author Liyuhui (thanks for the author's patience), which you can check against the sections marked [modified] in modeling_llama_kv.py.

  1. Make sure you import the correct libraries, classes, and functions.
  2. Make sure you correctly use the author's customized kv_cache; its data structure differs from the original llama kv_cache, especially the indices used to access content and the data types.
  3. Make sure you use the correct attention mask for inference: you need to decide whether to append the tree_mask after the causal_mask based on the model's tree_mask attribute. Note that this is not marked very clearly in my cloned version of the code, which can lead to inference errors. Specifically, follow the author's approach of adding a branch for this check when generating the mask. Since Mistral's attention mask differs from llama's, you may need to proofread carefully.
  4. Another thing to consider is whether to use GQA in the EAGLE head. From my observation, it is possible that you get better results without using GQA to keep the structure consistent, though I have not yet verified this carefully (see the config sketch after this list).
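For point 4, a hypothetical single-layer draft-head config for Mistral-7B-v0.3 (values are read off the public model card; this is a sketch, not a verified recipe):

```python
from transformers import MistralConfig

# Hypothetical draft-head config: EAGLE's draft model is a single decoder layer.
# Mistral-7B-v0.3 itself uses GQA (8 KV heads for 32 query heads); setting
# num_key_value_heads = num_attention_heads yields plain multi-head attention.
draft_config = MistralConfig(
    hidden_size=4096,
    intermediate_size=14336,
    num_hidden_layers=1,
    num_attention_heads=32,
    num_key_value_heads=32,  # 32 = no GQA; use 8 to match the base model
    vocab_size=32768,        # Mistral-7B-v0.3 extended vocabulary
)
```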

Second, 1.93 does not refer to tokens per second; it is the speedup ratio obtained by comparing generation speed against vanilla autoregressive decoding.
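Roughly, the measurement looks like this (function names are placeholders, not EAGLE APIs):

```python
import time

def tokens_per_second(generate_fn, prompts):
    # Wall-clock throughput of a generation function over a fixed prompt set.
    start = time.perf_counter()
    n_tokens = sum(len(generate_fn(p)) for p in prompts)
    return n_tokens / (time.perf_counter() - start)

# speedup = tokens_per_second(eagle_generate, prompts) / tokens_per_second(vanilla_generate, prompts)
# 1.93 here means EAGLE's wall-clock generation is ~1.93x faster than vanilla
# autoregression, which is distinct from the average tokens accepted per forward pass.
```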

vlbosch commented 2 months ago

Did you guys manage to successfully reproduce EAGLE 2 with Mistral? If so, I am curious as to the changes/settings that yield the best results. I'd like to train EAGLE 2 for Mistral Large, but knowing what works on the Small version could prove helpful. Thanks in advance!

zengxy20 commented 1 week ago

Hello, @alcholiclg. Have you ever encountered garbled characters during decoding? Following your notes, I changed the cache to the author's customized kv_cache and did not append the tree_mask after the causal_mask. I think there are wrong settings in our modeling_mistral_kv.py. Could you share your version of this file? Thanks in advance!

[screenshot of the garbled decoding output]