alcholiclg opened this issue 6 months ago
It seems that your test accuracy during training is normal, so I suspect that your training (including data generation) and evaluation might be using different base model weights or templates.
Thank you very much for your answer!
Hi @alcholiclg, I am also working on integrating EAGLE with the Mistral Instruct model. Can you share the code modifications you made to make it compatible with Mistral? Also, is an average of 1.93 tokens per forward pass the best performance you have achieved with EAGLE on Mistral?
@alcholiclg
Another observation is that the output distribution of eagle-mistral/vicuna/llama-7B-chat does not seem to be exactly aligned with the distribution produced by running mistral/vicuna/llama-7B-chat directly (either model.generate() or token-by-token forward). Even when loading all models in fp32 precision, the consistency rate of eagle-mistral/vicuna/llama-7B-chat is only around 97%. I am not sure whether this comes from numerical differences between the tree-decoding process and the vanilla autoregressive process.
Floating-point addition is not associative, so summing the same values in a different order, e.g. (a+b)+c versus a+(b+c), can give different results. The final distribution is therefore affected by GPU kernels and reduction order, and if the probabilities of two tokens are very close, the selected token may differ. However, in our tests under fp32 precision, vanilla generation and EAGLE generation on MT-bench are completely consistent at the discrete token level, apart from differences caused by different truncation strategies and maximum lengths. Could your inconsistency be due to this?
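A minimal sketch illustrating the point about reduction order (pure Python/NumPy, no project code assumed):

```python
import numpy as np

# Floating-point addition is not associative: summing the same values
# in a different order can give slightly different results.
rng = np.random.default_rng(0)
logits = rng.standard_normal(4096).astype(np.float32)

forward_sum = np.float32(0.0)
for x in logits:
    forward_sum += x

reverse_sum = np.float32(0.0)
for x in logits[::-1]:
    reverse_sum += x

print(forward_sum, reverse_sum, forward_sum == reverse_sum)
# The two sums typically differ in the last bits; after a softmax,
# two tokens with nearly identical probabilities can flip order.
```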
In addition, there is another issue: the speedup ratio of the eagle-mistral model I reproduced only reaches 1.93 relative to mistral-7b-v0.3-instruct. Judging from the training curves, do you think there may be a consistency problem between the base model and the draft model? The test setting is 8x H100 80GB with fp16 precision.
In our experiments, when the draft model (LLaMA structure) is inconsistent with the base model (Mixtral 8x7B, MoE structure), the acceptance rate drops significantly. I believe the reason might be the structural inconsistency between the draft model and the base model.
Hi @alcholiclg, I am also working on integrating EAGLE with the Mistral Instruct model. Can you share the code modifications you made to make it compatible with Mistral? Also, is an average of 1.93 tokens per forward pass the best performance you have achieved with EAGLE on Mistral?
Sorry for not responding sooner, for personal reasons. First, the changes made for Mistral mainly follow the detailed guidance provided by the author liyihui (thanks for the author's patience); you can check them against the sections marked [modified] in modeling_llama_kv.py.
Second, 1.93 does not refer to tokens/second, but to the speedup ratio obtained by comparing the generation speed with vanilla autoregression.
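For reference, a ratio like this is a wall-clock comparison rather than tokens per forward pass; a minimal sketch of how such a number is typically computed (the timing helper and generate callables here are hypothetical placeholders, not EAGLE code):

```python
import time

def measure_tokens_per_second(generate_fn, prompts):
    """Time any generate function and return throughput in tokens/second.
    `generate_fn` is a hypothetical callable that returns the number of
    new tokens produced for a prompt."""
    total_tokens, start = 0, time.perf_counter()
    for p in prompts:
        total_tokens += generate_fn(p)
    return total_tokens / (time.perf_counter() - start)

# Speedup ratio = EAGLE throughput divided by vanilla autoregressive throughput:
# speedup = measure_tokens_per_second(eagle_generate, prompts) / \
#           measure_tokens_per_second(vanilla_generate, prompts)
```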
Did you guys manage to successfully reproduce EAGLE 2 with Mistral? If so, I am curious as to the changes/settings that yield the best results. I'd like to train EAGLE 2 for Mistral Large, but knowing what works on the Small version could prove helpful. Thanks in advance!
Hello, @alcholiclg. Have you ever run into garbled output during decoding? Following what you shared, I changed the cache to the author's customized kv_cache and did not apply the tree_mask on top of the causal_mask. I suspect there are incorrect settings in our modeling_mistral_kv.py. Could you share that file with us? Thanks in advance!
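For what it's worth, the tree_mask handling in the modified modeling files is essentially an extra constraint layered on top of the standard additive causal mask. A rough sketch of the idea, not the actual EAGLE code (shapes, attribute names, and the trailing-block assumption are mine):

```python
import torch

def apply_tree_mask(causal_mask: torch.Tensor, tree_mask: torch.Tensor) -> torch.Tensor:
    """causal_mask: additive mask of shape [bsz, 1, q_len, kv_len]
       (0 for visible positions, a large negative value for masked ones).
       tree_mask:   0/1 mask of shape [bsz, 1, tree_len, tree_len] describing
       which draft-tree tokens may attend to each other."""
    mask = causal_mask.clone()
    tree_len = tree_mask.size(-1)
    min_value = torch.finfo(mask.dtype).min
    # Only the trailing tree_len x tree_len block corresponds to the draft tree;
    # positions the tree mask forbids are pushed to -inf on top of the causal mask.
    mask[:, :, -tree_len:, -tree_len:] = mask[:, :, -tree_len:, -tree_len:].masked_fill(
        tree_mask == 0, min_value
    )
    return mask
```

If the tree mask is skipped, every draft token can attend to every other draft token in the tree, which would plausibly produce garbled output during tree decoding.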
Dear Eagle Team:
Hello, and thank you very much for your excellent work for the community. Recently, while attempting to replicate Eagle, I encountered some issues that I have been unable to resolve, and I would greatly appreciate your insights into the possible reasons behind them.
My goal is to replicate the effects of Eagle on mistral-7b-v0.3-instruct.
Here are the settings I used:
For data generation, I employed the ge_data_all_llama2chat.py script, modifying the LLM selection to mistral-7b-v0.3-instruct. Additionally, I altered the conversation template used, removing the system_message component.
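In case it helps others, here is a minimal sketch of the kind of template change meant here: building the Mistral instruct prompt layout without a system message (the helper itself is hypothetical and not part of the ge_data script; the exact [INST] layout should be checked against the Mistral v0.3 tokenizer's chat template):

```python
def build_mistral_prompt(turns):
    """turns: list of (user, assistant) pairs; assistant may be None for the
    final turn. No system message is inserted. The BOS token is assumed to be
    added by the tokenizer, so it is omitted here."""
    prompt = ""
    for user, assistant in turns:
        prompt += f"[INST] {user} [/INST]"
        if assistant is not None:
            prompt += f" {assistant}</s>"
    return prompt

# Example:
# build_mistral_prompt([("Hello", "Hi!"), ("Tell me a joke", None)])
# -> "[INST] Hello [/INST] Hi!</s>[INST] Tell me a joke [/INST]"
```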
During the training phase, I used a small-model configuration with a batch size (bsz) of 12, 8x H100, and a learning rate (lr) of 18e-5. The training metrics matched those of the official code, and the training progress is shown below.
In the testing phase, I first evaluated consistency on the 80 questions from the vicuna_questions.jsonl file in the qlora codebase. Specifically, I compared the output token_ids of the LLM and EAGLE to assess their alignment. Surprisingly, the consistency was less than 10%. As a baseline, I ran the same test with the officially provided Vicuna and LLaMA models, which yielded consistency rates of approximately 87% and 96%, respectively, far higher than my own result.
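One way to run such a consistency check, sketched against the EaModel interface shown in the EAGLE README (model paths, prompt, and decoding settings are placeholders, and I'm assuming eagenerate returns the full sequence including the prompt, like generate does):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from eagle.model.ea_model import EaModel  # from the EAGLE repo

base_path = "mistralai/Mistral-7B-Instruct-v0.3"   # placeholder paths
ea_path = "path/to/trained/eagle/head"

tokenizer = AutoTokenizer.from_pretrained(base_path)
base = AutoModelForCausalLM.from_pretrained(
    base_path, torch_dtype=torch.float32, device_map="auto"
)
eagle = EaModel.from_pretrained(
    base_model_path=base_path, ea_model_path=ea_path,
    torch_dtype=torch.float32, device_map="auto"
)
eagle.eval()

prompt = "[INST] Compose an engaging travel blog post about a recent trip to Hawaii. [/INST]"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")

# Greedy decoding on both paths; at temperature 0 the two token streams
# should match apart from truncation/length differences.
llm_ids = base.generate(input_ids, do_sample=False, max_new_tokens=256)[0, input_ids.shape[1]:]
ssm_ids = eagle.eagenerate(input_ids, temperature=0.0, max_new_tokens=256)[0, input_ids.shape[1]:]
```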
Given the above, could you please offer some suggestions? I would be extremely grateful for any assistance you can provide. Thank you very much.
ssm_ids=[1, 2, 3, 4, 5, 6], llm_ids=[1, 2, 4, 5, 6, 7], alignment=33.333%
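For clarity, the alignment figure in the log line above appears to be the fraction of positions where the two id sequences agree; a small helper that reproduces the 33.333% example (my own interpretation, not the original measuring code):

```python
def alignment(ssm_ids, llm_ids):
    """Fraction of positions (over the shorter sequence) where the EAGLE
    output and the base-model output produce the same token id."""
    n = min(len(ssm_ids), len(llm_ids))
    matches = sum(1 for a, b in zip(ssm_ids[:n], llm_ids[:n]) if a == b)
    return matches / n if n else 0.0

print(alignment([1, 2, 3, 4, 5, 6], [1, 2, 4, 5, 6, 7]))  # 0.333...
```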