SafeAILab / EAGLE

Official Implementation of EAGLE
https://arxiv.org/pdf/2406.16858
Apache License 2.0

Generation quality loss #50

Closed by cyLi-Tiger 3 months ago

cyLi-Tiger commented 3 months ago

Hi, I noticed you mentioned in the paper that

> evaluating the quality of EAGLE’s generated results is both unnecessary and meaningless.

But based on my experiments on both llama2-chat-7b and Qwen-chat-7b, EAGLE's generation quality declined on C-Eval and HumanEval. The results are attached.

[image: C-Eval and HumanEval results for llama2-chat-7b and Qwen-chat-7b]

Baseline is the output from the HF model.generate(); eagle is the output from ea_model.eagenerate().

Have you done similar experiments? Any clues?

Liyuhui-12 commented 3 months ago

EAGLE is a speculative sampling method, which theoretically guarantees that the distribution of the generated text remains unchanged; the proof is in Appendix A of the paper. What were your test settings, greedy or non-greedy? Did the evaluators know which responses were generated by EAGLE? There was previously a bug in EAGLE's non-greedy setting that could cause this issue.
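For intuition, the guarantee comes from the standard speculative-sampling accept/reject rule. Here is a minimal single-token sketch (a generic illustration with distributions p and q as NumPy arrays, not code from this repo):

```python
import numpy as np

def speculative_sample(p, q, rng):
    """One speculative-sampling step: draw a draft token from q, accept it
    with probability min(1, p/q); on rejection, resample from the residual
    distribution max(p - q, 0), renormalized. The returned token is
    distributed exactly according to the target distribution p."""
    x = rng.choice(len(q), p=q)                 # draft token ~ q
    if rng.random() < min(1.0, p[x] / q[x]):    # accept with prob min(1, p/q)
        return x
    residual = np.maximum(p - q, 0.0)           # rejected: sample the residual
    return rng.choice(len(p), p=residual / residual.sum())
```

Averaging over the accept and reject branches recovers p exactly, so in theory any quality gap must come from an implementation bug, not from the method itself.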

cyLi-Tiger commented 3 months ago

Non-greedy. Can you share what the bug was, in case I'm hitting the same problem? @Liyuhui-12

cyLi-Tiger commented 3 months ago

Btw, do you mind sharing your sampling parameters, like top_p, for the non-greedy setting? I think a smaller top_p makes the decoding strategy closer to greedy decoding. In my experiments, the eval accuracy is not acceptable with a typical top_p value like 0.8, and it improves as top_p gets smaller. To get acceptable accuracy I have to set top_p to something like 0.2, which is not very common.
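To illustrate what I mean, here is a toy nucleus-sampling sketch (my own illustration, not EAGLE code); with top_p = 0.2, typically only the argmax token survives the filter, so sampling effectively degenerates to greedy decoding:

```python
import numpy as np

def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches
    top_p, zero out the rest, and renormalize."""
    order = np.argsort(probs)[::-1]              # tokens sorted by probability
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, top_p) + 1     # smallest prefix covering top_p
    filtered = np.zeros_like(probs)
    filtered[order[:cutoff]] = probs[order[:cutoff]]
    return filtered / filtered.sum()

probs = np.array([0.55, 0.25, 0.12, 0.08])
print(top_p_filter(probs, 0.8))  # keeps the top two tokens
print(top_p_filter(probs, 0.2))  # keeps only the argmax -> effectively greedy
```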

cyLi-Tiger commented 3 months ago

Another question about this MODIFIED line: what happens if the mask type is not set to float32? Will it cause accuracy loss?

Liyuhui-12 commented 3 months ago

> Another question about this MODIFIED line: what happens if the mask type is not set to float32? Will it cause accuracy loss?

This is copied from Medusa, and forcing the use of high precision should not result in a loss of quality.
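For intuition, this kind of mask is additive: allowed positions get 0 and blocked positions get the dtype's most negative value before the softmax. A generic sketch (not the actual Medusa/EAGLE code) of why float32 is a safe dtype to build it in:

```python
import torch

def build_additive_mask(allowed: torch.Tensor, dtype=torch.float32):
    """Turn a boolean attention mask into an additive one: 0 where attention
    is allowed, the dtype's minimum where it is blocked. Building it in
    float32 keeps the large negative sentinel values representable and avoids
    half-precision overflow if the mask is combined with other terms before
    the softmax."""
    mask = torch.zeros(allowed.shape, dtype=dtype)
    mask.masked_fill_(~allowed, torch.finfo(dtype).min)
    return mask

allowed = torch.tensor([[True, False], [True, True]])
print(build_additive_mask(allowed))
```

In half precision the same construction is more fragile: torch.finfo(torch.float16).min is about -65504, and adding further negative terms to it can overflow to -inf and produce NaNs after the softmax.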

Liyuhui-12 commented 3 months ago

@cyLi-Tiger We have fixed a bug related to sampling. After fixing it, we ran a large number of repeated experiments for validation: we replaced the original LLM with a model that outputs distributions based on position_ids and performed approximately 400,000 trials. As shown in the figure below, the distribution of the original LLM is highly consistent with the data sampled using the current version of the code, so we believe there is no quality loss now. You can find the code in the ./testbug directory; run python testbug.py for the repeated experiments and python vis.py to plot the results.

[image: original LLM distribution vs. distribution sampled by the current code]
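A toy version of the same check is easy to reproduce without the repo (a sketch of the methodology as described above, with made-up stand-in distributions): fix a target distribution, run the speculative accept/reject step many times, and compare empirical frequencies against the target.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.50, 0.30, 0.15, 0.05])   # stand-in target distribution
q = np.array([0.40, 0.40, 0.10, 0.10])   # stand-in draft distribution

def spec_sample():
    x = rng.choice(4, p=q)                    # draft token ~ q
    if rng.random() < min(1.0, p[x] / q[x]):  # accept
        return x
    r = np.maximum(p - q, 0.0)                # reject: residual sampling
    return rng.choice(4, p=r / r.sum())

counts = np.bincount([spec_sample() for _ in range(400_000)], minlength=4)
print(counts / counts.sum())   # should be very close to p
```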

w32zhong commented 1 month ago

In my case, I set the model to greedy decoding (the default ea_generate arguments), and EAGLE's HumanEval accuracy for LLaMA-7B drops to 5.49 from the baseline's 8.54.

Any ideas, or related bug fixes?
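For reference, under greedy decoding a speculative method should reproduce the baseline tokens exactly, so a token-level diff usually localizes the problem. A rough sketch of the check I would expect to pass (the commented calls are placeholders; I have not spelled out eagenerate's exact arguments):

```python
def first_divergence(base_ids, eagle_ids):
    """Report the first index where two token sequences differ; under
    greedy decoding EAGLE should match the baseline exactly, so any
    divergence points at where the bug kicks in."""
    n = min(len(base_ids), len(eagle_ids))
    for i in range(n):
        if base_ids[i] != eagle_ids[i]:
            return i
    return None if len(base_ids) == len(eagle_ids) else n

# base_ids  = model.generate(input_ids, do_sample=False, max_new_tokens=512)[0].tolist()
# eagle_ids = ea_model.eagenerate(input_ids, ...)[0].tolist()  # exact args per this repo
# print(first_divergence(base_ids, eagle_ids))
```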