SafeAILab / EAGLE

Official Implementation of EAGLE
https://arxiv.org/pdf/2406.16858
Apache License 2.0

Quality loss in greedy mode. #75

Closed w32zhong closed 1 month ago

w32zhong commented 1 month ago

In my case, with the model set to greedy decoding (the default ea_generate arguments), EAGLE's HumanEval accuracy for LLaMA-7B drops to 5.49 from the baseline's 8.54.

I have not yet looked into EAGLE's code carefully regarding this issue; I am just curious whether anyone else has encountered a similar issue.

If there is a bug in the code, could it be accidentally inflating the reported efficiency?

Liyuhui-12 commented 1 month ago


We conducted tests; the result files are attached as test.zip.

In FP32 precision, EAGLE's output (test/vc7b_fp32-temperature-0.0.jsonl) is completely consistent with the vanilla output (test/vc7_fp32_base-temperature-0.0.jsonl), as verified by running test/compare.py, except for the question with id 92. Examining that output shows the inconsistency is caused by different stopping strategies when the maximum length is reached. In FP16 precision, floating-point errors may lead to slight inconsistencies (see Appendix E of Specbench), but this should not result in quality loss.
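The actual compare.py is not reproduced here, but a minimal sketch of such a check, assuming each .jsonl line is a record with a "question_id" and the generated text under a hypothetical "output" key (the real schema may differ), could look like:

```python
import json

def load_outputs(path):
    """Map question_id -> generated text from a .jsonl answer file.

    The field names ("question_id", "output") are assumptions for
    illustration; EAGLE's actual answer files may use other keys.
    """
    outputs = {}
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            outputs[record["question_id"]] = record["output"]
    return outputs

def diff_outputs(path_a, path_b):
    """Return the sorted question ids whose generated text differs."""
    a, b = load_outputs(path_a), load_outputs(path_b)
    return sorted(qid for qid in a.keys() & b.keys() if a[qid] != b[qid])
```

With FP32 outputs, such a diff would be expected to flag only entries like id 92, where the stopping strategy rather than the model output differs.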

Your issue may be due to the following reasons:

  1. Different stopping strategies (e.g., different maximum lengths)

  2. Failure to truncate the output correctly (see L240-L262 of EAGLE/eagle/evaluation/gen_ea_answer_vicuna.py)
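The truncation code at those lines is not reproduced here, but the general idea of cutting generated text at the earliest stop marker can be sketched as follows (the stop strings below are placeholders, not EAGLE's actual list, which is derived from the conversation template):

```python
def truncate_at_stop(text, stop_strs=("</s>", "USER:")):
    """Cut generated text at the earliest occurrence of any stop marker.

    The default stop strings are assumptions for illustration; the real
    evaluation script builds them from the Vicuna conversation template.
    """
    cut = len(text)
    for stop in stop_strs:
        pos = text.find(stop)
        if pos != -1:
            cut = min(cut, pos)
    return text[:cut].rstrip()
```

Skipping this step leaves template tokens or a trailing turn marker in the answer, which evaluation harnesses can score as wrong even when the model's answer is correct.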

w32zhong commented 1 month ago

Thanks for your prompt response. I use the same maximum length of 1900. However, my baseline is evaluated in another framework because I need to compare different systems; the target models are the same checkpoint, though.

Are these commands below used to generate the outputs in test.zip?

python -m eagle.evaluation.gen_ea_answer_vicuna \
    --ea-model-path yuhuili/EAGLE-Vicuna-7B-v1.3 \
    --base-model-path lmsys/vicuna-7b-v1.3

python -m eagle.evaluation.gen_baseline_answer_vicuna \
    --ea-model-path yuhuili/EAGLE-Vicuna-7B-v1.3 \
    --base-model-path lmsys/vicuna-7b-v1.3

Liyuhui-12 commented 1 month ago

Generating the files in test.zip requires two additional steps.

First, pull the latest code; the recent update roughly unified the maximum generation length. Second, change torch_dtype=torch.float16 to torch_dtype=torch.float32 at line 188.
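Applied to the evaluation script, that dtype change amounts to something like the sketch below. This is not the exact code at line 188; the import path and from_pretrained arguments are assumptions based on a typical EAGLE loading call, so check them against the current script.

```python
import torch
# Module path and keyword names below are assumptions, not verified
# against the current EAGLE source.
from eagle.model.ea_model import EaModel

model = EaModel.from_pretrained(
    base_model_path="lmsys/vicuna-7b-v1.3",
    ea_model_path="yuhuili/EAGLE-Vicuna-7B-v1.3",
    torch_dtype=torch.float32,  # was torch.float16
    low_cpu_mem_usage=True,
)
```

Running both the EAGLE and baseline scripts in FP32 removes the half-precision rounding differences, so any remaining output divergence points to a real bug rather than floating-point noise.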

w32zhong commented 1 month ago

@Liyuhui-12 Thank you so much, I will give it a shot.