shaonan1993 opened this issue 3 weeks ago
We use PyTorch as the backend, so during generation the CPU frequently has to launch kernels and perform pre- and post-processing around the computations. The absolute value of the speedup ratio therefore also depends on CPU utilization.
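To make this concrete, here is a minimal timing sketch of my own (not code from this repo) that separates GPU kernel time from the wall-clock time of one decoding step:

```python
import time
import torch

@torch.no_grad()
def timed_step(fn, *args):
    """Return (output, wall_ms, gpu_ms) for one decoding step."""
    start_evt = torch.cuda.Event(enable_timing=True)
    end_evt = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()           # start from an idle GPU
    t0 = time.perf_counter()
    start_evt.record()
    out = fn(*args)                    # kernels are launched asynchronously
    end_evt.record()
    torch.cuda.synchronize()           # wait for the GPU to finish
    wall_ms = (time.perf_counter() - t0) * 1e3
    gpu_ms = start_evt.elapsed_time(end_evt)
    return out, wall_ms, gpu_ms
```

The gap between `wall_ms` and `gpu_ms` is the CPU-side cost (kernel launches plus pre- and post-processing); when that gap is large, the step is CPU-bound and the end-to-end speedup shrinks regardless of how many draft tokens are accepted.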
I understand. Since the speedup ratio is highly dependent on the runtime environment, would `average_acceptance_len` or `acceptance_ratio` be a better metric for comparing the performance of different draft models? These two metrics are environment-independent.
`average_acceptance_len` and `acceptance_ratio` are alternative metrics, but their drawback is that they do not reflect the overhead of the draft model. Comparing different methods on the same machine is also an option.
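For reference, a rough sketch of how these two metrics can be computed from a per-step log (the function and variable names below are mine, not this repo's API):

```python
# Sketch only: assumes you log, for each verification step, how many
# draft tokens were proposed and how many the target model accepted.
def acceptance_metrics(accepted_per_step, drafted_per_step):
    # Average acceptance length: tokens committed per target forward
    # pass; the "+ 1" is the token the target model itself produces.
    avg_len = sum(a + 1 for a in accepted_per_step) / len(accepted_per_step)
    # Acceptance ratio: fraction of drafted tokens the target accepted.
    ratio = sum(accepted_per_step) / sum(drafted_per_step)
    return avg_len, ratio
```

Note that neither quantity accounts for the time spent running the draft model itself, which is exactly the drawback mentioned above.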
Hi, thanks for your great work! I'm interested in it and am trying to reproduce the results reported in the paper. However, even though I used your open-source code and model checkpoint, I still couldn't reproduce them.
I downloaded `yuhuili/EAGLE-LLaMA3-Instruct-8B` and followed the instructions in the README, running `gen_baseline_answer_llama3chat.py`, `gen_ea_answer_llama3chat.py`, and `speed.py` in sequence on an A100-SXM4-80GB device.
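My understanding of the final computation in `speed.py` is a throughput ratio roughly like the sketch below (the JSONL field names and file names are my guesses, not read from the script):

```python
import json

def tokens_per_second(jsonl_path):
    # Sum generated tokens and wall time over all records; the fields
    # "new_tokens" and "wall_time" are assumptions on my part.
    tokens, seconds = 0, 0.0
    with open(jsonl_path) as f:
        for line in f:
            rec = json.loads(line)
            tokens += rec["new_tokens"]
            seconds += rec["wall_time"]
    return tokens / seconds

# Speedup = EAGLE throughput / vanilla autoregressive throughput.
speedup = tokens_per_second("ea_answers.jsonl") / tokens_per_second("baseline_answers.jsonl")
print(f"speedup: {speedup:.2f}x")
```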
Here is the speedup ratio I obtained.
I understand that the actual speedup ratio depends on the runtime environment, but the discrepancy between my result and the one reported in the paper is still very large. Moreover, I have read the discussion in #5, and I am sure that no other programs were running on the GPU during my test.
Could you provide some suggestions for reproducing the results from the paper?