shaonan1993 opened this issue 3 weeks ago
We use PyTorch as the backend, so during generation the CPU frequently has to launch kernels and perform pre- and post-processing around the computations. The absolute value of the speedup ratio therefore also depends on CPU utilization.
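To make this concrete, here is a minimal timing sketch of my own (not code from this repo) that separates GPU kernel time from the wall-clock time of one decoding step:

```python
import time
import torch

@torch.no_grad()
def timed_step(fn, *args):
    """Return (output, wall_ms, gpu_ms) for one decoding step."""
    start_evt = torch.cuda.Event(enable_timing=True)
    end_evt = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()           # start from an idle GPU
    t0 = time.perf_counter()
    start_evt.record()
    out = fn(*args)                    # kernels are launched asynchronously
    end_evt.record()
    torch.cuda.synchronize()           # wait for the GPU to finish
    wall_ms = (time.perf_counter() - t0) * 1e3
    gpu_ms = start_evt.elapsed_time(end_evt)
    return out, wall_ms, gpu_ms
```

The gap between `wall_ms` and `gpu_ms` is the CPU-side cost (kernel launches plus pre- and post-processing); when that gap is large, the step is CPU-bound and the end-to-end speedup shrinks regardless of how many draft tokens are accepted.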
I understand. Since the speedup ratio is highly dependent on the runtime environment, would `average_acceptance_len` or `acceptance_ratio` be a better metric for comparing the performance of different draft models? These two metrics are environment-independent.
`average_acceptance_len` and `acceptance_ratio` are alternative metrics, but their drawback is that they do not reflect the overhead of the draft model. Comparing different methods on the same machine is also an option.
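For reference, a rough sketch of how these two metrics can be computed from a per-step log (the function and variable names below are mine, not this repo's API):

```python
# Sketch only: assumes you log, for each verification step, how many
# draft tokens were proposed and how many the target model accepted.
def acceptance_metrics(accepted_per_step, drafted_per_step):
    # Average acceptance length: tokens committed per target forward
    # pass; the "+ 1" is the token the target model itself produces.
    avg_len = sum(a + 1 for a in accepted_per_step) / len(accepted_per_step)
    # Acceptance ratio: fraction of drafted tokens the target accepted.
    ratio = sum(accepted_per_step) / sum(drafted_per_step)
    return avg_len, ratio
```

Note that neither quantity accounts for the time spent running the draft model itself, which is exactly the drawback mentioned above.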
Hi, thanks for your great work! I'm interested in it and am trying to reproduce the results reported in the paper. However, even though I used your open-source code and model checkpoint, I still couldn't reproduce them.
I downloaded `yuhuili/EAGLE-LLaMA3-Instruct-8B` and followed the instructions in the README, running `gen_baseline_answer_llama3chat.py`, `gen_ea_answer_llama3chat.py`, and `speed.py` in sequence on an A100-SXM4-80GB device.
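My understanding of the final computation in `speed.py` is a throughput ratio roughly like the sketch below (the JSONL field names and file names are my guesses, not read from the script):

```python
import json

def tokens_per_second(jsonl_path):
    # Sum generated tokens and wall time over all records; the fields
    # "new_tokens" and "wall_time" are assumptions on my part.
    tokens, seconds = 0, 0.0
    with open(jsonl_path) as f:
        for line in f:
            rec = json.loads(line)
            tokens += rec["new_tokens"]
            seconds += rec["wall_time"]
    return tokens / seconds

# Speedup = EAGLE throughput / vanilla autoregressive throughput.
speedup = tokens_per_second("ea_answers.jsonl") / tokens_per_second("baseline_answers.jsonl")
print(f"speedup: {speedup:.2f}x")
```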
Here is the speedup ratio I obtained.
I understand that the actual speedup ratio depends on the runtime environment, but the discrepancy between my result and the one reported in the paper is still very large. Moreover, I have read the discussion in #5, and I am sure that no other programs were running on the GPU during my test.
Could you provide some suggestions for reproducing the results from the paper?