SafeAILab / EAGLE

Official Implementation of EAGLE
https://arxiv.org/pdf/2406.16858
Apache License 2.0

Inference on the CPU, the speed improvement is limited #48

Closed · tigerliu10 closed this 2 months ago

tigerliu10 commented 3 months ago

Dear EAGLE Team,

I've modified the EAGLE code to support the Qwen model, and the results on GPU are quite promising, with speedups of 2.3x to 3x over the baseline model. However, when running inference on the CPU, the MT-bench results are as follows:

```
Speed:  6.706688090355101
Speed0: 5.750603114818664
Ratio:  1.1662582091733495
```

Unfortunately, the speed improvement is only around 1.16x. Could you please provide some suggestions on how to improve the speed on the CPU? Additionally, I'm curious what results you obtained when running inference on the CPU.
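As a sanity check, the ratio is simply the two throughputs divided (presumably tokens per second, as reported by the evaluation script):

```python
# Reported throughputs (presumably tokens/s from the MT-bench eval script)
speed_eagle = 6.706688090355101   # with EAGLE drafting
speed_base = 5.750603114818664    # baseline autoregressive decoding
print(speed_eagle / speed_base)   # 1.1662582..., i.e. only ~1.17x on CPU
```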

CPU config:

```
model name     : Intel(R) Xeon(R) Platinum 8452Y
cpu MHz        : 2000.000
cache size     : 69120 KB
physical id    : 1
siblings       : 72
core id        : 35
cpu cores      : 36
apicid         : 199
initial apicid : 199
```

hongyanz commented 3 months ago

Speculative sampling and EAGLE are built on the premise that there is spare computing capacity available for parallelization. Since CPUs are much weaker at parallel computation than GPUs, it is normal for the acceleration on a CPU to be less pronounced than on a GPU.
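A toy cost model makes this concrete. The numbers below are illustrative assumptions, not measurements from EAGLE; the point is only that the same draft-then-verify cycle pays off on a GPU with idle parallel compute and barely breaks even on a saturated CPU:

```python
# Toy cost model for one speculative-decoding cycle (illustrative only).
# All costs are relative to a single target-model forward pass, so the
# baseline generates 1 token per unit cost.

def expected_speedup(accepted, k, draft_cost, verify_cost):
    """accepted:    mean draft tokens accepted per cycle (<= k)
    k:           number of tokens drafted per cycle
    draft_cost:  cost of drafting one token with the small model
    verify_cost: cost of verifying all k drafts in one forward pass
                 (~1 on a GPU with idle parallel compute; closer to k
                 on a CPU where the batched pass is not nearly free)"""
    tokens_per_cycle = accepted + 1  # accepted drafts + 1 token from the verify pass
    cost_per_cycle = k * draft_cost + verify_cost
    return tokens_per_cycle / cost_per_cycle

# GPU-like regime: parallel verification costs about one forward pass
print(expected_speedup(accepted=3.0, k=5, draft_cost=0.05, verify_cost=1.1))  # ~2.96x
# CPU-like regime: verification cost grows with the number of drafts
print(expected_speedup(accepted=3.0, k=5, draft_cost=0.05, verify_cost=3.5))  # ~1.07x
```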

If you like, we would welcome you to contribute your EAGLE weight for the Qwen model to this repo. We will add you to our contributor list.

Liyuhui-12 commented 3 months ago

When using a CPU, we recommend not using tree attention. Speculative sampling and EAGLE increase the computational load, so acceleration is predicated on having spare computational capacity. Compared to a GPU, a CPU has far less spare capacity, and tree attention consumes more of it; with lower parallel-computing capability, the benefit of tree attention shrinks and can even become negative.

Not using tree attention is just a special case of tree attention (a tree with no branching), so you can conveniently disable it by modifying the configuration. The configuration is mc_sim_7b_63 in model/choices.py; replacing the tree with a single chain turns tree attention off.
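For example, a chain-shaped configuration might look like the sketch below. This assumes the Medusa-style path encoding used by the existing mc_sim_7b_63, where each inner list is a path of child indices from the root; always taking child 0 yields a single chain, i.e. no branching and hence no tree attention. The `chain` helper is purely for illustration; writing the list out literally works just as well:

```python
# model/choices.py -- sketch of a chain-shaped (non-tree) configuration.
# [0] is the top-1 child of the root, [0, 0] its top-1 child, and so on.
# With no branching there is only one draft sequence per cycle.

def chain(depth):
    """Build a single top-1 chain with `depth` draft tokens."""
    return [[0] * i for i in range(1, depth + 1)]

mc_sim_7b_63 = chain(5)
# -> [[0], [0, 0], [0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0, 0]]
```

The chain depth is then the main knob to tune, as described below.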

The optimal value for this configuration depends on the specific hardware. If spare computing resources are limited, you can further shorten the draft length, for example mc_sim_7b_63 = [[0],[0,0],[0,0,0]]. If spare computing resources are ample, you can use a larger tree. Some details can be found at #6.