Is your model running on CPU?
Yes, it is. The code fails even at initializing the KV cache.
We have modified the code, and it now supports CPU. Running inference on a CPU requires changing some default configuration, such as modifying the model loading to:
from eagle.model.ea_model import EaModel  # import path per the EAGLE README

model = EaModel.from_pretrained(
    base_model_path=base_model_path,
    ea_model_path=EAGLE_model_path,
    # torch_dtype=torch.float16,  # keep float32 weights on CPU
    low_cpu_mem_usage=True,
    # device_map="auto"  # no device mapping needed on CPU
)
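For completeness, a minimal sketch of running generation after the CPU load above, following the usage shown in the EAGLE README (eagenerate and the model.tokenizer attribute are taken from there); the only CPU-specific change is keeping input_ids on the CPU instead of calling .cuda():

model.eval()

# Plain prompt for illustration; real use should go through the chat template
# of the chosen base model, as in the EAGLE README examples.
prompt = "Hello, how are you?"
input_ids = model.tokenizer([prompt], return_tensors="pt").input_ids  # stays on CPU

# eagenerate is the speculative-decoding entry point from the EAGLE README.
output_ids = model.eagenerate(input_ids, temperature=0.5, max_new_tokens=512)
print(model.tokenizer.decode(output_ids[0]))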
Great - thanks. That worked.
Dear EAGLE Team,
I've made modifications to the EAGLE code to accommodate the Qwen model, and the speed results are quite promising on GPU, with performance enhancements ranging from 2.3 to 3 times faster than the baseline model. However, when running inference on the CPU, the speed results on MT-bench are as follows:
Speed: 6.706688090355101 Speed0: 5.750603114818664 Ratio: 1.1662582091733495
Unfortunately, the speed improvement is only around 1.16 times faster. Could you please provide some suggestions on how to improve the speed on the CPU? Additionally, I'm curious to know what results you obtained when inferring on the CPU.
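For reference on what those numbers mean: Speed and Speed0 are new-tokens-per-second for EAGLE and the baseline, and Ratio is their quotient (6.71 / 5.75 ≈ 1.17). A rough sketch of such a measurement; the tokens_per_second helper here is hypothetical, not the actual MT-bench harness, and model.base_model is assumed to expose the underlying HF model:

import time

def tokens_per_second(generate_fn, input_ids, **kwargs):
    # Hypothetical helper: wall-clock new-tokens-per-second for one call.
    start = time.perf_counter()
    output_ids = generate_fn(input_ids, **kwargs)
    new_tokens = output_ids.shape[1] - input_ids.shape[1]
    return new_tokens / (time.perf_counter() - start)

speed = tokens_per_second(model.eagenerate, input_ids, max_new_tokens=512)  # EAGLE
speed0 = tokens_per_second(model.base_model.generate, input_ids, max_new_tokens=512)  # baseline
print(f"Speed: {speed} Speed0: {speed0} Ratio: {speed / speed0}")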
@tigerliu10 - Intel has guidelines on CPU inference and other libraries to help. EAGLE is a great method and can be used together with other methods. We're typically getting a 1.5-1.7x speedup, but it's also a fact that accelerators will get greater benefit from EAGLE, since EAGLE increases their utilization by leveraging free cycles.
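One concrete option from that Intel tooling is Intel Extension for PyTorch (IPEX), which applies CPU-side operator fusion and bf16 execution on recent Xeons. A sketch under the assumption that optimizing the wrapped base model composes with EAGLE's custom forward pass (untested here):

import torch
import intel_extension_for_pytorch as ipex

model.eval()
# Optimize the large base model for CPU inference; bf16 needs a Xeon with
# AMX or AVX512-BF16 support. model.base_model is assumed to be the HF model.
model.base_model = ipex.optimize(model.base_model, dtype=torch.bfloat16)

with torch.inference_mode(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    output_ids = model.eagenerate(input_ids, temperature=0.5, max_new_tokens=512)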
Could you tell me which parts need to be changed to adapt EAGLE to Qwen1.5?
Some changes are needed to avoid tree attention. I think these were given in another issue, and with them you should see better CPU performance. However, don't expect the same speedup as on an accelerator, since CPU inference doesn't leave spare compute cycles to leverage.
(see https://github.com/SafeAILab/EAGLE/issues/48#issuecomment-1978037582)
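On the tree-attention point: in EAGLE the draft topology is specified as a list of paths (e.g. mc_sim_7b_63 in the repo's choices.py), and one way to sidestep tree attention is to collapse the tree to a single chain. A sketch; the tree_choices keyword is assumed from EAGLE's eagenerate signature and is not verified against the branch in the linked issue:

# A depth-5 chain: every path extends the single top-1 draft token, so the
# "tree" degenerates to plain sequential draft-and-verify.
chain_5 = [[0], [0, 0], [0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0, 0]]

output_ids = model.eagenerate(
    input_ids,
    temperature=0.0,        # greedy, matching the single-path draft
    max_new_tokens=512,
    tree_choices=chain_5,   # assumed kwarg; the repo's examples pass mc_sim_7b_63 here
)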
@wushixong - Can you share your changes for Qwen? I'm also interested in Qwen. Have you trained an EAGLE version of Qwen? Perhaps the EAGLE team could incorporate your changes and your fine-tuned model. Was it fine-tuned on a Chinese dataset?
When running the sample code for bs>1 (on that branch), I get the following error when the KV cache is initialized. I'm running this inference directly on a Xeon.