SafeAILab / EAGLE

Official Implementation of EAGLE
https://arxiv.org/pdf/2406.16858
Apache License 2.0

KV Cache initialization throwing an error #37

Closed: haim-barad closed this issue 4 months ago

haim-barad commented 4 months ago

When running the sample code with bs>1 (on that branch), I get the following error when the KV cache is initialized. I'm running this inference directly on a Xeon CPU.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[16], line 35
     32 prompt2 = conv.get_prompt()+" "
     34 input_s=model.tokenizer([prompt1,prompt2],return_tensors="pt",padding=True).to("cpu")
---> 35 output_ids=model.eagenerate(input_s.input_ids,input_s.attention_mask,temperature=0.0,max_new_tokens=512,top_k=15)
     36 output=model.tokenizer.batch_decode(output_ids)
     37 print(output)

File ~/anaconda3/lib/python3.11/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File /mnt/BigDisk1T/haim/EAGLE/model/ea_model.py:205, in EaModel.eagenerate(self, input_ids, attention_mask, temperature, top_p, top_k, max_new_tokens, max_length, tree_choices, log)
    199     current_length_data.zero_()
    200 else:
    201     (
    202         past_key_values,
    203         past_key_values_data,
    204         current_length_data,
--> 205     ) = initialize_past_key_values(self.base_model,bs=bs)
    206     self.past_key_values = past_key_values
    207     self.past_key_values_data = past_key_values_data

File /mnt/BigDisk1T/haim/EAGLE/model/kv_cache.py:142, in initialize_past_key_values(model, bs)
    139         bias=0
    140         start_data_m=data_m
    141     past_key_values.append(
--> 142         [
    143             KVCache(past_key_values_data_list[data_m-devices[0].index][2*bias + j], current_length_data[i * 2 + j])
    144             for j in range(2)
    145         ]
    146     )
    147     bias+=1
    148 return past_key_values, past_key_values_data_list, current_length_data

File /mnt/BigDisk1T/haim/EAGLE/model/kv_cache.py:143, in <listcomp>(.0)
    139         bias=0
    140         start_data_m=data_m
    141     past_key_values.append(
    142         [
--> 143             KVCache(past_key_values_data_list[data_m-devices[0].index][2*bias + j], current_length_data[i * 2 + j])
    144             for j in range(2)
    145         ]
    146     )
    147     bias+=1
    148 return past_key_values, past_key_values_data_list, current_length_data

TypeError: unsupported operand type(s) for -: 'NoneType' and 'NoneType'

Liyuhui-12 commented 4 months ago

Is your model running on CPU?

haim-barad commented 4 months ago

Yes, it is. The code fails even at initializing the KV cache.
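
The failing expression is data_m - devices[0].index in kv_cache.py. A likely root cause, inferred from the traceback rather than stated in the thread: PyTorch reports None as the index of a CPU device, so the subtraction raises the TypeError.

import torch

# device.index is an int for CUDA devices but None for CPU, which makes
# expressions like `data_m - devices[0].index` fail on CPU-only runs.
print(torch.device("cuda:0").index)  # 0
print(torch.device("cpu").index)     # None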

Liyuhui-12 commented 4 months ago

We have modified the code, and it now supports CPU. Running inference on a CPU requires changing some default configurations, such as modifying the model loading to:

model = EaModel.from_pretrained(
    base_model_path=base_model_path,
    ea_model_path=EAGLE_model_path,
    # torch_dtype=torch.float16,  # leave unset: weights load as float32, which CPUs execute natively
    low_cpu_mem_usage=True,
    # device_map="auto"           # omit so the model is not mapped onto GPUs and stays on CPU
)
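
For completeness, a minimal end-to-end CPU sketch mirroring the cell from the traceback above; the checkpoint paths are placeholders, and eagenerate is called with attention_mask as on the bs>1 branch shown in the traceback.

from model.ea_model import EaModel

# Placeholder paths; substitute your local checkpoints.
base_model_path = "path/to/base-model"
EAGLE_model_path = "path/to/eagle-weights"

model = EaModel.from_pretrained(
    base_model_path=base_model_path,
    ea_model_path=EAGLE_model_path,
    low_cpu_mem_usage=True,  # float32 weights, no device_map: everything stays on CPU
)
model.eval()

inputs = model.tokenizer(["Hello!"], return_tensors="pt", padding=True)
output_ids = model.eagenerate(
    inputs.input_ids,
    inputs.attention_mask,
    temperature=0.0,
    max_new_tokens=64,
)
print(model.tokenizer.batch_decode(output_ids))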
haim-barad commented 4 months ago

Great - thanks. That worked.

tigerliu10 commented 3 months ago

Dear EAGLE Team,

I've made modifications to the EAGLE code to accommodate the Qwen model, and the speed results are quite promising on GPU, with performance enhancements ranging from 2.3 to 3 times faster than the baseline model. However, when running inference on the CPU, the speed results on MT-bench are as follows:

Speed: 6.706688090355101 Speed0: 5.750603114818664 Ratio: 1.1662582091733495

Unfortunately, the speed improvement is only around 1.16 times faster. Could you please provide some suggestions on how to improve the speed on the CPU? Additionally, I'm curious to know what results you obtained when inferring on the CPU.
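
For reference, the reported ratio is simply the measured speed (presumably tokens per second on MT-bench) divided by the baseline speed:

# Speed / Speed0 as reported above (EAGLE vs. baseline).
speed, speed0 = 6.706688090355101, 5.750603114818664
print(f"Ratio: {speed / speed0:.4f}")  # Ratio: 1.1663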

haim-barad commented 3 months ago

@tigerliu10 - Intel has guidelines on CPU inference and libraries that can help. EAGLE is a great method and can be used together with other methods. We typically see a 1.5-1.7x speedup, but accelerators will get a greater benefit from EAGLE, since EAGLE increases their utilization by leveraging free compute cycles.
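
One concrete option along these lines (an assumption on my part; the comment does not name a specific library) is intel-extension-for-pytorch, which applies CPU-side optimizations such as operator fusion and weight prepacking:

import torch
import intel_extension_for_pytorch as ipex  # pip install intel-extension-for-pytorch

def optimize_for_cpu(model: torch.nn.Module) -> torch.nn.Module:
    # bfloat16 needs AVX512-BF16/AMX support (recent Xeons); use
    # dtype=torch.float32 on older CPUs.
    return ipex.optimize(model.eval(), dtype=torch.bfloat16)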

wushixong commented 3 months ago

Could you advise which parts need to be changed to adapt EAGLE to Qwen1.5?

haim-barad commented 3 months ago

Some changes are needed to avoid tree attention. I think this was covered in another issue; with those changes you should see better CPU performance. However, don't expect the same speedup as on an accelerator, since CPU inference doesn't leave spare compute cycles to leverage.

(see https://github.com/SafeAILab/EAGLE/issues/48#issuecomment-1978037582)
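
As a rough sketch of what avoiding tree attention could look like with the eagenerate signature from the traceback above: a chain-shaped tree_choices, where every node has a single child, degenerates the draft tree into a linear sequence. The path-list format below is an assumption based on the Medusa-style convention; check the linked issue for the exact change.

# Hypothetical chain-shaped draft "tree": each entry is a path of child
# indices, and always taking child 0 yields one linear chain (no tree attention).
chain_choices = [[0], [0, 0], [0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0, 0]]

output_ids = model.eagenerate(
    input_s.input_ids,
    input_s.attention_mask,
    temperature=0.0,
    max_new_tokens=512,
    tree_choices=chain_choices,
)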

@wushixong - Can you share your changes for Qwen? I'm also interested in it. Have you trained an EAGLE version of Qwen? Maybe the EAGLE team can incorporate your changes and your fine-tuned model. Was it fine-tuned on a Chinese dataset?