SafeAILab / EAGLE

Official Implementation of EAGLE-1 and EAGLE-2
https://arxiv.org/pdf/2406.16858
Apache License 2.0

How to handle embedding layernorm #91

Open xiongqisong opened 1 month ago

xiongqisong commented 1 month ago

Some models apply a layernorm to the embedding before sending it to the attention layers. When facing this type of model, do I need to add the embedding layernorm to EAGLE, or is there any other trick needed to make EAGLE output the right tokens? I also don't understand why the -2 is needed when generating training data for Llama, or how to change that -2 in my own ge_data script for another model. So far I have tried dropping the -2 during data generation, and training EAGLE both with and without the embedding layernorm, but neither gives good results in parallel decoding. I'm confused. The model is BlueLM-7B-Chat, thanks for helping me!

Liyuhui-12 commented 1 month ago

The hidden state input to the draft model is after the norm layer, so we did not use a norm layer before the attention in the draft model.
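
For concreteness, a minimal sketch of this fusion step, assuming a LLaMA-style draft model (the class and attribute names here are illustrative, not the repo's exact ones):

```python
import torch
import torch.nn as nn

class DraftInputLayer(nn.Module):
    """Sketch of how the EAGLE draft model fuses its two inputs."""

    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        # FC layer that fuses token embedding and target-model hidden state.
        self.fc = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, input_ids: torch.Tensor, hidden_states: torch.Tensor) -> torch.Tensor:
        emb = self.embed_tokens(input_ids)
        # hidden_states already passed through the target model's norm
        # layer, so no extra norm is applied before the draft attention.
        return self.fc(torch.cat([emb, hidden_states], dim=-1))
```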

Liyuhui-12 commented 1 month ago

What do you mean by -2?

xiongqisong commented 1 month ago

The hidden state input to the draft model is after the norm layer, so we did not use a norm layer before the attention in the draft model.

I don't mean the hidden state; I mean the embedding of the input tokens. The main model generates the first token and sends it to EAGLE, which then embeds the input tokens. Does EAGLE need to apply a layernorm to that embedding before concatenating it with the hidden state?

xiongqisong commented 1 month ago

What do you mean by -2?

In the data generation Python script, the EAGLE code comment is shown below: [screenshot] I don't know whether I need to change this negative number when I try to implement EAGLE on another model.

Liyuhui-12 commented 1 month ago

I don't mean the hidden state; I mean the embedding of the input tokens. The main model generates the first token and sends it to EAGLE, which then embeds the input tokens. Does EAGLE need to apply a layernorm to that embedding before concatenating it with the hidden state?

Due to computational resource constraints, we have not conducted experiments on adding an additional norm layer.

Liyuhui-12 commented 1 month ago

In the data generation Python script, the EAGLE code comment is shown below: [screenshot] I don't know whether I need to change this negative number when I try to implement EAGLE on another model.

This is to ensure the correct position of the loss mask. You can check tokenizer.decode(input_ids[loss_mask_pos]), which should correspond to the human instruction part offset by one token.
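
A sketch of that check, assuming `input_ids` and `loss_mask` are the tensors produced by the ge_data script (the function and variable names are hypothetical):

```python
import torch

def inspect_loss_mask(tokenizer, input_ids: torch.Tensor, loss_mask: torch.Tensor) -> None:
    # Boolean index of the positions where the loss is computed.
    loss_mask_pos = loss_mask.bool()
    # Decode both spans and eyeball the boundary: per the comment above,
    # it should line up with the human-instruction part offset by one
    # token, which is what the -2 adjustment compensates for.
    print("loss positions   :", tokenizer.decode(input_ids[loss_mask_pos]))
    print("ignored positions:", tokenizer.decode(input_ids[~loss_mask_pos]))
```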

xiongqisong commented 1 month ago

I don't mean the hidden state; I mean the embedding of the input tokens. The main model generates the first token and sends it to EAGLE, which then embeds the input tokens. Does EAGLE need to apply a layernorm to that embedding before concatenating it with the hidden state?

Due to computational resource constraints, we have not conducted experiments on adding an additional norm layer.

I tried adding an embedding layernorm to EAGLE to make its structure similar to the original model's. I find that with the embedding layernorm added, EAGLE works well, and if I remove it, EAGLE works badly. I don't know why; I'm just reporting the observation to you~
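
For illustration, a sketch of where that extra norm would sit, assuming the base model uses a LLaMA-style RMSNorm after its embedding (all names here are hypothetical, not EAGLE's actual code):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Stand-in for the base model's embedding-side norm."""

    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        variance = x.pow(2).mean(-1, keepdim=True)
        return self.weight * (x * torch.rsqrt(variance + self.eps))

class DraftInputLayerWithEmbNorm(nn.Module):
    """Variant of the draft input layer with the reported embedding norm."""

    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        self.embed_norm = RMSNorm(hidden_size)  # the added norm layer
        self.fc = nn.Linear(2 * hidden_size, hidden_size)

    def forward(self, input_ids: torch.Tensor, hidden_states: torch.Tensor) -> torch.Tensor:
        # Normalize the token embedding before fusing, mirroring the
        # base model's embedding layernorm.
        emb = self.embed_norm(self.embed_tokens(input_ids))
        return self.fc(torch.cat([emb, hidden_states], dim=-1))
```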

xiongqisong commented 1 month ago

In the data generation Python script, the EAGLE code comment is shown below: [screenshot] I don't know whether I need to change this negative number when I try to implement EAGLE on another model.

This is to ensure the correct position of the loss mask. You can check tokenizer.decode(input_ids[loss_mask_pos]), which should correspond to the human instruction part offset by one token.

Thanks for your reply, now I know how to verify this value! It's very helpful~

xiongqisong commented 1 month ago

I find EAGLE has too many details in the code implementation, so if it's convenient, I wish for more comments or a doc about the code design. I have already added some comments in my fork to help me understand EAGLE's complex logic, including some details that aren't mentioned in the paper.

fousdfrf commented 2 days ago

I don't mean the hidden state; I mean the embedding of the input tokens. The main model generates the first token and sends it to EAGLE, which then embeds the input tokens. Does EAGLE need to apply a layernorm to that embedding before concatenating it with the hidden state?

Due to computational resource constraints, we have not conducted experiments on adding an additional norm layer.

I tried adding an embedding layernorm to EAGLE to make its structure similar to the original model's. I find that with the embedding layernorm added, EAGLE works well, and if I remove it, EAGLE works badly. I don't know why; I'm just reporting the observation to you~

Hello, I would like to ask how much performance improvement adding this norm layer brings. Is it added during training, with EAGLE then retrained, or is it only added during inference?