FlagOpen / FlagScale

FlagScale is a large model toolkit based on open-sourced projects.

[BUG or ENHANCEMENT] Update qk_layernorm. #210

Open ftgreat opened 1 week ago

ftgreat commented 1 week ago

With the current qk_layernorm implementation, training did not converge. A single qk_layernorm, sized for one head, is shared and applied to each head separately, whereas qk_layernorm should act across all heads together. After simply enlarging the shape of the qk_layernorm weights to cover all heads, training converges as expected.

Some models that use qk_layernorm, for reference:

  1. https://github.com/mlfoundations/open_lm/blob/main/open_lm/model.py#L131 used in DCLM-7B
  2. https://github.com/huggingface/transformers/blob/main/src/transformers/models/olmoe/modeling_olmoe.py#L394-L395 used in OLMoE
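
To make the difference concrete, here is a minimal pure-Python sketch of the two variants as I read the issue. It is not FlagScale code: the function names (`qk_norm_shared_per_head`, `qk_norm_all_heads`) and the toy sizes are hypothetical, and it assumes a bias-free LayerNorm over a flat query vector laid out as `num_heads * head_dim`.

```python
import math


def layer_norm(x, weight, eps=1e-5):
    """Bias-free LayerNorm over a flat list, with an elementwise weight."""
    assert len(x) == len(weight)
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * w for v, w in zip(x, weight)]


def qk_norm_shared_per_head(q, num_heads, head_dim, weight):
    """Assumed 'current' behaviour: one weight of size head_dim, reused for
    every head, with statistics computed inside each head separately."""
    assert len(weight) == head_dim
    out = []
    for h in range(num_heads):
        head = q[h * head_dim:(h + 1) * head_dim]
        out.extend(layer_norm(head, weight))
    return out


def qk_norm_all_heads(q, num_heads, head_dim, weight):
    """Assumed 'enlarged' behaviour: one weight of size num_heads * head_dim,
    with statistics computed over all heads jointly."""
    assert len(weight) == num_heads * head_dim
    return layer_norm(q, weight)


# Toy query vector: the second head is the first head scaled by 10.
num_heads, head_dim = 2, 4
q = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0]

shared = qk_norm_shared_per_head(q, num_heads, head_dim, [1.0] * head_dim)
joint = qk_norm_all_heads(q, num_heads, head_dim, [1.0] * (num_heads * head_dim))
```

In this toy case the per-head variant maps both heads to nearly the same values, so the relative scale between heads is lost, while the all-heads variant keeps it and has one learnable weight per element of the full projection. Whether this is the exact cause of the non-convergence reported above is not established by the sketch.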