With the current qk_layernorm implementation, training did not converge. A single shared qk_layernorm weight is applied identically to every head, but each head should have its own learnable parameters. After enlarging the qk_layernorm weight shape to cover all heads, training converges as expected.
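A minimal NumPy sketch of the shape change described above; the function name, tensor layout, and shapes are illustrative assumptions, not the repo's actual code:

```python
import numpy as np

def qk_layernorm(x, weight, eps=1e-6):
    # x: (batch, num_heads, seq, head_dim)
    # Normalize each query/key vector over head_dim, then apply the learnable scale.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    normed = (x - mean) / np.sqrt(var + eps)
    return normed * weight  # weight broadcasts over the leading dims

batch, num_heads, seq, head_dim = 2, 4, 3, 8
x = np.random.randn(batch, num_heads, seq, head_dim)

# Current (shared) weight: one (head_dim,) vector reused by every head.
shared_w = np.ones(head_dim)

# Enlarged (per-head) weight: shape (num_heads, 1, head_dim),
# so each head gets its own scale parameters.
per_head_w = np.ones((num_heads, 1, head_dim))

out_shared = qk_layernorm(x, shared_w)
out_per_head = qk_layernorm(x, per_head_w)
assert out_shared.shape == x.shape and out_per_head.shape == x.shape
```

With all-ones initialization the two variants produce identical outputs; the difference appears during training, when each head's scale can diverge from the others.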
Some models that use qk_layernorm, for reference: