deepseek-ai / DeepSeek-V2

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
MIT License

Failure to reproduce MLA > MHA #23

Open · faresobeid opened this issue 4 months ago

faresobeid commented 4 months ago

I tried out MLA and it performed noticeably worse than MHA, so I wanted to find out why. First, I am using a hybrid model, so I do not use RoPE in either MLA or MHA, and I therefore use the basic (no-RoPE) version of MLA. I suspect the issue could be the part of the paper that says: "In addition, the low-rank compression and fine-grained expert segmentation will impact the output scale of a layer. Therefore, in practice, we employ additional RMS Norm layers after the compressed latent vectors, and multiply additional scaling factors at the width bottlenecks (i.e., the compressed latent vectors and the intermediate hidden states of routed experts) to ensure stable training." It is unclear whether the additional scaling factor is applied before or after the RMSNorm, and what this factor should be. Another possible reason is that the RoPE version of MLA gives it a performance boost.
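
For concreteness, this is roughly what my no-RoPE MLA key/value path looks like. It is only a minimal sketch: the dimensions, the `latent_scale` factor, and its placement are my own guesses, not taken from the paper (and `nn.RMSNorm` needs a recent PyTorch; an equivalent hand-rolled RMSNorm works too).

```python
import torch
import torch.nn as nn


class BasicMLAKV(nn.Module):
    """Basic (no-RoPE) MLA key/value path: compress the hidden state to a
    low-rank latent, normalize it, then expand to per-head keys and values."""

    def __init__(self, d_model=2048, d_latent=512, n_heads=16, d_head=128,
                 latent_scale=1.0):
        super().__init__()
        self.w_a = nn.Linear(d_model, d_latent, bias=False)   # down-projection (W_a)
        self.norm = nn.RMSNorm(d_latent)                       # extra RMSNorm on the latent
        self.w_b = nn.Linear(d_latent, 2 * n_heads * d_head, bias=False)  # up-projection (W_b)
        self.latent_scale = latent_scale  # the scaling factor whose placement is unclear
        self.n_heads, self.d_head = n_heads, d_head

    def forward(self, x):
        c_kv = self.w_a(x)  # compressed latent vector
        # A constant scale applied *before* RMSNorm is cancelled by the
        # normalization, so presumably the factor is multiplied *after* the
        # norm, i.e. at the width bottleneck; that is what I do here.
        c_kv = self.norm(c_kv) * self.latent_scale
        k, v = self.w_b(c_kv).chunk(2, dim=-1)
        head_shape = (*x.shape[:-1], self.n_heads, self.d_head)
        return k.reshape(head_shape), v.reshape(head_shape)
```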

Any clarification on this scaling factor and its placement would be great, thanks!

luofuli commented 4 months ago

The following are factors that affect the final result:

  1. RoPE positional embeddings
  2. Scaling factors (you can check the open-source checkpoint)

@faresobeid
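
For example, something along these lines will dump the MLA-related hyper-parameters from the released checkpoint. The attribute names below are written from memory and may differ from the actual `config.json`; the custom modeling file shipped with the checkpoint shows exactly where the extra RMSNorm layers and scaling factors sit.

```python
from transformers import AutoConfig

# Dump the MLA-related hyper-parameters of the released checkpoint.
# Attribute names are assumptions; any missing one prints "not present".
config = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-V2",
                                    trust_remote_code=True)

for name in ("hidden_size", "num_attention_heads", "kv_lora_rank",
             "q_lora_rank", "qk_nope_head_dim", "qk_rope_head_dim",
             "v_head_dim"):
    print(f"{name}: {getattr(config, name, 'not present')}")
```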

faresobeid commented 4 months ago

Thank you!

faresobeid commented 4 months ago

Sorry to reopen this issue, but I have been having some stability issues at scale with MLA. As I said before, I am using a hybrid model, so the MLA I am using is the basic version with no RoPE:

`kv = W_b(rms_norm(W_a(x)))`

I have also tried

`kv = W_b(rms_norm(W_a(ln(x))))`

but that also has some issues with performance and stability. To be more specific, this is a 24-layer model with model dimension 2048, where the last third of the layers use MLA (~300M parameters). Are there any recommended scaling factors or other ways to mitigate this issue? Thank you
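
For reference, the two variants in code (here `w_a`, `w_b`, `rms_norm`, and `ln` stand for the projections and norms in the formulas above; dimensions omitted):

```python
def kv_variant_a(x, w_a, w_b, rms_norm):
    # kv = W_b(rms_norm(W_a(x))): RMSNorm only on the compressed latent.
    return w_b(rms_norm(w_a(x)))


def kv_variant_b(x, w_a, w_b, rms_norm, ln):
    # kv = W_b(rms_norm(W_a(ln(x)))): extra LayerNorm on the input before
    # compression; in my runs this also has performance/stability issues.
    return w_b(rms_norm(w_a(ln(x))))
```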

luofuli commented 4 months ago

24-layer dense model? @faresobeid

faresobeid commented 4 months ago

Yes. Although stability has been fine without the inner RMSNorm, any recommendations would still be helpful.