Open mx8435 opened 4 months ago
@mx8435 Our early experiments have already verified that, on a 7B dense model, MLA outperforms MHA (we align the overall parameter counts of the two models: MLA's smaller attention parameter count is compensated by increasing the number of layers).
Hi, great job. Did you run an ablation study comparing the performance of MLA and MHA on a dense model? Thanks.
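The parameter-matching idea in the reply can be sketched numerically. The snippet below is a rough, illustrative estimate only (the dimensions and the simplified MLA decomposition are assumptions, not the actual 7B config, and it counts only attention parameters rather than total model parameters): MLA compresses K/V through a shared low-rank latent, so each layer has fewer attention parameters than MHA, which leaves budget for extra layers at matched size.

```python
# Hypothetical sketch: compare per-layer attention parameter counts for MHA
# vs a simplified MLA, and estimate how many extra layers a parameter-matched
# MLA model could afford. All dimensions below are illustrative assumptions.

def mha_params(d_model, n_heads, d_head):
    # Standard MHA: W_Q, W_K, W_V, W_O each map d_model <-> n_heads * d_head.
    return 4 * d_model * n_heads * d_head

def mla_params(d_model, n_heads, d_head, d_kv_latent):
    # Simplified MLA: K/V share a down-projection to a small latent, then
    # per-head up-projections; queries/output kept as in MHA for simplicity.
    down = d_model * d_kv_latent                # W_DKV: d_model -> latent
    up_k = d_kv_latent * n_heads * d_head       # W_UK: latent -> keys
    up_v = d_kv_latent * n_heads * d_head       # W_UV: latent -> values
    w_q = d_model * n_heads * d_head            # query projection
    w_o = n_heads * d_head * d_model            # output projection
    return down + up_k + up_v + w_q + w_o

# Illustrative dimensions (assumed, not from the paper):
d, n, h, c = 4096, 32, 128, 512
per_mha = mha_params(d, n, h)
per_mla = mla_params(d, n, h, c)

# At matched attention-parameter budget, the MLA model can use more layers:
layers_mha = 32
layers_mla = round(layers_mha * per_mha / per_mla)
print(f"MHA per layer: {per_mha:,}")
print(f"MLA per layer: {per_mla:,}")
print(f"layer count at matched budget: {layers_mha} -> {layers_mla}")
```

This only matches attention parameters; the actual comparison in the reply aligned overall model parameters, so FFN width/depth would factor into the real layer count.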