QwenLM / Qwen2.5

Qwen2.5 is the large language model series developed by Qwen team, Alibaba Cloud.

How was the Qwen1.5-MoE-A2.7B model initialized for training? #243

Closed · pkumc closed this 4 months ago

pkumc commented 7 months ago

@JustinLin610 The blog post says: "We first took the existing Qwen-1.8B and transformed it into Qwen1.5-MoE-A2.7B. Furthermore, introducing randomness at the initialization stage significantly speeds up convergence and yields better overall performance throughout pretraining." I have two questions:

  1. Was the model initialized by splitting Qwen1.5-1.8B's intermediate_size of 5504 into 4 smaller experts of 1376 dimensions each, and then introducing randomness by appending 32 randomly initialized dimensions to each expert, giving 1408 dimensions? And were the remaining non-MoE parameters inherited directly from Qwen1.5-1.8B? (See the sketch after this list.)
  2. After initialization, the blog also states: "Thanks to our initialization method, we do not need to train on the same number of tokens to reach strong performance, which also significantly reduces training cost." Roughly how many tokens were used for the continued pretraining?
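For concreteness, here is a minimal PyTorch sketch of the procedure hypothesized in question 1. This is purely the questioner's guess, not a confirmed Qwen recipe (the maintainers have not released details): it splits a dense FFN's weights along the intermediate dimension into 4 expert slices of 1376, then pads each slice with 32 randomly initialized dimensions to reach 1408. The function name, the padding scale `init_std`, and the weight layout (which follows the transformers Qwen2 MLP convention) are all assumptions.

```python
import torch

def upcycle_dense_ffn_to_experts(gate_w, up_w, down_w,
                                 num_experts=4, pad_dim=32, init_std=0.02):
    """Hypothetical upcycling sketch: split a dense FFN
    (intermediate_size=5504) into `num_experts` slices of 1376 and pad
    each with `pad_dim`=32 randomly initialized dimensions, yielding
    experts with intermediate size 1408. Not the confirmed recipe.

    Weight shapes follow the transformers Qwen2 MLP convention:
      gate_w, up_w: (intermediate_size, hidden_size)
      down_w:       (hidden_size, intermediate_size)
    """
    inter, hidden = gate_w.shape
    slice_dim = inter // num_experts  # 5504 // 4 = 1376
    experts = []
    for i in range(num_experts):
        sl = slice(i * slice_dim, (i + 1) * slice_dim)
        # Randomly initialized padding supplies the "randomness at
        # initialization" the blog mentions (the scale is an assumption).
        experts.append({
            "gate_proj": torch.cat(
                [gate_w[sl], torch.randn(pad_dim, hidden) * init_std], dim=0),
            "up_proj": torch.cat(
                [up_w[sl], torch.randn(pad_dim, hidden) * init_std], dim=0),
            "down_proj": torch.cat(
                [down_w[:, sl], torch.randn(hidden, pad_dim) * init_std], dim=1),
        })
    return experts  # each: gate/up (1408, hidden), down (hidden, 1408)
```

Under this hypothesis, with Qwen1.5-1.8B's shapes the resulting expert intermediate size of 1408 matches the question's arithmetic (1376 + 32), and all non-MoE weights would simply be copied over unchanged.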
JustinLin610 commented 6 months ago

Stay tuned for our upcoming tech report. For now, we are not releasing details about this.

github-actions[bot] commented 4 months ago

This issue has been automatically marked as inactive due to lack of recent activity. Should you believe it remains unresolved and warrants attention, kindly leave a comment on this thread.