TsinghuaAI / CPM-2-Pretrain

Code for CPM-2 Pre-Train
MIT License

Model parallelism of CPM2-MoE #15

Closed MichaelXSChen closed 3 years ago

MichaelXSChen commented 3 years ago

May I know how the model is partitioned in CPM-2 MoE? It seems that each rank only takes one expert, and each expert is further partitioned (i.e., 256 model partitions in total)?

Thank you for the info.

zzy14 commented 3 years ago

Hi,

Yes, you are right! There are 32 experts, and the model-parallel size is set to 8 GPUs, so each expert is split across 8 GPUs. Hence, the total number of model partitions is 32 × 8 = 256.
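
For reference, a minimal sketch of how such a rank-to-partition mapping could look under these settings (32 experts, model-parallel size 8). The names and the expert-major ordering here are illustrative assumptions, not taken from the repo's code:

```python
# Illustrative sketch (not from the repo): 32 experts x model-parallel size 8
# gives 256 model partitions, one partition per GPU rank.
NUM_EXPERTS = 32          # one expert per expert group
MODEL_PARALLEL_SIZE = 8   # each expert's weights are split across 8 GPUs

total_partitions = NUM_EXPERTS * MODEL_PARALLEL_SIZE  # 256

def partition_of_rank(rank: int) -> tuple:
    """Return (expert_index, model_parallel_rank) for a global GPU rank,
    assuming an expert-major rank layout (the actual ordering may differ)."""
    expert_index = rank // MODEL_PARALLEL_SIZE
    mp_rank = rank % MODEL_PARALLEL_SIZE
    return expert_index, mp_rank

if __name__ == "__main__":
    print(total_partitions)        # 256
    print(partition_of_rank(0))    # (0, 0)
    print(partition_of_rank(255))  # (31, 7)
```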

XiaoqingNLP commented 2 years ago

@zzy14 @MichaelXSChen What confuses me is this parameter setting: shouldn't it be d_ffn (10240) * 32 here? https://github.com/TsinghuaAI/CPM-2-Pretrain/blob/a00b3dd70d71a796a1ed2a925ddf7902e0209ab3/src/configs/model/enc_dec_xlarge_config.json#L3
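
For context, a back-of-the-envelope sketch of the two readings being compared (a per-expert FFN width of 10240 vs. an aggregate width of 10240 * 32). The hidden size and the two-matrix FFN parameter count assumed below are illustrative, not taken from the repo's config:

```python
# Rough arithmetic for the two readings of the d_ff setting being discussed.
# All values here are assumptions for illustration, not the repo's actual config.
d_model = 4096        # assumed hidden size
d_ff = 10240          # FFN width as listed in the config
num_experts = 32

# Reading 1: d_ff is the width of a single expert's FFN; the 32x factor
# only shows up in the total parameter count across experts.
params_per_expert_ffn = 2 * d_model * d_ff            # input and output projections
total_moe_ffn_params = num_experts * params_per_expert_ffn

# Reading 2: d_ff would instead be written as the aggregate width 10240 * 32.
aggregate_d_ff = d_ff * num_experts

print(f"per-expert FFN params: {params_per_expert_ffn:,}")
print(f"total FFN params over {num_experts} experts: {total_moe_ffn_params:,}")
print(f"aggregate width under the other reading: {aggregate_d_ff:,}")
```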