Raphael-Hao / brainstorm

Compiler for Dynamic Neural Networks

How to merge multiple checkpoints from many ranks into one? #48

Closed chenwenyan closed 10 months ago

chenwenyan commented 10 months ago

Hello Raphael, I encountered an issue while executing the benchmark/swin_moe/gather_ckpt.sh script. It appears that the checkpoint named small_swin_moe_32GPU_16expert/model.pth is used to resume the model, but the data-preparation shell script no longer works because it has expired. Separately, I have successfully trained the Swin-Transformer MoE model on 4 A100 GPUs and saved the checkpoints, but each checkpoint file is suffixed with a different rank_id. Could you please provide guidance on how to merge these checkpoints into a single unified file (model.pth)?

Raphael-Hao commented 10 months ago

Hello, you can check the details of merging checkpoints into a single file here.
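
For readers without access to the linked details, below is a minimal sketch of the general idea of merging per-rank MoE checkpoints into one model.pth. It is not the repository's gather_ckpt.sh logic: the `.rank*` file pattern, the `"experts"` key filter, the `{"model": ...}` wrapping, and the concatenation axis are all assumptions that you should verify against how your checkpoints are actually saved and how the expert parameters are sharded.

```python
# Sketch only: merge per-rank checkpoints into a single state dict.
# Assumed layout: non-expert weights are replicated on every rank,
# expert weights are sharded per rank and keyed with "experts".
import glob
import torch

def merge_rank_checkpoints(pattern: str, out_path: str) -> None:
    """Merge checkpoints saved as <name>.rank<k> into one file."""
    paths = sorted(glob.glob(pattern))            # e.g. ckpt.pth.rank0 ... rank3
    assert paths, f"no checkpoints match {pattern}"
    ckpts = [torch.load(p, map_location="cpu") for p in paths]
    states = [c.get("model", c) for c in ckpts]   # unwrap {"model": ...} if present

    merged = {}
    for key, value in states[0].items():
        if "experts" in key:                      # assumed marker for expert weights
            # expert weights differ per rank: concatenate shards along dim 0
            merged[key] = torch.cat([s[key] for s in states], dim=0)
        else:
            # non-expert weights are replicated, so rank 0's copy suffices
            merged[key] = value

    torch.save({"model": merged}, out_path)

if __name__ == "__main__":
    merge_rank_checkpoints("output/ckpt_epoch_best.pth.rank*", "model.pth")
```

If your checkpoints also store optimizer or EMA states, or the experts are sharded along a different dimension, the script would need to be adjusted accordingly; the authoritative procedure is the one in the linked gather script.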

chenwenyan commented 10 months ago

> Hello, you can check the details of merging checkpoints into a single file here.

Thanks a lot! I found it and will check the updated checkpoints in your scripts.