Closed chenwenyan closed 10 months ago
Hello, you can check the details of merging checkpoints into a single file here
Thanks a lot! I found it and checked the updated checkpoints in your scripts.
Hello Raphael, I ran into an issue while running the benchmark/swin_moe/gather_ckpt.sh script. It expects a checkpoint named small_swin_moe_32GPU_16expert/model.pth to resume the model, but the shell script for data preparation no longer works (its download source has expired). Separately, I have successfully trained the Swin-Transformer MoE model on 4 A100 GPUs and saved the checkpoints, but each checkpoint file is suffixed with a different rank_id. Could you please advise how to merge these per-rank checkpoints into a single unified file (model.pth)?
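For anyone hitting the same question: below is a rough, hypothetical sketch of the usual idea behind merging per-rank MoE checkpoints. In an expert-parallel setup, the dense (non-expert) weights are replicated on every rank, so one copy suffices, while each rank holds different expert weights, which must all be kept. The key pattern (`"experts"`), the `rank{N}` suffix convention, and the function name are assumptions for illustration, not the repo's actual format; in practice each entry would come from `torch.load("...rank{r}.pth")` and the result would go to `torch.save(...)` (plain floats stand in for tensors here).

```python
# Hypothetical sketch of merging per-rank MoE checkpoints into one dict.
# Assumption: keys containing "experts" hold rank-local expert weights;
# all other keys are replicated (identical) across ranks.

def merge_rank_checkpoints(rank_ckpts):
    """Merge a list of per-rank state dicts (list index = rank_id)."""
    merged = {}
    for rank, ckpt in enumerate(rank_ckpts):
        for key, value in ckpt.items():
            if "experts" in key:
                # Expert weights differ per rank: keep every copy,
                # tagged with the rank it came from.
                merged[f"{key}.rank{rank}"] = value
            else:
                # Dense weights are replicated: take the first copy seen.
                merged.setdefault(key, value)
    return merged

# Toy demonstration with two ranks.
rank0 = {"backbone.weight": 1.0, "moe.experts.weight": 10.0}
rank1 = {"backbone.weight": 1.0, "moe.experts.weight": 20.0}
merged = merge_rank_checkpoints([rank0, rank1])
print(sorted(merged.keys()))
```

The actual Swin-MoE gather script may instead re-index experts into a single tensor rather than suffixing keys, so check its source for the exact target layout before relying on a scheme like this.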