jon-chuang opened 8 months ago
I'm hitting the same problem. It may not be caused by NCCL, but by DeepSpeed. When training with multiple GPUs under the traditional DDP paradigm, we usually save the checkpoint on rank 0 only. But when we use ZeRO, the optimizer states (ZeRO stage 1) or the gradients (ZeRO stage 2) are partitioned across ranks, so we need to save on all ranks.
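For context, here's a minimal sketch of the difference (illustrative only; `model_engine` is assumed to be a DeepSpeed engine returned by `deepspeed.initialize`, and the helper names are hypothetical):

```python
import torch
import torch.distributed as dist

# Traditional DDP: every rank holds a full replica of the model and
# optimizer state, so only rank 0 needs to write the checkpoint.
def save_ddp_checkpoint(model, optimizer, path):
    if dist.get_rank() == 0:
        torch.save(
            {"model": model.state_dict(), "optimizer": optimizer.state_dict()},
            path,
        )
    dist.barrier()  # keep other ranks from racing ahead of the write

# DeepSpeed ZeRO: optimizer states (stage 1) or gradients (stage 2)
# are partitioned across ranks, so save_checkpoint() must be called
# on EVERY rank -- each one writes its own shard. Gating it behind
# rank 0 can hang (collective ops inside) or drop the other shards.
def save_zero_checkpoint(model_engine, save_dir, tag):
    model_engine.save_checkpoint(save_dir, tag=tag)
```

If you only call `save_checkpoint` on rank 0, the other ranks' shards never get written, which would explain the symptoms here.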