How to save and load checkpoints in multi-card, multi-machine mode?

Hello, and thanks for the wonderful project! I have a question about saving and loading checkpoints when training on multiple GPUs across multiple machines. As far as I can tell, the original code saves the trained model on every node without specifying which node should do the saving (I may be misunderstanding this). Typically, multi-GPU, multi-machine training with DDP saves the model only on the master process, using a check such as: if dist.get_rank() == 0:
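To make the rank-0 pattern concrete, here is a minimal sketch of what I have in mind. The helper names is_main_process and save_checkpoint are hypothetical and not taken from this repository; the 'model' key follows the loading snippet further down. It also assumes that every machine can see the checkpoint path (for example a shared filesystem), otherwise only the machine hosting rank 0 will have the file.

```python
import torch
import torch.distributed as dist
import torch.nn as nn


def is_main_process() -> bool:
    """Return True on global rank 0, or when torch.distributed is not in use."""
    if not (dist.is_available() and dist.is_initialized()):
        return True
    return dist.get_rank() == 0


def save_checkpoint(model: nn.Module, path: str) -> None:
    """Hypothetical helper: write the checkpoint from a single process only."""
    # Unwrap DataParallel / DistributedDataParallel so the saved keys
    # do not carry the "module." prefix added by the wrapper.
    wrappers = (nn.DataParallel, nn.parallel.DistributedDataParallel)
    to_save = model.module if isinstance(model, wrappers) else model

    if is_main_process():
        torch.save({"model": to_save.state_dict()}, path)

    # Make the other ranks wait until rank 0 has finished writing,
    # so that nobody reads a partially written file.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
```

Because the helper falls back to "main process" when no process group is initialized, the same call should also work in a plain single-GPU run.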

Also, how should saved checkpoints be loaded when training with DDP? I noticed that the original code loads a trained model like this:

ckp = torch.load(checkpoint, map_location='cpu')
nn.DataParallel(model).load_state_dict(ckp['model'])

Does the same approach work with DDP?
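My understanding is that both nn.DataParallel and DistributedDataParallel keep the wrapped network under .module, so state-dict keys saved from a wrapped model start with "module.". One way to avoid depending on the wrapper at load time would be to strip that prefix and load into the bare model before wrapping it. A rough sketch, where strip_module_prefix is a hypothetical helper and not part of this repository:

```python
from collections import OrderedDict
from typing import Any, Mapping

PREFIX = "module."


def strip_module_prefix(state_dict: Mapping[str, Any]) -> "OrderedDict[str, Any]":
    """Return a copy of state_dict with a leading "module." removed from each key.

    Keys without the prefix are copied unchanged, so the helper can be
    applied to checkpoints saved from wrapped or unwrapped models.
    """
    cleaned = OrderedDict()
    for key, value in state_dict.items():
        new_key = key[len(PREFIX):] if key.startswith(PREFIX) else key
        cleaned[new_key] = value
    return cleaned
```

With this, each process could call model.load_state_dict(strip_module_prefix(ckp['model'])) on the unwrapped model, with ckp loaded via map_location='cpu' as above, and only then wrap the model in DistributedDataParallel.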

Looking forward to your reply. Thanks

facebookresearch / AVID-CMA, issue #9