Hello! Wonderful project! I have a question about how to save and load checkpoints when training on multiple GPUs across multiple machines. As far as I can tell, the original code saves the trained model on every node without specifying which node should do the saving (I'm not sure I understand this correctly). Typically, multi-GPU, multi-machine training with DDP saves the model only on the master node, by checking whether the current process is the master, like:
if dist.get_rank() == 0:
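To make clear what I mean, here is a minimal sketch of that rank-0 pattern, assuming PyTorch's `torch.distributed`. The names `is_main_process` and `save_checkpoint` are hypothetical ones of mine, not functions from this repository, and I have not checked how the original code is organized.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel


def is_main_process() -> bool:
    # Hypothetical helper: a run without an initialized process group
    # (e.g. single GPU) is treated as rank 0.
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank() == 0
    return True


def save_checkpoint(model: torch.nn.Module, path: str) -> None:
    # A DDP wrapper keeps the real network under `.module`; saving that
    # avoids the "module." prefix in the state dict keys.
    if isinstance(model, DistributedDataParallel):
        model = model.module
    # Only the master process writes the file.
    if is_main_process():
        torch.save(model.state_dict(), path)
    # Make the other ranks wait until the file has been written.
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
```

With this, every process would call `save_checkpoint(model, path)`, but only rank 0 would write to disk.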
Moreover, when training with DDP, how should the saved checkpoints be loaded? I notice that in the original code, the trained models are loaded like this:
Is it the same with DDP?
Looking forward to your reply. Thanks