Closed: yangdongdong2000 closed this issue 1 month ago
I want to ask whether the DDP strategy is valid in the training code. In train, the function save_model_checkpoint seems to save the model only when global_rank == 0. With two GPUs, the training code looks like it trains two models in parallel on different data but only saves the first one.

DDP replicates the model across all GPUs, but during the backward pass the gradients are synchronized (all-reduced) across all of these copies, so every replica applies the same update and the models remain identical. Since all model copies stay synchronized and identical, saving the model only on global_rank == 0 avoids writing redundant checkpoints from the other ranks and is valid.
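A minimal sketch of the pattern being discussed, not the repository's actual train script: the model, data, checkpoint path, and script name below are placeholders, and save_model_checkpoint is replaced by an inline torch.save guarded by the rank check. It shows why the rank-0-only save is safe: the backward pass all-reduces gradients, so every rank applies the same update.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    global_rank = dist.get_rank()
    torch.cuda.set_device(local_rank)

    model = nn.Linear(10, 1).cuda(local_rank)      # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for step in range(100):
        # Each rank trains on a different shard of data (random tensors here).
        x = torch.randn(32, 10, device=local_rank)
        y = torch.randn(32, 1, device=local_rank)
        loss = nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()      # gradients are all-reduced across ranks here
        optimizer.step()     # every rank applies the same averaged gradient

    # Replicas stay identical after every step, so one copy of the weights is enough.
    if global_rank == 0:
        torch.save(model.module.state_dict(), "checkpoint.pt")  # placeholder path

    dist.barrier()           # let rank 0 finish writing before shutdown
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc_per_node=2 train_sketch.py`, both ranks end each step with identical weights, so the single checkpoint written by rank 0 represents the full training result rather than "the first of two models".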