BAAI-DCAI / SegVol

The official code for "SegVol: Universal and Interactive Volumetric Medical Image Segmentation".

How is the training synchronized across ranks during distributed training? #21

Open · function2-llx opened this issue 5 months ago

function2-llx commented 5 months ago

Dear authors,

As mentioned in https://github.com/BAAI-DCAI/SegVol/issues/2#issuecomment-1827968711, the model was trained with multiple GPUs, and, according to the code, training uses DDP. Let's have a look at a piece of code in train_epoch:

https://github.com/BAAI-DCAI/SegVol/blob/97f91e74a4cf28a43278f597381238963f03a145/train.py#L62-L92

It seems that, for each batch, the loop enumerates every positive class in the batch, computes the loss for that class, and performs a backward pass, during which DDP synchronizes gradients across ranks. Even though each rank processes the same number of batches, different ranks may see different total numbers of positive classes. As a result, the ranks perform different numbers of optimization iterations (and gradient all-reduces), which can cause issues during DDP training: the collectives become misaligned, and a rank that runs out of backward calls leaves the others waiting.
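To make the concern concrete, here is a minimal toy repro I put together (this is not the SegVol code; the two-rank setup, the linear model, and the per-rank class counts are all illustrative assumptions). Each rank processes the "same batch" but sees a different number of positive classes, so rank 0 issues one more backward/all-reduce than rank 1, and the script is expected to hang or error at that point:

```python
import datetime
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def run(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group(
        "gloo", rank=rank, world_size=world_size,
        timeout=datetime.timedelta(seconds=20),
    )

    model = DDP(torch.nn.Linear(4, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # Pretend this batch contains 3 positive classes on rank 0 but only 2 on
    # rank 1 (same number of batches, different number of positive classes).
    num_positive_classes = 3 if rank == 0 else 2

    for step in range(num_positive_classes):
        x = torch.randn(8, 4)
        loss = model(x).mean()
        loss.backward()          # each backward triggers DDP's gradient all-reduce
        optimizer.step()         # -> rank 0 takes 3 steps, rank 1 only 2
        optimizer.zero_grad()
        print(f"rank {rank}: finished per-class step {step}")

    # Rank 0's third all-reduce never finds a partner on rank 1, so the script
    # hangs (or times out / errors once rank 1 tears down the process group).
    # In a full training run the extra all-reduce would instead pair with a
    # backward call from rank 1's *next* batch, silently mixing gradients.
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)
```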

Could you please share your thoughts on this? Thanks.

Best wishes