Dear authors,
As mentioned in https://github.com/BAAI-DCAI/SegVol/issues/2#issuecomment-1827968711, the model was trained with multiple GPUs. Also, according to the code, it is trained with DDP. Let's have a look at a piece of code in `train_epoch`: https://github.com/BAAI-DCAI/SegVol/blob/97f91e74a4cf28a43278f597381238963f03a145/train.py#L62-L92
It seems that for each batch, the loop enumerates each positive class of the batch, calculates the loss, and performs a backward pass, during which gradients are synchronized across ranks by DDP. Even though each rank sees the same number of batches, the ranks may have different total numbers of positive classes. As a result, different ranks will perform different numbers of backward/optimization steps, which can cause issues during DDP training (e.g. mismatched gradient all-reduces), as sketched below.
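For clarity, here is a minimal sketch (not the actual SegVol code) of the loop structure I mean; the batch layout, the `positive_classes` list, and the per-class forward call are hypothetical placeholders:

```python
from torch.nn.parallel import DistributedDataParallel as DDP

def train_epoch(model: DDP, loader, optimizer, loss_fn):
    for batch in loader:                         # every rank sees the same number of batches
        image, labels, positive_classes = batch  # hypothetical batch layout
        for cls in positive_classes:             # the count of positive classes can differ per rank
            optimizer.zero_grad()
            pred = model(image, cls)             # hypothetical per-class forward
            loss = loss_fn(pred, labels[cls])
            loss.backward()                      # DDP all-reduces gradients inside backward()
            optimizer.step()
    # If, for a given batch, rank A has 5 positive classes while rank B has 3,
    # rank A issues 5 gradient all-reduces but rank B issues only 3, so the
    # collective calls no longer line up across ranks.
```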
Could you please share your thoughts on this? Thanks.
Best wishes