BAAI-DCAI / SegVol

The official code for "SegVol: Universal and Interactive Volumetric Medical Image Segmentation".

How is the training synchronized across ranks during distributed training? #21

Open · function2-llx opened this issue 5 months ago

function2-llx commented 5 months ago

Dear authors,

As mentioned in https://github.com/BAAI-DCAI/SegVol/issues/2#issuecomment-1827968711, the model was trained with multiple GPUs, and, according to the code, training uses DDP. Let's have a look at a piece of code in train_epoch:

https://github.com/BAAI-DCAI/SegVol/blob/97f91e74a4cf28a43278f597381238963f03a145/train.py#L62-L92

It seems that, for each batch, the loop enumerates every positive class in the batch, computes the loss for that class, and performs a backward pass, during which DDP synchronizes gradients across ranks. Even though each rank processes the same number of batches, different ranks may see different total numbers of positive classes. As a result, the ranks perform different numbers of optimization iterations (and gradient all-reduces), which can cause issues during DDP training: the collectives become misaligned, and a rank that runs out of backward calls leaves the others waiting.
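To make the concern concrete, here is a minimal toy repro I put together (this is not the SegVol code; the two-rank setup, the linear model, and the per-rank class counts are all illustrative assumptions). Each rank processes the "same batch" but sees a different number of positive classes, so rank 0 issues one more backward/all-reduce than rank 1, and the script is expected to hang or error at that point:

```python
import datetime
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP


def run(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"
    dist.init_process_group(
        "gloo", rank=rank, world_size=world_size,
        timeout=datetime.timedelta(seconds=20),
    )

    model = DDP(torch.nn.Linear(4, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # Pretend this batch contains 3 positive classes on rank 0 but only 2 on
    # rank 1 (same number of batches, different number of positive classes).
    num_positive_classes = 3 if rank == 0 else 2

    for step in range(num_positive_classes):
        x = torch.randn(8, 4)
        loss = model(x).mean()
        loss.backward()          # each backward triggers DDP's gradient all-reduce
        optimizer.step()         # -> rank 0 takes 3 steps, rank 1 only 2
        optimizer.zero_grad()
        print(f"rank {rank}: finished per-class step {step}")

    # Rank 0's third all-reduce never finds a partner on rank 1, so the script
    # hangs (or times out / errors once rank 1 tears down the process group).
    # In a full training run the extra all-reduce would instead pair with a
    # backward call from rank 1's *next* batch, silently mixing gradients.
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)
```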

Could you please share your thoughts on this? Thanks.

Best wishes