Remove dependence on torch.distributed.algorithms.join. Instead size batches such that all ranks always have the same num_batches. This is possible by increasing batch sizes by 1 sample when necessary to keep num_batches equal across ranks.

facebookresearch / dlrm

An implementation of a deep learning recommendation model (DLRM)

MIT License

3.72k stars, 825 forks

Remove dependence on torch.distributed.algorithms.join. Instead size batches such that all ranks always have the same num_batches. This is possible by increasing batch sizes by 1 sample when necessary to keep num_batches equal across ranks. #290

Closed samiwilf closed 1 year ago

samiwilf commented 1 year ago

Summary: Remove the dependence on torch.distributed.algorithms.join. Instead, size batches so that every rank always has the same num_batches. Where necessary, batch sizes are increased by 1 sample to keep num_batches equal across ranks.
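A minimal sketch of the batch-sizing idea, not the code from this change: the helper names `batch_sizes` and `plan_batches`, and the rule of deriving `num_batches` from the smallest rank, are assumptions made for illustration. Each rank splits its samples into the same number of batches, and the leftover samples are absorbed by making some batches 1 sample larger.

```python
import math


def batch_sizes(num_samples: int, num_batches: int) -> list[int]:
    """Split num_samples into exactly num_batches batches.

    Sizes differ by at most 1: the first `extra` batches get one more sample.
    """
    base, extra = divmod(num_samples, num_batches)
    return [base + 1 if i < extra else base for i in range(num_batches)]


def plan_batches(samples_per_rank: list[int], target_batch_size: int) -> list[list[int]]:
    """Return per-rank batch sizes such that every rank has the same num_batches.

    Assumption (hypothetical policy): num_batches is derived from the rank
    with the fewest samples, so no rank ever ends up with an empty batch.
    """
    if target_batch_size < 1 or min(samples_per_rank) < 1:
        raise ValueError("need a positive batch size and at least 1 sample per rank")
    num_batches = math.ceil(min(samples_per_rank) / target_batch_size)
    return [batch_sizes(n, num_batches) for n in samples_per_rank]


# Two ranks with 10 and 11 samples and a target batch size of 4.
plan = plan_batches([10, 11], target_batch_size=4)
for rank, sizes in enumerate(plan):
    print(f"rank {rank}: {len(sizes)} batches, sizes {sizes}")
```

Because every rank runs the same number of iterations, collective calls such as gradient all-reduce stay matched across ranks, which is what makes the join context manager unnecessary under these assumptions.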

Differential Revision: D41174060

LaMa Project: L1141030