Closed — chufanchen closed this issue 7 months ago
Setting: in heterogeneous or shared clusters, workers running on less capable computing resources progress more slowly and become stragglers.
Stragglers slow down the distributed learning process either by prolonging the duration of each iteration (low hardware efficiency) or, depending on the mechanism used for worker coordination, by requiring more iterations for DL models to converge (low statistical efficiency).
Existing load-balancing solutions [1] incur non-negligible computation/communication overheads and are too time-consuming for typical DL workloads, whose iterations are quite short.
[1] Harlap et al., "Addressing the Straggler Problem for Iterative Convergent Parallel ML," SoCC 2016.
LB-BSP adaptively adjusts workers’ batch size based on their processing capabilities, so that all workers can finish each iteration simultaneously.
See #79 for details.
https://dl.acm.org/doi/abs/10.1145/3267809.3275463
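The batch-size adjustment above can be sketched roughly as follows. This is a hypothetical illustration, not the paper's exact algorithm (LB-BSP uses a more sophisticated predictor for worker speed): each worker's next batch size is set proportional to its observed per-sample throughput, keeping the global batch size fixed so that all workers should finish the next iteration at about the same time.

```python
def rebalance_batch_sizes(global_batch, iter_times, batch_sizes):
    """Assign each worker a batch size proportional to its throughput.

    global_batch: total samples per iteration across all workers (kept fixed)
    iter_times:   per-worker wall-clock time of the last iteration
    batch_sizes:  per-worker batch sizes used in the last iteration
    """
    # Per-sample throughput observed in the last iteration.
    speeds = [b / t for b, t in zip(batch_sizes, iter_times)]
    total = sum(speeds)
    # Faster workers get proportionally larger batches; at least 1 sample each.
    return [max(1, round(global_batch * s / total)) for s in speeds]

# Example: 3 workers with equal batches, worker 2 took 3x longer (straggler).
print(rebalance_batch_sizes(96, iter_times=[1.0, 1.0, 3.0],
                            batch_sizes=[32, 32, 32]))
# → [41, 41, 14]  (straggler's batch shrinks; fast workers absorb the load)
```

In this sketch the straggler's next iteration processes 14 samples instead of 32, so its iteration time should drop toward that of the other workers while the global batch size stays at 96.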