Closed — chufanchen closed this issue 7 months ago
Setting: in heterogeneous or shared clusters, workers running on less capable computing resources progress more slowly and become stragglers.
Stragglers slow down the distributed learning process either by prolonging the duration of each iteration (low hardware efficiency) or, depending on the mechanism used for worker coordination, by requiring more iterations for DL models to converge (low statistical efficiency).
Existing load-balancing solutions [1] incur non-negligible computation/communication overheads and are too time-consuming for typical DL workloads, whose iterations are quite short.
[1] Harlap et al., "Addressing the Straggler Problem for Iterative Convergent Parallel ML," SoCC 2016.
LB-BSP adaptively adjusts workers’ batch size based on their processing capabilities, so that all workers can finish each iteration simultaneously.
See #79 for details.
https://dl.acm.org/doi/abs/10.1145/3267809.3275463
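The batch-size adjustment above can be sketched roughly as follows. This is a hypothetical illustration, not the paper's exact algorithm (LB-BSP uses a more sophisticated predictor for worker speed): each worker's next batch size is set proportional to its observed per-sample throughput, keeping the global batch size fixed so that all workers should finish the next iteration at about the same time.

```python
def rebalance_batch_sizes(global_batch, iter_times, batch_sizes):
    """Assign each worker a batch size proportional to its throughput.

    global_batch: total samples per iteration across all workers (kept fixed)
    iter_times:   per-worker wall-clock time of the last iteration
    batch_sizes:  per-worker batch sizes used in the last iteration
    """
    # Per-sample throughput observed in the last iteration.
    speeds = [b / t for b, t in zip(batch_sizes, iter_times)]
    total = sum(speeds)
    # Faster workers get proportionally larger batches; at least 1 sample each.
    return [max(1, round(global_batch * s / total)) for s in speeds]

# Example: 3 workers with equal batches, worker 2 took 3x longer (straggler).
print(rebalance_batch_sizes(96, iter_times=[1.0, 1.0, 3.0],
                            batch_sizes=[32, 32, 32]))
# → [41, 41, 14]  (straggler's batch shrinks; fast workers absorb the load)
```

In this sketch the straggler's next iteration processes 14 samples instead of 32, so its iteration time should drop toward that of the other workers while the global batch size stays at 96.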