For distributed training, there are two main families of data-parallel communication architectures: all-reduce and Parameter Server (PS).
We assume that we have $n$ GPU machines for a data-parallel training job. The DNN model size is $M$ bytes. The network bandwidth is $B$.
all-reduce = reduce-scatter + all-gather
reduce-scatter: Each node sends (and receives) $(n-1)M/n$ bytes of traffic.
all-gather: Each node sends (and receives) $(n-1)M/n$ bytes of traffic.
The time required by the all-reduce operation is therefore $\frac{2(n-1)M}{nB}$, which is proven to be optimal in topologies with uniform link bandwidth [1], assuming no additional resources.
In hierarchical topologies with non-uniform link bandwidth, the optimal hierarchical strategy still requires at least $\frac{2(n'-1)M}{n'B'}$ communication time, where $B'$ is the slowest link bandwidth and $n'$ is the number of nodes attached to the slowest links.
[1] P. Patarasuk and X. Yuan. Bandwidth Optimal All-reduce Algorithms for Clusters of Workstations. Journal of Parallel and Distributed Computing, 2009.
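To make the two phases concrete, here is a minimal simulation of ring all-reduce on NumPy arrays (my own sketch, not BytePS code; the node/chunk bookkeeping is simplified). It checks both the reduced result and the per-node traffic of $2(n-1)M/n$ bytes:

```python
import numpy as np

def ring_allreduce(nodes):
    """Simulate ring all-reduce over `nodes`, a list of n lists of n chunks.
    Returns (reduced data, bytes sent per node)."""
    n = len(nodes)
    data = [list(chunks) for chunks in nodes]  # data[i][j]: node i's copy of chunk j
    sent = [0] * n

    # Phase 1, reduce-scatter: in step s, node i forwards chunk (i-s) mod n
    # to node i+1, which adds it to its own copy. After n-1 steps, node i
    # holds the fully reduced chunk (i+1) mod n.
    for s in range(n - 1):
        out = [data[i][(i - s) % n] for i in range(n)]  # snapshot before updates
        for i in range(n):
            j = (i - s) % n
            data[(i + 1) % n][j] = data[(i + 1) % n][j] + out[i]
            sent[i] += out[i].nbytes

    # Phase 2, all-gather: circulate the completed chunks for n-1 more steps
    # so every node ends up with every fully reduced chunk.
    for s in range(n - 1):
        out = [data[i][(i + 1 - s) % n] for i in range(n)]
        for i in range(n):
            data[(i + 1) % n][(i + 1 - s) % n] = out[i]
            sent[i] += out[i].nbytes
    return data, sent

n, chunk_elems = 4, 8
M = n * chunk_elems * 4  # model size in bytes (float32)
nodes = [np.array_split(np.full(n * chunk_elems, i, dtype=np.float32), n)
         for i in range(n)]
data, sent = ring_allreduce(nodes)
assert all(np.all(c == sum(range(n))) for chunks in data for c in chunks)
assert all(s == 2 * (n - 1) * M // n for s in sent)  # 2(n-1)M/n bytes per node
```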
PS can be deployed in two modes:
- non-colocated mode: the PS processes run on $k$ dedicated CPU machines.
- colocated mode: a PS process runs on every GPU machine; the PS and the GPU worker on the same machine communicate through loopback traffic.
| | All-reduce | Non-colocated PS | Colocated PS |
|---|---|---|---|
| Time | $\frac{2(n-1)M}{nB}$ | $\max(\frac{M}{B}, \frac{nM}{kB})$ | $\frac{2(n-1)M}{nB}$ |
| Optimal | Only if $k=0$ | Only if $k=n$ | Only if $k=0$ |
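A small helper (my own, with illustrative numbers) makes the comparison concrete by evaluating the three expressions above for different $k$:

```python
def comm_times(n, k, M, B):
    """Per-iteration communication time (seconds) under the model above:
    n GPU machines, k CPU machines, model size M bytes, bandwidth B bytes/s."""
    allreduce = 2 * (n - 1) * M / (n * B)
    noncoloc_ps = max(M / B, n * M / (k * B)) if k > 0 else float("inf")
    coloc_ps = 2 * (n - 1) * M / (n * B)
    return allreduce, noncoloc_ps, coloc_ps

# Example: n=8 GPU machines, 1 GB model, 100 Gbps (12.5 GB/s) links.
for k in (2, 8):
    ar, nps, cps = comm_times(8, k, 1e9, 12.5e9)
    print(f"k={k}: all-reduce {ar:.3f}s, non-coloc PS {nps:.3f}s, coloc PS {cps:.3f}s")
# With k=n=8, non-colocated PS reaches the ideal M/B = 0.080s, beating
# all-reduce's 0.140s; with k=2 it degrades to nM/kB = 0.320s.
```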
There are spare CPUs and bandwidth in production GPU clusters.
Existing all-reduce and PS architectures are insufficient:
- Sub-optimal inter-machine communication
- Sub-optimal intra-machine communication
- The CPU bottleneck
Goal: a communication architecture that remains optimal with any number of additional CPU machines.

BytePS splits the PS into a Communication Service (CS) running on each GPU machine and a Summation Service (SS) running on CPU machines (and on the GPU machines' CPUs). Every training iteration, each CS sends a total of $M$ bytes to and receives $M$ bytes from the SSes.
This architecture enables BytePS to flexibly utilize any number of additional CPU resources and network bandwidth.
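A hedged sketch of the per-iteration traffic accounting (the `partition` helper and the even split are my illustration; the paper derives the actual optimal split between CPU-machine and GPU-machine SSes):

```python
def partition(model_bytes, num_ss):
    """Split the model evenly across num_ss SS instances (illustrative only;
    BytePS's real partition balances load between CPU- and GPU-machine SSes)."""
    base, rem = divmod(model_bytes, num_ss)
    return [base + (1 if i < rem else 0) for i in range(num_ss)]

# Each iteration a CS pushes its gradient partitions to the SSes and pulls
# the summed gradients back, so it sends M bytes and receives M bytes total.
M = 1_000_000_000
parts = partition(M, 6)        # e.g. 6 SS instances (hypothetical count)
sent = received = sum(parts)   # push out, then pull the summed result back
assert sent == received == M   # matches the M-bytes-each-way accounting
```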
https://github.com/bytedance/byteps
https://www.usenix.org/conference/osdi20/presentation/jiang