
OSDI 20 | A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters #67

Closed chufanchen closed 7 months ago

chufanchen commented 7 months ago

https://github.com/bytedance/byteps

https://www.usenix.org/conference/osdi20/presentation/jiang

chufanchen commented 7 months ago

For distributed training, there are two families of data-parallel communication architectures: all-reduce and Parameter Server (PS).

We assume that we have $n$ GPU machines for a data-parallel training job. The DNN model size is $M$ bytes. The network bandwidth is $B$.


chufanchen commented 7 months ago

All-reduce

all-reduce = reduce-scatter + all-gather

reduce-scatter: Each node sends (and receives) $(n-1)M/n$ traffic

all-gather: Each node sends (and receives) $(n-1)M/n$ traffic

The time required by the all-reduce operation is $2(n-1)M/(nB)$, which is proven to be optimal in topologies with uniform link bandwidth [1], assuming no additional resources.

In hierarchical topologies with non-uniform link bandwidth, the optimal hierarchical strategy requires at least $2(n'-1)M/(n'B')$ communication time, where $B'$ is the slowest link bandwidth and $n'$ is the number of nodes attached to the slowest links.
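A back-of-envelope sketch of these formulas (illustrative Python; the function names and example numbers are mine, not from the paper):

```python
def ring_allreduce_time(n, M, B):
    """Ring all-reduce on n nodes with uniform link bandwidth B.

    reduce-scatter: n-1 steps, each node sends a chunk of M/n bytes per step.
    all-gather:     n-1 steps, same per-step traffic.
    Per-node traffic is 2(n-1)M/n, so time = 2(n-1)M / (nB).
    """
    return 2 * (n - 1) * M / (n * B)

def hierarchical_lower_bound(n_slow, M, B_slow):
    """Lower bound when n_slow nodes sit behind the slowest links of bandwidth B_slow."""
    return 2 * (n_slow - 1) * M / (n_slow * B_slow)

# Example: 8 machines, a 1 GB model, 100 Gbps (= 12.5 GB/s) links
print(ring_allreduce_time(8, 1e9, 12.5e9))  # ~0.14 s
```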


[1] Bandwidth Optimal Allreduce Algorithms for Clusters of Workstations

chufanchen commented 7 months ago

Parameter Server (PS)

Roles: servers (hold and update the model parameters) and GPU workers (compute gradients and push/pull parameters)

Placement strategies

non-colocated mode: the PS runs on $k$ dedicated CPU machines, so worker-server traffic crosses the network.

colocated mode: a PS process runs on each GPU machine; the PS and the GPU worker on the same machine communicate through loopback traffic.

|  | All-reduce | Non-Colocated PS | Colocated PS |
|---|---|---|---|
| Time | $\frac{2(n-1)M}{nB}$ | $\max\left(\frac{M}{B}, \frac{nM}{kB}\right)$ | $\frac{2(n-1)M}{nB}$ |
| Optimal | Only if $k=0$ | Only if $k=n$ | Only if $k=0$ |
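The table entries can be sanity-checked with a small script (an illustrative sketch; the function names and example numbers are mine):

```python
def allreduce_time(n, M, B):
    return 2 * (n - 1) * M / (n * B)

def noncolocated_ps_time(n, k, M, B):
    # Each GPU worker pushes/pulls M bytes; each of the k CPU machines
    # handles nM/k bytes in each direction. The slower side dominates.
    return max(M / B, n * M / (k * B))

def colocated_ps_time(n, M, B):
    # Each machine is both worker and server for 1/n of the model; the
    # local shard uses loopback, so NIC traffic is 2(n-1)M/n per direction.
    return 2 * (n - 1) * M / (n * B)

n, M, B = 8, 1e9, 12.5e9  # 8 machines, 1 GB model, 100 Gbps links
print(allreduce_time(n, M, B))           # 0.14 s
print(noncolocated_ps_time(n, 8, M, B))  # 0.08 s when k = n
print(colocated_ps_time(n, M, B))        # 0.14 s
```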


chufanchen commented 7 months ago

All-reduce vs. PS

  1. Different communication patterns
    • PS: bipartite graph between workers and servers
    • Non-colocated PS: leverages additional CPU and bandwidth resources, but under-utilizes the resources of the GPU machines
    • Colocated PS and all-reduce: utilize the GPU machines' resources better, but cannot use additional CPU machines
  2. PS supports asynchronous training, which allows GPU workers to run at different speeds and mitigates the impact of stragglers, while all-reduce does not support it. However, asynchronous training is less popular because it can slow down model convergence. (A minimal sketch of the two update modes follows this list.)
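A minimal, hypothetical sketch of the difference between synchronous and asynchronous PS updates (plain Python, not BytePS or any real PS API; names are illustrative):

```python
# Synchronous PS: wait for gradients from all n workers, then update once.
# The implied barrier means a straggler delays every worker.
def sync_ps_step(params, worker_grads, lr):
    avg = [sum(gs) / len(worker_grads) for gs in zip(*worker_grads)]
    return [p - lr * g for p, g in zip(params, avg)]

# Asynchronous PS: apply each worker's gradient as soon as it arrives, so
# fast workers never wait, at the cost of stale gradients (which can slow
# down convergence).
def async_ps_step(params, worker_grad, lr):
    return [p - lr * g for p, g in zip(params, worker_grad)]

params = [0.5, -1.0, 2.0]
grads_from_3_workers = [[0.1, 0.2, 0.3], [0.0, 0.1, 0.2], [0.2, 0.3, 0.4]]
params = sync_ps_step(params, grads_from_3_workers, lr=0.1)
params = async_ps_step(params, [0.1, 0.1, 0.1], lr=0.1)
```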

Motivation

There are spare CPUs and bandwidth in production GPU clusters.

Existing all-reduce and PS architectures are insufficient.

  1. Sub-optimal Inter-machine Communication

    • Using all-reduce $\rightarrow$ cannot leverage CPU machines
    • Using PS $\rightarrow$ may create traffic hotspots when there are not enough CPU machines
    • Existing solutions fail to address the characteristics of heterogeneous clusters
  2. Sub-optimal Intra-machine Communication

    • The NIC bandwidth (100 Gbps for ConnectX-5 EN) is close to PCIe's bandwidth (128 Gbps for PCIe 3.0 x16)
    • Existing solutions cause PCIe contention $\rightarrow$ the NIC cannot be saturated
  3. The CPU Bottleneck

    • On the CPU server: $W' = W - f(\nabla W)$, where $f$ is the optimizer function
    • When $f$ is Adam or RMSProp, it exceeds the memory access limit (with 6-channel DDR4-2666 memory, the maximum number of memory accesses per network byte is $1024/100 \approx 10$; see the back-of-envelope sketch after this list)
    • The CPU server cannot match the network rate (the throughput of RMSProp on CPU is 80 Gbps, while the network throughput is 100 Gbps)
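A back-of-envelope reproduction of the memory-bandwidth argument (my own arithmetic sketch; only the 6-channel DDR4-2666 and 100 Gbps figures come from the text above):

```python
# Peak bandwidth of 6-channel DDR4-2666: 6 channels * 2666 MT/s * 8 bytes.
mem_bw_bytes_per_s = 6 * 2666e6 * 8          # ~128 GB/s
mem_bw_gbps = mem_bw_bytes_per_s * 8 / 1e9   # ~1024 Gbps

nic_gbps = 100

# Bytes of memory the CPU can touch per byte arriving from the network
# before memory bandwidth, rather than the NIC, becomes the bottleneck.
budget = mem_bw_gbps / nic_gbps
print(round(budget))  # ~10

# Optimizers like Adam/RMSProp read and write the parameters plus one or
# two auxiliary state tensors for every gradient byte, which is around or
# above this budget, so the CPU cannot keep up with a 100 Gbps network.
```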

BytePS

Goals: utilize the spare CPU and bandwidth resources in heterogeneous GPU/CPU clusters, achieve optimal communication time for any number of additional CPU machines, and avoid the intra-machine PCIe contention and CPU bottlenecks above.

chufanchen commented 7 months ago

BytePS Architecture

In every training iteration, each CS (Communication Service, running on each GPU machine) sends in total $M$ bytes to, and receives $M$ bytes from, the SS (Summation Service) instances.

This architecture enables BytePS to flexibly utilize any number of additional CPU resources and network bandwidth.
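A toy sketch of this traffic accounting (my own illustration; `cs_traffic_per_iteration` and the share values are hypothetical, not BytePS's actual partition strategy):

```python
def cs_traffic_per_iteration(M, ss_shares):
    """Per-iteration traffic of one Communication Service (CS).

    ss_shares: fraction of the M model bytes assigned to each Summation
    Service (SS). The CS pushes its gradient partitions to the SSs and
    pulls the aggregated results back, so send = receive = M regardless
    of how many SS instances (and hence CPU machines) are used.
    """
    assert abs(sum(ss_shares) - 1.0) < 1e-9
    sent = sum(M * share for share in ss_shares)
    received = sum(M * share for share in ss_shares)
    return sent, received

# 1 GB model split across 4 SS instances: each CS still moves 1 GB each way.
print(cs_traffic_per_iteration(1e9, [0.25, 0.25, 0.25, 0.25]))  # (1e9, 1e9)
```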
