
OSDI 20 | A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters #67

Closed chufanchen closed 7 months ago

chufanchen commented 7 months ago

https://github.com/bytedance/byteps

https://www.usenix.org/conference/osdi20/presentation/jiang

chufanchen commented 7 months ago

For distributed training, there are two families of data-parallel communication architectures: all-reduce and Parameter Server (PS).

We assume that we have $n$ GPU machines for a data-parallel training job. The DNN model size is $M$ bytes. The network bandwidth is $B$.


chufanchen commented 7 months ago

All-reduce

all-reduce = reduce-scatter + all-gather

reduce-scatter: Each node sends (and receives) $(n-1)M/n$ traffic

all-gather: Each node sends (and receives) $(n-1)M/n$ traffic

The time required by the all-reduce operation is $2(n-1)M/(nB)$, which is proven to be optimal in topologies with uniform link bandwidth [1], assuming no additional resources.

In hierarchical topologies with non-uniform link bandwidth, the optimal hierarchical strategy requires at least $2(n'-1)M/(n'B')$ communication time, where $B'$ is the slowest link bandwidth and $n'$ is the number of nodes attached to the slowest links.
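A back-of-envelope sketch of these formulas (illustrative Python; the function names and example numbers are mine, not from the paper):

```python
def ring_allreduce_time(n, M, B):
    """Ring all-reduce on n nodes with uniform link bandwidth B.

    reduce-scatter: n-1 steps, each node sends a chunk of M/n bytes per step.
    all-gather:     n-1 steps, same per-step traffic.
    Per-node traffic is 2(n-1)M/n, so time = 2(n-1)M / (nB).
    """
    return 2 * (n - 1) * M / (n * B)

def hierarchical_lower_bound(n_slow, M, B_slow):
    """Lower bound when n_slow nodes sit behind the slowest links of bandwidth B_slow."""
    return 2 * (n_slow - 1) * M / (n_slow * B_slow)

# Example: 8 machines, a 1 GB model, 100 Gbps (= 12.5 GB/s) links
print(ring_allreduce_time(8, 1e9, 12.5e9))  # ~0.14 s
```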


[1] Bandwidth Optimal Allreduce Algorithms for Clusters of Workstations

chufanchen commented 7 months ago

Parameter Server (PS)

Roles: servers (hold and update the model parameters) and GPU workers (compute gradients and push/pull parameters)

Placement strategies

non-colocated mode: the PS runs on $k$ dedicated CPU machines, so worker-server traffic crosses the network.

colocated mode: a PS process runs on each GPU machine; the PS and the GPU worker on the same machine communicate through loopback traffic.

|  | All-reduce | Non-Colocated PS | Colocated PS |
|---|---|---|---|
| Time | $\frac{2(n-1)M}{nB}$ | $\max\left(\frac{M}{B}, \frac{nM}{kB}\right)$ | $\frac{2(n-1)M}{nB}$ |
| Optimal | Only if $k=0$ | Only if $k=n$ | Only if $k=0$ |
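The table entries can be sanity-checked with a small script (an illustrative sketch; the function names and example numbers are mine):

```python
def allreduce_time(n, M, B):
    return 2 * (n - 1) * M / (n * B)

def noncolocated_ps_time(n, k, M, B):
    # Each GPU worker pushes/pulls M bytes; each of the k CPU machines
    # handles nM/k bytes in each direction. The slower side dominates.
    return max(M / B, n * M / (k * B))

def colocated_ps_time(n, M, B):
    # Each machine is both worker and server for 1/n of the model; the
    # local shard uses loopback, so NIC traffic is 2(n-1)M/n per direction.
    return 2 * (n - 1) * M / (n * B)

n, M, B = 8, 1e9, 12.5e9  # 8 machines, 1 GB model, 100 Gbps links
print(allreduce_time(n, M, B))           # 0.14 s
print(noncolocated_ps_time(n, 8, M, B))  # 0.08 s when k = n
print(colocated_ps_time(n, M, B))        # 0.14 s
```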


chufanchen commented 7 months ago

All-reduce vs. PS

  1. Different communication patterns
    • PS: bipartite graph between workers and servers
    • Non-colocated PS: leverages additional CPU and bandwidth resources, but under-utilizes the resources of the GPU machines
    • Colocated PS and all-reduce: utilize the GPU machines' resources better, but cannot use additional CPU machines
  2. PS supports asynchronous training, which allows GPU workers to run at different speeds and mitigates the impact of stragglers, while all-reduce does not support it. However, asynchronous training is less popular because it can slow down model convergence. (A minimal sketch of the two update modes follows this list.)
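A minimal, hypothetical sketch of the difference between synchronous and asynchronous PS updates (plain Python, not BytePS or any real PS API; names are illustrative):

```python
# Synchronous PS: wait for gradients from all n workers, then update once.
# The implied barrier means a straggler delays every worker.
def sync_ps_step(params, worker_grads, lr):
    avg = [sum(gs) / len(worker_grads) for gs in zip(*worker_grads)]
    return [p - lr * g for p, g in zip(params, avg)]

# Asynchronous PS: apply each worker's gradient as soon as it arrives, so
# fast workers never wait, at the cost of stale gradients (which can slow
# down convergence).
def async_ps_step(params, worker_grad, lr):
    return [p - lr * g for p, g in zip(params, worker_grad)]

params = [0.5, -1.0, 2.0]
grads_from_3_workers = [[0.1, 0.2, 0.3], [0.0, 0.1, 0.2], [0.2, 0.3, 0.4]]
params = sync_ps_step(params, grads_from_3_workers, lr=0.1)
params = async_ps_step(params, [0.1, 0.1, 0.1], lr=0.1)
```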

Motivation

There are spare CPUs and bandwidth in production GPU clusters.

Existing all-reduce and PS architectures are insufficient.

  1. Sub-optimal Inter-machine Communication

    • Using all-reduce $\rightarrow$ cannot leverage CPU machines
    • Using PS $\rightarrow$ may create traffic hotspots when there are not enough CPU machines
    • Existing solutions fail to address the characteristics of heterogeneous clusters
  2. Sub-optimal Intra-machine Communication

    • The NIC bandwidth (100 Gbps for ConnectX-5 EN) is close to PCIe's bandwidth (128 Gbps for PCIe 3.0 x16)
    • Existing solutions cause PCIe contention $\rightarrow$ the NIC cannot be saturated
  3. The CPU Bottleneck

    • On the CPU server: $W' = W - f(\nabla W)$, where $f$ is the optimizer function
    • When $f$ is Adam or RMSProp, it exceeds the memory access limit (with 6-channel DDR4-2666 memory, the maximum number of memory accesses per network byte is $1024/100 \approx 10$; see the back-of-envelope sketch after this list)
    • The CPU server cannot match the network rate (the throughput of RMSProp on CPU is 80 Gbps, while the network throughput is 100 Gbps)
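A back-of-envelope reproduction of the memory-bandwidth argument (my own arithmetic sketch; only the 6-channel DDR4-2666 and 100 Gbps figures come from the text above):

```python
# Peak bandwidth of 6-channel DDR4-2666: 6 channels * 2666 MT/s * 8 bytes.
mem_bw_bytes_per_s = 6 * 2666e6 * 8          # ~128 GB/s
mem_bw_gbps = mem_bw_bytes_per_s * 8 / 1e9   # ~1024 Gbps

nic_gbps = 100

# Bytes of memory the CPU can touch per byte arriving from the network
# before memory bandwidth, rather than the NIC, becomes the bottleneck.
budget = mem_bw_gbps / nic_gbps
print(round(budget))  # ~10

# Optimizers like Adam/RMSProp read and write the parameters plus one or
# two auxiliary state tensors for every gradient byte, which is around or
# above this budget, so the CPU cannot keep up with a 100 Gbps network.
```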

BytePS

Goals: utilize the spare CPU and bandwidth resources in heterogeneous GPU/CPU clusters, achieve optimal communication time for any number of additional CPU machines, and avoid the intra-machine PCIe contention and CPU bottlenecks above.

chufanchen commented 7 months ago

BytePS Architecture

In every training iteration, each CS (Communication Service, running on each GPU machine) sends in total $M$ bytes to, and receives $M$ bytes from, the SS (Summation Service) instances.

This architecture enables BytePS to flexibly utilize any number of additional CPU resources and network bandwidth.
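A toy sketch of this traffic accounting (my own illustration; `cs_traffic_per_iteration` and the share values are hypothetical, not BytePS's actual partition strategy):

```python
def cs_traffic_per_iteration(M, ss_shares):
    """Per-iteration traffic of one Communication Service (CS).

    ss_shares: fraction of the M model bytes assigned to each Summation
    Service (SS). The CS pushes its gradient partitions to the SSs and
    pulls the aggregated results back, so send = receive = M regardless
    of how many SS instances (and hence CPU machines) are used.
    """
    assert abs(sum(ss_shares) - 1.0) < 1e-9
    sent = sum(M * share for share in ss_shares)
    received = sum(M * share for share in ss_shares)
    return sent, received

# 1 GB model split across 4 SS instances: each CS still moves 1 GB each way.
print(cs_traffic_per_iteration(1e9, [0.25, 0.25, 0.25, 0.25]))  # (1e9, 1e9)
```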
