ganler closed this issue 4 years ago
In BytePS, the servers are CPU machines that mainly contribute bandwidth. (Other systems may use a GPU directly as a server.)
Goal: fully utilize CPU-GPU and GPU-GPU bandwidth, as well as CPU compute.
Key idea: Combine PS + All-Reduce.
Observation: as noted in the BytePS paper, the bottleneck of collective communication is CPU-PCIe data transfer (the CPU0-P0 link).
For naive all-reduce, the bottleneck link carries a communication workload of (N-1)/k (N is the total number of GPUs; k is the number of GPUs in one group).
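To make the comparison concrete, here is a back-of-the-envelope calculation using the note's own expressions (the function name and the choice of N=16, k=4 are illustrative, not from the paper):

```python
# Traffic on the bottleneck CPU-PCIe link (CPU0-P0), in units of the
# gradient size M, using the formulas stated in this note.
def bottleneck_traffic(M, N, k):
    """M: gradient size, N: total number of GPUs, k: GPUs per PCIe group."""
    naive_all_reduce = (N - 1) / k * M  # naive all-reduce: (N-1)/k units
    byteps = 1.0 * M                    # BytePS: reduced to 1 unit
    return naive_all_reduce, byteps

naive, byteps = bottleneck_traffic(M=1.0, N=16, k=4)
print(naive, byteps)  # 3.75 1.0
```

So with 16 GPUs in groups of 4, the bottleneck link sees 3.75x less traffic under the BytePS scheme.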
Step 1: Local Reduce-Scatter:
We do only the reduce-scatter half of an all-reduce, locally:
The GPUs under the same PCIe switch (say there are k of them) talk to each other, i.e., each device passes 1/k of its gradients along the ring to the next device. Afterwards, each device holds M/k merged gradients (M is the total gradient size).
Step 2: merged grad => host memory
Step 3: QPI merge
Thus the bottleneck workload drops: (N-1)/k => (1/k) * k = 1, since each of the k GPUs copies only M/k across the bottleneck link. (N is the total number of GPUs; k is the number of GPUs in one group.)
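The three steps above can be simulated in a few lines of plain Python (a toy sketch, assuming 2 machines with k GPUs each and gradients of M floats; the names `grads`, `merged`, `global_sum` are illustrative):

```python
import random

k, M = 4, 8
random.seed(0)
# grads[machine][gpu] is that GPU's local gradient (a list of M floats)
grads = [[[random.random() for _ in range(M)] for _ in range(k)]
         for _ in range(2)]

merged = []
chunk = M // k
for machine in grads:
    # Step 1: local reduce-scatter over the PCIe switch -- GPU i ends up
    # holding the element-wise sum of everyone's i-th M/k chunk.
    chunks = [[sum(g[i * chunk + j] for g in machine) for j in range(chunk)]
              for i in range(k)]
    # Step 2: each GPU copies its merged M/k chunk to host memory;
    # concatenated, the chunks form the machine's fully reduced gradient.
    merged.append([x for c in chunks for x in c])

# Step 3: merge the per-machine partial sums (over QPI / the network).
global_sum = [a + b for a, b in zip(merged[0], merged[1])]

# Sanity check: matches a direct sum over all 2*k gradients.
expected = [sum(g[j] for m in grads for g in m) for j in range(M)]
assert all(abs(a - b) < 1e-9 for a, b in zip(global_sum, expected))
print("ok")
```

The point of the structure is that each GPU pushes only M/k bytes across the bottleneck CPU-PCIe link, so the k copies together cost one gradient's worth of traffic.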
Current optimization methods are usually considered heavy for CPU processors. So what should the optimization step on the server actually do?
Summing is easy for CPUs, since instructions like AVX are optimized for such workloads. However, the parameter-update stage is too heavy for CPUs. Thus, BytePS proposes the Summation Service as the server in the PS architecture (i.e., keep the heavy optimizer computation on the powerful GPUs, and use CPUs only to perform sum operations).
Thus, CPUs and GPUs provide different kinds of services, i.e., each does what it is good at.
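A minimal sketch of this split, assuming plain SGD on the worker side (`server_sum`, `worker_update`, and all names here are illustrative, not BytePS's actual API):

```python
def server_sum(partial_grads):
    """CPU-side Summation Service: only element-wise sum (AVX-friendly)."""
    return [sum(col) for col in zip(*partial_grads)]

def worker_update(params, summed_grad, lr=0.1):
    """GPU-side worker: the heavier optimizer update stays on the GPU."""
    return [p - lr * g for p, g in zip(params, summed_grad)]

params = [1.0, 2.0]
grads_from_workers = [[0.5, 1.0], [0.5, 1.0]]  # two workers' gradients
summed = server_sum(grads_from_workers)        # CPU: element-wise sum
params = worker_update(params, summed)         # GPU: SGD step
print(params)
```

For a real optimizer like Adam, the momentum/variance bookkeeping in `worker_update` is exactly the part that would overwhelm a CPU server, which is why BytePS keeps it on the workers.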
https://static.sched.com/hosted_files/usenixosdi20/00/osdi20-jiang.pdf