ganler / ResearchReading

General system research material (not limited to papers) reading notes.

OSDI'20 | A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters #35

Closed by ganler 4 years ago

ganler commented 4 years ago

https://static.sched.com/hosted_files/usenixosdi20/00/osdi20-jiang.pdf

ganler commented 4 years ago

Previous data-parallel approaches

In BytePS, the servers are CPU machines that contribute spare bandwidth and CPU cycles. (Other systems may directly use a GPU worker as the server.)

What did BytePS do

Fully utilize CPU-GPU and GPU-GPU bandwidth, as well as spare CPU cycles.

Key idea: Combine PS + All-Reduce.


Observation: As noted in BytePS, the bottleneck of collective communication is the CPU-PCIe data transfer (the CPU0-P0 link).

For a naive all-reduce across all N GPUs, roughly (N-1)/k of the communication workload lands on this bottleneck link (N and k are defined below).


Step 1: Local Reduce-Scatter

We run only the reduce-scatter phase of all-reduce locally:

The GPUs under the same PCIe switch (say there are k of them) talk to each other: each device sends 1/k of its gradients to each peer (in a ring implementation, device i forwards a partially-reduced shard to device i+1). Afterwards, each device holds M/k merged gradients, where M is the total gradient size.
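
A minimal sketch of this local reduce-scatter, simulated with NumPy arrays standing in for the per-GPU gradient buffers; the values of k and M are made up for illustration:

```python
import numpy as np

k = 4   # GPUs under one PCIe switch (assumed for illustration)
M = 8   # total gradient elements per GPU (assumed for illustration)

# Each "GPU" starts with its own full gradient of size M.
grads = [np.random.rand(M) for _ in range(k)]

# Reduce-scatter: split each gradient into k shards; GPU i ends up owning
# the sum of shard i from every GPU, i.e. M/k merged elements.
shards = [np.array_split(g, k) for g in grads]
merged = [sum(shards[j][i] for j in range(k)) for i in range(k)]

# Sanity check: concatenating the merged shards equals the full elementwise sum.
assert np.allclose(np.concatenate(merged), np.sum(grads, axis=0))
print([m.shape for m in merged])   # k shards of size M/k each
```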

Step 2: Copy merged gradients to host memory

Each GPU copies its M/k merged gradients across its PCIe switch into host (CPU) memory.


Step 3: QPI merge

The per-switch partial sums now in host memory are aggregated by the CPUs across sockets over QPI.

Thus the bottleneck traffic drops from (N-1)/k to k * (1/k) = 1 (N is the total number of GPUs; k is the number of GPUs in one group, i.e., under one PCIe switch).
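
A quick back-of-the-envelope check of that reduction, using made-up values N = 8 and k = 4, with traffic normalized to the total gradient size M:

```python
N, k = 8, 4                 # assumed: 8 GPUs in total, 4 under each PCIe switch

naive = (N - 1) / k         # naive all-reduce load on the bottleneck CPU-PCIe link
hierarchical = k * (1 / k)  # k GPUs each copy an M/k shard to host memory

print(naive, hierarchical)  # 1.75 vs. 1.0 (in units of M)
```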

ganler commented 4 years ago

Solution to the CPU bottleneck

Common optimizers are usually considered too heavy for CPUs. So what does the optimization step actually consist of?

Summation is easy for CPUs, since instruction sets like AVX are optimized for such workloads. The parameter-update stage (the optimizer itself), however, is too heavy for CPUs. BytePS therefore proposes the Summation Service as the server role in the PS architecture: the heavy update computation moves to the powerful GPUs, and CPUs only perform sums.
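
A minimal sketch of this division of labor, with NumPy standing in for the CPU-side Summation Service and plain SGD standing in for the GPU-side optimizer; the function names, the SGD rule, and the tensor sizes are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def summation_service(partial_grads):
    """CPU side: only sum incoming gradient partitions (cheap, AVX-friendly)."""
    return np.sum(partial_grads, axis=0)

def gpu_optimizer_update(param, summed_grad, lr=0.01):
    """GPU side: run the heavier optimizer math (plain SGD here as a stand-in)."""
    return param - lr * summed_grad

# Toy example: 4 workers each push a gradient partition for one parameter shard.
partials = [np.random.rand(16) for _ in range(4)]
summed = summation_service(partials)         # CPU: summation only
param = np.zeros(16)
param = gpu_optimizer_update(param, summed)  # GPU: parameter update
```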


Thus, CPUs and GPUs provide different kinds of services; i.e., each does what it is good at.