ganler closed this issue 4 years ago
In BytePS, the servers are CPU machines that mainly contribute bandwidth. (Other systems may use a GPU directly as a server.)
Goal: fully utilize CPU-GPU and GPU-GPU bandwidth, as well as CPU compute.
Key idea: Combine PS + All-Reduce.
Observation: as noted in the BytePS paper, the bottleneck of collective communication is CPU-PCIe data transfer (the CPU0-P0 link).
For naive all-reduce, the bottleneck link carries a communication workload of (N-1)/k (N is the total number of GPUs; k is the number of GPUs in one group).
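To make the comparison concrete, here is a back-of-the-envelope calculation using the note's own expressions (the function name and the choice of N=16, k=4 are illustrative, not from the paper):

```python
# Traffic on the bottleneck CPU-PCIe link (CPU0-P0), in units of the
# gradient size M, using the formulas stated in this note.
def bottleneck_traffic(M, N, k):
    """M: gradient size, N: total number of GPUs, k: GPUs per PCIe group."""
    naive_all_reduce = (N - 1) / k * M  # naive all-reduce: (N-1)/k units
    byteps = 1.0 * M                    # BytePS: reduced to 1 unit
    return naive_all_reduce, byteps

naive, byteps = bottleneck_traffic(M=1.0, N=16, k=4)
print(naive, byteps)  # 3.75 1.0
```

So with 16 GPUs in groups of 4, the bottleneck link sees 3.75x less traffic under the BytePS scheme.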
Step 1: Local Reduce-Scatter:
We do only the reduce-scatter half of an all-reduce, locally:
The GPUs under the same PCIe switch (say there are k of them) talk to each other, i.e., each device passes 1/k of its gradients along the ring to the next device. Afterwards, each device holds M/k merged gradients (M is the total gradient size).
Step 2: merged grad => host memory
Step 3: QPI merge
Thus the bottleneck workload drops: (N-1)/k => (1/k) * k = 1, since each of the k GPUs copies only M/k across the bottleneck link. (N is the total number of GPUs; k is the number of GPUs in one group.)
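The three steps above can be simulated in a few lines of plain Python (a toy sketch, assuming 2 machines with k GPUs each and gradients of M floats; the names `grads`, `merged`, `global_sum` are illustrative):

```python
import random

k, M = 4, 8
random.seed(0)
# grads[machine][gpu] is that GPU's local gradient (a list of M floats)
grads = [[[random.random() for _ in range(M)] for _ in range(k)]
         for _ in range(2)]

merged = []
chunk = M // k
for machine in grads:
    # Step 1: local reduce-scatter over the PCIe switch -- GPU i ends up
    # holding the element-wise sum of everyone's i-th M/k chunk.
    chunks = [[sum(g[i * chunk + j] for g in machine) for j in range(chunk)]
              for i in range(k)]
    # Step 2: each GPU copies its merged M/k chunk to host memory;
    # concatenated, the chunks form the machine's fully reduced gradient.
    merged.append([x for c in chunks for x in c])

# Step 3: merge the per-machine partial sums (over QPI / the network).
global_sum = [a + b for a, b in zip(merged[0], merged[1])]

# Sanity check: matches a direct sum over all 2*k gradients.
expected = [sum(g[j] for m in grads for g in m) for j in range(M)]
assert all(abs(a - b) < 1e-9 for a, b in zip(global_sum, expected))
print("ok")
```

The point of the structure is that each GPU pushes only M/k bytes across the bottleneck CPU-PCIe link, so the k copies together cost one gradient's worth of traffic.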
Current optimization methods are usually considered heavy for CPU processors. So what should the optimization step on the server actually do?
Summing is easy for CPUs, since instructions like AVX are optimized for such workloads. However, the parameter-update stage is too heavy for CPUs. Thus, BytePS proposes the Summation Service as the server in the PS architecture (i.e., keep the heavy optimizer computation on the powerful GPUs, and use CPUs only to perform sum operations).
Thus, CPUs and GPUs provide different kinds of services, i.e., each does what it is good at.
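A minimal sketch of this split, assuming plain SGD on the worker side (`server_sum`, `worker_update`, and all names here are illustrative, not BytePS's actual API):

```python
def server_sum(partial_grads):
    """CPU-side Summation Service: only element-wise sum (AVX-friendly)."""
    return [sum(col) for col in zip(*partial_grads)]

def worker_update(params, summed_grad, lr=0.1):
    """GPU-side worker: the heavier optimizer update stays on the GPU."""
    return [p - lr * g for p, g in zip(params, summed_grad)]

params = [1.0, 2.0]
grads_from_workers = [[0.5, 1.0], [0.5, 1.0]]  # two workers' gradients
summed = server_sum(grads_from_workers)        # CPU: element-wise sum
params = worker_update(params, summed)         # GPU: SGD step
print(params)
```

For a real optimizer like Adam, the momentum/variance bookkeeping in `worker_update` is exactly the part that would overwhelm a CPU server, which is why BytePS keeps it on the workers.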
https://static.sched.com/hosted_files/usenixosdi20/00/osdi20-jiang.pdf