PyTorch Distributed: Experiences on Accelerating Data Parallel Training

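PyTorch adds several optimizations on top of the naive AllReduce approach. The first is Gradient Bucketing, which builds on the observation that AllReduce is more efficient on large tensors. The paper backs this up with an experiment: it AllReduces 60M parameters in total while varying the number of parameters handled by each AllReduce call, and finds that the more parameters each call handles, the faster the whole operation completes.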

Screenshot from 2020-07-27 11-29-29
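One way to see why larger AllReduce calls win is a simple latency-plus-bandwidth cost model. The sketch below is a toy model, not a measurement from the paper: `alpha` (fixed per-call latency) and `beta` (seconds per byte) are made-up values, and `allreduce_total_time` is a hypothetical helper name.

```python
def allreduce_total_time(total_params, params_per_call,
                         alpha=50e-6, beta=1e-9, bytes_per_param=4):
    """Estimate the time to AllReduce `total_params` in fixed-size calls.

    alpha: assumed fixed latency paid by every call, in seconds.
    beta: assumed transfer cost, in seconds per byte.
    """
    full_calls, remainder = divmod(total_params, params_per_call)
    sizes = [params_per_call] * full_calls + ([remainder] if remainder else [])
    # Every call pays alpha once, plus a cost proportional to its payload.
    return sum(alpha + size * bytes_per_param * beta for size in sizes)


TOTAL_PARAMS = 60_000_000
for per_call in (1_000, 100_000, 10_000_000):
    seconds = allreduce_total_time(TOTAL_PARAMS, per_call)
    print(f"{per_call:>10} params per call: {seconds:.4f} s")
```

The bandwidth term is the same in every configuration, so the only thing that changes is how many times the fixed latency is paid. That is why the curve flattens once each call is large enough.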

These experiments suggest that, instead of launching a dedicated AllReduce immediately when each gradient tensor becomes available, DDP can achieve higher throughput and lower latency if it waits for a short period of time and buckets multiple gradients into one AllReduce operation. This would be especially helpful for models with many small parameters. However, DDP should not communicate all gradients in one single AllReduce, otherwise, no communication can start before the computation is over.
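The trade-off in the quote can be sketched with a small timeline simulation. This is an illustrative model under stated assumptions, not DDP's scheduler: gradients become ready one by one at a fixed rate, AllReduce calls run one at a time, and every constant (`compute_per_grad`, `alpha`, `beta_per_grad`) is a made-up value in arbitrary time units.

```python
def iteration_time(num_grads, grads_per_bucket,
                   compute_per_grad=1.0, alpha=2.0, beta_per_grad=0.5):
    """Return the time at which the last AllReduce finishes.

    Gradient i becomes ready at (i + 1) * compute_per_grad. A bucket can be
    sent once its last gradient is ready and the previous AllReduce is done.
    """
    comm_end = 0.0
    for start in range(0, num_grads, grads_per_bucket):
        count = min(grads_per_bucket, num_grads - start)
        bucket_ready = (start + count) * compute_per_grad
        comm_start = max(bucket_ready, comm_end)
        # Each call pays a fixed latency plus a cost proportional to its size.
        comm_end = comm_start + alpha + count * beta_per_grad
    return comm_end


for bucket_size in (1, 4, 16):
    print(bucket_size, iteration_time(16, bucket_size))
```

With these numbers, one AllReduce per gradient is dominated by per-call latency, a single bucket cannot start communicating until the backward pass ends, and a medium bucket size finishes first. In real DDP the bucket size is controlled by the `bucket_cap_mb` constructor argument.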

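With bucketing, AllReduce no longer has to wait until the entire backward pass has finished. In that sense it is similar in spirit to BytePS: computation and communication are pipelined, which hides part of the overhead. Two things need attention to make this work. The first is that AllReduce must be executed in the same order on every process. Panel (a) of the figure below illustrates the problem. If gradients are computed in different orders, say process A computes g3 first while process B computes g2 first, the pipeline breaks down, because on process A the first bucket still contains a gradient, g2, that is not ready. PyTorch v1.5.0 addresses this problem by using the reverse order of model.parameters() as the bucketing order, since the reverse order roughly matches the order in which gradients become ready during the backward pass.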
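The ordering rule can be sketched in plain Python. This is a simplified illustration, not DDP's actual reducer: `build_buckets` and `launch_order` are hypothetical helpers, parameter sizes are abstract units, and each parameter is identified by the name of its gradient.

```python
def build_buckets(param_names, sizes, cap):
    """Greedily pack parameters into buckets, walking the list in reverse."""
    buckets, current, current_size = [], [], 0
    for name in reversed(param_names):
        if current and current_size + sizes[name] > cap:
            buckets.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += sizes[name]
    if current:
        buckets.append(current)
    return buckets


def launch_order(buckets, ready_sequence):
    """Return bucket indices in the order their AllReduce is launched.

    A bucket is launched only when all of its gradients are ready AND every
    earlier bucket has already been launched.
    """
    pending = [set(bucket) for bucket in buckets]
    launched, next_bucket = [], 0
    for grad in ready_sequence:
        for waiting in pending:
            waiting.discard(grad)
        while next_bucket < len(pending) and not pending[next_bucket]:
            launched.append(next_bucket)
            next_bucket += 1
    return launched


names = ["g1", "g2", "g3", "g4"]
buckets = build_buckets(names, {name: 1 for name in names}, cap=2)
order_a = launch_order(buckets, ["g4", "g3", "g2", "g1"])  # process A
order_b = launch_order(buckets, ["g2", "g1", "g4", "g3"])  # process B
print(buckets, order_a, order_b)
```

Both processes launch bucket 0 before bucket 1 even though their gradients arrive in different orders. The cost is that process B has to hold its finished second bucket until the first one is ready.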

Screenshot from 2020-07-27 11-50-45

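The second issue is that the graph is dynamic, so each iteration may involve gradients for a different subset of parameters. The buckets, however, are assigned up front when the graph is built, so the two can disagree. In panel (b) of the figure above, for example, g3 is skipped, but DDP keeps waiting for it and hangs forever. The fix is to mark the parameters that were not used in the forward pass as already ready.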
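A minimal sketch of the hang and the fix, again as a plain-Python simulation rather than DDP's real code. `run_backward` is a hypothetical helper, and `used` stands for the set of parameters that took part in the forward pass of the current iteration. In PyTorch itself this behaviour is enabled with the `find_unused_parameters=True` argument of `DistributedDataParallel`.

```python
def run_backward(buckets, used, ready_sequence, mark_unused_ready):
    """Simulate one backward pass; return (launched_buckets, hung)."""
    pending = [set(bucket) for bucket in buckets]
    if mark_unused_ready:
        # Parameters outside `used` never produce a gradient in this
        # iteration, so stop waiting for them before backward starts.
        for waiting in pending:
            waiting.intersection_update(used)
    launched = []
    # The leading None triggers one launch check before any gradient
    # arrives, in case a whole bucket consists of unused parameters.
    for grad in [None, *ready_sequence]:
        for waiting in pending:
            waiting.discard(grad)
        # Buckets are launched strictly in index order.
        while len(launched) < len(pending) and not pending[len(launched)]:
            launched.append(len(launched))
    hung = len(launched) < len(buckets)
    return launched, hung


buckets = [["g4", "g3"], ["g2", "g1"]]
used = {"g4", "g2", "g1"}       # g3 is skipped in this iteration
arrivals = ["g4", "g2", "g1"]   # only used parameters get gradients

print(run_backward(buckets, used, arrivals, mark_unused_ready=False))
print(run_backward(buckets, used, arrivals, mark_unused_ready=True))
```

Without the marking step the first bucket never completes, and because buckets are launched in order, the second bucket is blocked as well.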

Screenshot from 2020-07-27 11-59-31
