ZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters

dyweb / papers-notebook

ZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters #282

Open gaocegege opened 3 years ago

gaocegege commented 3 years ago

https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/

Source: https://github.com/Tencent/PatrickStar feifeibear@

gaocegege commented 3 years ago

https://arxiv.org/abs/1910.02054v3

ZeRO Paper

The paper is long (24 pages) and has quite a few typos. I can hardly get through it.

gaocegege commented 3 years ago

https://zhuanlan.zhihu.com/p/108571246

gaocegege commented 3 years ago

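In data-parallel distributed training, every GPU has to hold a complete copy of the model in its own memory. This introduces a lot of redundancy, and it means a model trained with data parallelism cannot be larger than what fits in a single GPU's memory.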

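ZeRO partitions the model states used in data-parallel training and stores the different partitions on the GPUs of different workers, instead of keeping a full replica on each one. Data-parallel training is then no longer bounded by single-GPU memory; the model size that fits scales linearly with the number of workers.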

ZeRO partitions three kinds of model states, one per stage:

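Optimizer states: 4x memory reduction, with the same communication volume as data parallelism (I'm not sure whether that means the same order of magnitude or exactly the same).
Gradients: 8x memory reduction, same communication volume as data parallelism.
Parameters: the memory reduction is linear in the data-parallel degree Nd (roughly the number of GPU workers). For example, partitioning across 64 GPUs (Nd = 64) reduces memory by 64x. Communication volume goes up by a modest 50%.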

As you can see, ZeRO-3 is the mode in which memory usage drops linearly with Nd.

gaocegege commented 3 years ago

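As an example, take the 1.5B-parameter GPT-2 model, so Ψ = 1.5B. The parameters themselves take 2Ψ bytes (FP16, 2 bytes each), and the gradients take another 2Ψ. On top of that, the optimizer needs an extra KΨ. With Adam, it keeps an FP32 (4-byte) copy of the parameters plus the momentum and the variance, so K = 3 * 4 = 12. That adds up to (2 + 2 + 12)Ψ = 24GB of GPU memory, which is 8x the 2Ψ = 3GB needed for the FP16 parameters alone.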
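The arithmetic above can be checked with a few lines of Python. This is only a sketch of the accounting in this comment; the constant and function names are made up for illustration, and GB here means 10^9 bytes.

```python
# Model-state memory for mixed-precision Adam training, following the
# accounting in the comment above. Sizes are in bytes per parameter.
FP16_PARAM_BYTES = 2   # FP16 working copy of the parameters
FP16_GRAD_BYTES = 2    # FP16 gradients
ADAM_K = 3 * 4         # FP32 master params + momentum + variance, 4 bytes each


def model_state_bytes(num_params: int, k: int = ADAM_K) -> int:
    """Total bytes of model states: (2 + 2 + K) * psi."""
    return (FP16_PARAM_BYTES + FP16_GRAD_BYTES + k) * num_params


psi = 1_500_000_000  # GPT-2 with 1.5B parameters
total_gb = model_state_bytes(psi) / 1e9
params_gb = FP16_PARAM_BYTES * psi / 1e9

print(f"model states: {total_gb:.0f} GB")            # 24 GB
print(f"FP16 parameters alone: {params_gb:.0f} GB")  # 3 GB
print(f"blow-up: {total_gb / params_gb:.0f}x")       # 8x
```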

And that is not all: memory usage = model memory + batch_size * memory per sample. For GPT-2 with batch size 32 and sequences of length 1K, the activations need about 60GB.

The activation memory of a transformer-based model is proportional to the number of transformer layers × hidden dimensions × sequence length × batch size. For a GPT-2 like architecture, the total activations are about 12 × hidden dim × batch × seq length × transformer layers.
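Plugging numbers into the quoted rule of thumb gives roughly the 60GB figure. The sketch below assumes the usual 1.5B GPT-2 configuration (hidden size 1600, 48 layers) and 2 bytes per activation element (FP16); neither assumption is stated in this thread, and `activation_bytes` is just an illustrative name.

```python
def activation_bytes(hidden: int, batch: int, seq_len: int, layers: int,
                     bytes_per_elem: int = 2) -> int:
    """Rule of thumb from the quote: 12 * hidden * batch * seq_len * layers elements."""
    return 12 * hidden * batch * seq_len * layers * bytes_per_elem


# Assumed config for the 1.5B GPT-2 model: hidden size 1600, 48 layers.
act_gb = activation_bytes(hidden=1600, batch=32, seq_len=1024, layers=48) / 1e9
print(f"activations: about {act_gb:.1f} GB")
```

Under these assumptions the estimate lands just above 60GB, which is far more than the 24GB of model states for the same model.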

An introduction to estimating GPU memory usage: https://www.jianshu.com/p/48ec6b49a597

gaocegege commented 3 years ago

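Reducing optimizer memory is the simple part, because optimizer state can be partitioned along with the parameters: each worker only ever updates the momentum, variance, and so on for its own slice of the parameters. So the state can be split into Nd partitions, one per worker, followed by a single AllGather at the end. It is a very simple design, yet it cuts memory usage significantly: from (2 + 2 + K)Ψ down to 4Ψ + KΨ/Nd. For a 7.5B model trained on 64 workers, memory usage drops by about 4x. That is the power of knowledge.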
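A minimal single-process sketch of this idea follows. It makes several simplifying assumptions: SGD with momentum stands in for Adam, plain Python lists stand in for tensors, a loop over ranks stands in for real workers, and the gradients are assumed to be already averaged. `momentum_step` and `shard_bounds` are illustrative names, not DeepSpeed APIs.

```python
def momentum_step(params, grads, momentum, lr=0.1, beta=0.9):
    """One SGD-with-momentum update on plain lists (stand-in for Adam)."""
    new_m = [beta * m + g for m, g in zip(momentum, grads)]
    new_p = [p - lr * m for p, m in zip(params, new_m)]
    return new_p, new_m


def shard_bounds(n, num_workers, rank):
    """[start, end) slice owned by `rank`; assumes n is divisible by num_workers."""
    size = n // num_workers
    return rank * size, (rank + 1) * size


params = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
grads = [0.5] * len(params)   # assumed to be already averaged across workers
num_workers = 4

# Baseline: every worker keeps the full momentum buffer and updates everything.
full_params, _ = momentum_step(params, grads, [0.0] * len(params))

# Partitioned: worker `rank` keeps optimizer state only for its own shard.
shards = []
for rank in range(num_workers):
    lo, hi = shard_bounds(len(params), num_workers, rank)
    local_momentum = [0.0] * (hi - lo)   # 1/Nd of the optimizer state
    new_shard, _ = momentum_step(params[lo:hi], grads[lo:hi], local_momentum)
    shards.append(new_shard)

# All-gather: concatenate the updated shards back into the full parameters.
gathered = [p for shard in shards for p in shard]
print(gathered == full_params)
```

The partitioned update produces the same parameters as the baseline because the update is elementwise, so no worker ever needs another worker's optimizer state.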

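Since this adds an AllGather, network traffic should in principle go up, but the increase should be on the same order of magnitude as plain DP: it amounts to one extra AllGather on top of the AllReduce. The bandwidth requirement is the same as for AllReduce, so it does not change (assuming a ring AllGather). See http://gaocegege.com/Blog/kubernetes/mpi-1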

gaocegege commented 3 years ago

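As for the gradients: each layer's gradients are computed during the backward pass and can be handed to different workers with a ReduceScatter, so that each worker updates a different part of the parameters. Gradient memory then goes from 2Ψ to 2Ψ/Nd, and the total becomes 2Ψ + (K + 2)Ψ/Nd.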

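The paper mentions that the ReduceScatter uses a somewhat unusual implementation. At a glance, its communication volume is on the same order as AllReduce, and the bandwidth requirement is the same as well.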

Effectively this is a Reduce-Scatter operation, where gradients corresponding to different parameters are reduced to different process. To make this more efficient in practice, we use a bucketization strategy, where we bucketize all the gradients corresponding to a particular partition, and perform reduction on the entire bucket at once. This is similar in spirit to how NVIDIA’s AMP [25] optimizer bucketizes the all-reduce gradient computation to overlap communication and computation. In our case we perform a reduction instead of an all-reduce at the partition boundaries to reduce memory footprint and overlap computation and communication
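To make the difference between the two collectives concrete, here is a toy single-process simulation on Python lists. It only shows what each worker ends up holding; it does not model the bucketization or the overlap of communication and computation described in the quote, and both function names are illustrative.

```python
def all_reduce(per_worker_grads):
    """Every worker ends up with the full elementwise sum."""
    n = len(per_worker_grads[0])
    total = [sum(g[i] for g in per_worker_grads) for i in range(n)]
    return [list(total) for _ in per_worker_grads]


def reduce_scatter(per_worker_grads):
    """Worker r ends up with the summed gradients of shard r only."""
    num_workers = len(per_worker_grads)
    size = len(per_worker_grads[0]) // num_workers  # assumes even division
    result = []
    for rank in range(num_workers):
        lo, hi = rank * size, (rank + 1) * size
        result.append([sum(g[i] for g in per_worker_grads) for i in range(lo, hi)])
    return result


# Two workers, each with local gradients for the same four parameters.
local_grads = [
    [1, 2, 3, 4],
    [10, 20, 30, 40],
]

reduced = all_reduce(local_grads)        # each worker holds 4 summed values
scattered = reduce_scatter(local_grads)  # each worker holds 2 summed values
print(reduced)
print(scattered)
```

After the reduce-scatter each worker keeps only 1/Nd of the reduced gradients, which is where the 2Ψ to 2Ψ/Nd saving comes from.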

gaocegege commented 3 years ago

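The handling of parameters is a bit less elegant. Once the parameters are partitioned, any parameters a worker needs from other partitions during the forward or backward pass have to be fetched via broadcast. The paper says that at first glance this looks like it would blow up the communication volume, but in practice it is not that dramatic: only about 1.5x the original. With this in place, memory usage drops to (K + 4)Ψ/Nd.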
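The 1.5x figure can be reproduced with a back-of-the-envelope count, as far as I understand the paper's analysis. The sketch assumes ring-style collectives where a reduce-scatter or an all-gather over Ψ elements moves about Ψ elements per worker when Nd is large; the dictionary and its keys are illustrative names.

```python
# Per-worker communication volume per training step, in units of psi
# (the number of parameters), for large Nd.
REDUCE_SCATTER = 1.0  # about psi elements per worker (assumption: ring collective)
ALL_GATHER = 1.0      # about psi elements per worker (assumption: ring collective)

COMM_VOLUME = {
    # Plain DP: an all-reduce of the gradients is a reduce-scatter plus an all-gather.
    "DP": REDUCE_SCATTER + ALL_GATHER,
    # Partitioned optimizer states and gradients: reduce-scatter the gradients,
    # then all-gather the updated parameters.
    "ZeRO-1/2": REDUCE_SCATTER + ALL_GATHER,
    # Partitioned parameters: collect the parameters once for the forward pass
    # and once for the backward pass, plus the gradient reduce-scatter.
    "ZeRO-3": 2 * ALL_GATHER + REDUCE_SCATTER,
}

for name, volume in COMM_VOLUME.items():
    ratio = volume / COMM_VOLUME["DP"]
    print(f"{name}: {volume:.0f} psi per step, {ratio:.1f}x plain DP")
```

Under these assumptions the first two stages cost the same 2Ψ as plain DP, and only parameter partitioning adds the extra Ψ.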

gaocegege commented 3 years ago

How much GPU memory is saved at each stage
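The per-stage formulas from the comments above can be evaluated directly. This sketch uses the 7.5B-parameter, 64-worker example with K = 12 for Adam; `zero_model_state_gb` is an illustrative helper, not part of DeepSpeed, and stage 0 here simply means plain data parallelism.

```python
def zero_model_state_gb(psi: float, nd: int, stage: int, k: int = 12) -> float:
    """Per-GPU model-state memory in GB (10^9 bytes) for each ZeRO stage."""
    if stage == 0:      # plain DP: everything replicated
        total = (2 + 2 + k) * psi
    elif stage == 1:    # optimizer states partitioned
        total = 4 * psi + k * psi / nd
    elif stage == 2:    # optimizer states + gradients partitioned
        total = 2 * psi + (k + 2) * psi / nd
    elif stage == 3:    # optimizer states + gradients + parameters partitioned
        total = (k + 4) * psi / nd
    else:
        raise ValueError(f"unknown stage: {stage}")
    return total / 1e9


psi, nd = 7.5e9, 64
for stage in range(4):
    print(f"stage {stage}: {zero_model_state_gb(psi, nd, stage):.1f} GB per GPU")
# stage 0: 120.0 GB per GPU
# stage 1: 31.4 GB per GPU
# stage 2: 16.6 GB per GPU
# stage 3: 1.9 GB per GPU
```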

Jack47 commented 3 years ago

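👍 I read this one a while ago too and wrote up a summary here: https://github.com/Jack47/hack-SysML/blob/master/papers/ZeRO.md I only skimmed it back then and will probably read it again soon. I'll come back to your notes when I do 👏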

gaocegege commented 3 years ago

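I've been wondering why nobody proposed this design earlier. @VoVAllen suggested that models trained with data parallelism used to be small, so there was no need for it. That sounds right to me: the need only appeared after Transformers came along.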

Another open question is whether this work could be applied to recommendation systems. That still needs investigating.

Jack47 commented 2 years ago

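Besides models getting bigger (it turns out that the bigger the model, the better the results), steadily growing datasets are another reason. The ZeRO line of work mainly addresses running out of GPU memory under data parallelism, by sharding the optimizer states, parameters, and so on.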

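As for recommendation, it seems to be mostly Parameter Server style architectures with asynchronous updates. Those models have hundreds of millions of features to begin with, so they are already sharded. That is quite different from synchronous (all-reduce) approaches like DDP; it is really a separate kind of architecture within machine learning.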

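WeChat's recent PatrickStar makes some improvements on the sharding in Microsoft's DeepSpeed. I looked at it a while ago too, but I haven't uploaded all of my notes yet.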

gaocegege commented 2 years ago

In recommendation there is now also a trend toward doing part of the work with All-Reduce.

Jack47 commented 2 years ago

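I reread it over the last couple of days, and this paper really is impressive. It is probably standard equipment for large-model training by now. The main reason is that it is very easy to use: it fits seamlessly into existing DDP-style training, and compared with other large-model approaches such as MP and PP, researchers do not need to change their models at all. It is painless to adopt, and the compute efficiency is high.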