OpenLLMAI / OpenRLHF

An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & Mixtral)
https://openrlhf.readthedocs.io/
Apache License 2.0

With Ray multi-node training, is DeepSpeed ZeRO-3 sharding still done across (number of nodes × 8) GPUs? #317

Closed: lma-c4d closed this issue 3 weeks ago

lma-c4d commented 3 weeks ago

I looked at this Ray actor initialization code: https://github.com/OpenLLMAI/OpenRLHF/blob/ea54281e818ebb084a10949a70d45341a009a8c5/openrlhf/trainer/ray/launcher.py#L158

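I have a question: with the code written this way, are the DeepSpeed ZeRO-3 parameters and optimizer states sharded 8 ways within each node, or can the sharding see all the nodes and split across (number of nodes × 8) GPUs?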

hijkzzz commented 3 weeks ago

You need to specify the number of nodes; training runs only on the nodes you specify.
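
For reference, here is a rough sketch of what the two readings in the question would mean for per-GPU memory. It assumes ZeRO-3 shards across the whole data-parallel group, that the world size is (number of nodes × GPUs per node), and the commonly cited 16 bytes per parameter for mixed-precision Adam (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer states). The function names are hypothetical and are not part of OpenRLHF or DeepSpeed.

```python
# Back-of-the-envelope estimate only; activations, buffers and
# fragmentation are ignored.

# Assumption: mixed-precision Adam, 2 (fp16 weights) + 2 (fp16 grads)
# + 12 (fp32 master weights, momentum, variance) bytes per parameter.
BYTES_PER_PARAM = 2 + 2 + 12


def world_size(num_nodes: int, gpus_per_node: int = 8) -> int:
    """Number of ranks if the process group spans every specified node."""
    return num_nodes * gpus_per_node


def zero3_bytes_per_gpu(num_params: int, num_nodes: int, gpus_per_node: int = 8) -> float:
    """Model-state bytes held by each GPU when ZeRO-3 shards across all ranks."""
    return num_params * BYTES_PER_PARAM / world_size(num_nodes, gpus_per_node)


if __name__ == "__main__":
    params = 7_000_000_000  # hypothetical 7B-parameter model
    for nodes in (1, 2):
        gib = zero3_bytes_per_gpu(params, nodes) / 2**30
        print(f"{nodes} node(s) x 8 GPUs: about {gib:.1f} GiB of model state per GPU")
```

If sharding only happened within a node, the per-GPU figure would stay at the 1-node value no matter how many nodes were added; if it spans all specified nodes, it halves when going from 1 to 2 nodes.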