OpenLLMAI / OpenRLHF

An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & Mixtral)
https://openrlhf.readthedocs.io/
Apache License 2.0
1.71k stars, 160 forks

[Question] Is multi-node stage 3 model loading supported? #320

Closed. mickelliu closed this issue 3 weeks ago

mickelliu commented 3 weeks ago

I would like to try ZeRO stage 3 without Adam offloading, but I imagine it would take multiple nodes just to hold the actor's weights. I briefly tried setting the number of GPUs per actor node to 16, but to no avail, so I assume this is not yet supported? I also wonder whether this is technically feasible with OpenRLHF's Ray-based framework, so that I can spend some time looking into it.

hijkzzz commented 3 weeks ago

Please set the number of nodes to 2, the number of GPUs per node to 8, and the ZeRO stage to 3 if you are using a cluster like DGX A100.
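As a rough illustration of why 16 GPUs are requested as 2 nodes of 8 rather than 1 node of 16, here is a minimal sketch. It assumes mixed-precision Adam (about 16 bytes of model state per parameter, fully partitioned under ZeRO stage 3), a hypothetical 70B-parameter actor, and 8 GPUs per machine as on a DGX A100. The function names are made up for this example and are not part of OpenRLHF, and the estimate ignores activations, buffers, and fragmentation.

```python
import math

# Assumption: mixed-precision Adam keeps roughly 16 bytes of model state per parameter:
# 2 (fp16 weights) + 2 (fp16 gradients) + 12 (fp32 master weights, momentum, variance).
BYTES_PER_PARAM = 16


def nodes_needed(total_gpus: int, gpus_per_node: int = 8) -> int:
    """Number of machines required to provide total_gpus GPUs."""
    return math.ceil(total_gpus / gpus_per_node)


def zero3_state_gb_per_gpu(num_params: float, total_gpus: int) -> float:
    """Approximate model-state memory per GPU (in GB) when ZeRO stage 3
    shards weights, gradients, and optimizer states across all GPUs."""
    return num_params * BYTES_PER_PARAM / total_gpus / 1e9


if __name__ == "__main__":
    params = 70e9  # hypothetical 70B-parameter actor
    for gpus in (8, 16):
        print(
            f"{gpus} GPUs -> {nodes_needed(gpus)} node(s), "
            f"~{zero3_state_gb_per_gpu(params, gpus):.0f} GB of model state per GPU"
        )
```

Under these assumptions, one 8-GPU node would need about 140 GB of model state per GPU, while two nodes bring that down to about 70 GB per GPU, which is why the actor has to span more than one machine when Adam offloading is disabled.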

mickelliu commented 3 weeks ago

Got it, I will try it out. Thank you!