OpenLLMAI / OpenRLHF

An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & Mixtral)
https://openrlhf.readthedocs.io/
Apache License 2.0
1.71k stars 160 forks

Error running train_ppo_llama_ray_70b.sh on two H800 machines #316

Closed. yangzhipeng1108 closed this issue 3 weeks ago

yangzhipeng1108 commented 3 weeks ago

[Two screenshots attached; images not preserved]

hijkzzz commented 3 weeks ago

Is there an NCCL connection between the two machines? It is required for the vLLM weight sync. If not, you need to hack the code here to support syncing weights using gloo: https://github.com/OpenLLMAI/OpenRLHF/blob/main/openrlhf/trainer/ray/ppo_actor.py#L85

Finally, please use vLLM v0.4.2, because there is a bug in vLLM 0.4.3.
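One way to check whether NCCL actually works between the two machines, before touching the trainer, is a small standalone all-reduce test. The sketch below is not OpenRLHF code: the function names `pick_backend` and `run_allreduce_check` and the `FORCE_GLOO` environment variable are made up for illustration, and it assumes the script is launched with `torchrun` on both nodes so that `RANK`, `WORLD_SIZE`, `MASTER_ADDR` and `MASTER_PORT` are set.

```python
import os


def pick_backend(cuda_available: bool, nccl_available: bool, force_gloo: bool = False) -> str:
    """Return "nccl" when it is usable and not overridden, otherwise "gloo"."""
    if force_gloo or not (cuda_available and nccl_available):
        return "gloo"
    return "nccl"


def run_allreduce_check(force_gloo: bool = False) -> float:
    """All-reduce a tensor of ones across all ranks and return the summed value."""
    # Imported lazily so pick_backend stays usable without PyTorch installed.
    import torch
    import torch.distributed as dist

    backend = pick_backend(torch.cuda.is_available(), dist.is_nccl_available(), force_gloo)
    # env:// reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE from the environment.
    dist.init_process_group(backend=backend, init_method="env://")
    try:
        if backend == "nccl":
            local_rank = int(os.environ.get("LOCAL_RANK", "0"))
            torch.cuda.set_device(local_rank)
            device = torch.device("cuda", local_rank)
        else:
            device = torch.device("cpu")
        x = torch.ones(1, device=device)
        dist.all_reduce(x, op=dist.ReduceOp.SUM)  # every rank contributes 1
        return x.item()
    finally:
        dist.destroy_process_group()


# Only run the check when launched by torchrun (which sets RANK).
if __name__ == "__main__" and "RANK" in os.environ:
    # FORCE_GLOO is a hypothetical switch for this sketch, not an OpenRLHF or PyTorch option.
    total = run_allreduce_check(force_gloo=os.environ.get("FORCE_GLOO") == "1")
    print(f"rank {os.environ['RANK']}: all_reduce sum = {total}")
```

Saved as, say, `check_nccl.py`, it could be started on each machine with `torchrun --nnodes=2 --nproc_per_node=1 --node_rank=<0 or 1> --master_addr=<node 0 IP> --master_port=29500 check_nccl.py`. If the NCCL run hangs or fails while the same command with `FORCE_GLOO=1` succeeds, that suggests the inter-node NCCL link is the problem; setting `NCCL_DEBUG=INFO` usually gives more detail.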