Error running train_ppo_llama_ray_70b.sh on two H800 machines

OpenLLMAI / OpenRLHF

An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & Mixtral)

https://openrlhf.readthedocs.io/

Apache License 2.0

1.71k stars, 160 forks

Closed: yangzhipeng1108 closed this issue 3 weeks ago

yangzhipeng1108 commented 3 weeks ago

hijkzzz commented 3 weeks ago

Is there an NCCL connection between the two machines? (This is required by the vLLM weights sync.) If not, you need to modify the code here to support syncing weights using gloo: https://github.com/OpenLLMAI/OpenRLHF/blob/main/openrlhf/trainer/ray/ppo_actor.py#L85 Lastly, please use vLLM v0.4.2, because there is a bug in vLLM 0.4.3.
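One way to answer the NCCL question before touching the trainer is a small standalone all-reduce test run across both machines. The script below is a minimal sketch, not part of OpenRLHF: it assumes PyTorch with CUDA is installed on both nodes and that it is launched with `torchrun`. The function names and the `CHECK_BACKEND` environment variable are hypothetical, introduced only for this illustration. It exercises `torch.distributed` only, so a pass suggests the machines can talk over the chosen backend; it does not prove the vLLM weight sync itself will work.

```python
import os


def expected_sum(world_size: int) -> float:
    """Each rank contributes 1.0, so an all-reduce SUM should equal world_size."""
    return float(world_size)


def check_allreduce(backend: str = "nccl") -> bool:
    """Run one all-reduce over `backend` and report whether the result is correct."""
    # Imported lazily so this file can be imported on a machine without PyTorch.
    import torch
    import torch.distributed as dist

    # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT and LOCAL_RANK,
    # which the default env:// initialization reads.
    dist.init_process_group(backend=backend)
    try:
        if backend == "nccl":
            # NCCL needs a CUDA tensor on the GPU assigned to this process.
            local_rank = int(os.environ.get("LOCAL_RANK", "0"))
            torch.cuda.set_device(local_rank)
            device = torch.device("cuda", local_rank)
        else:
            # gloo can all-reduce CPU tensors.
            device = torch.device("cpu")

        t = torch.ones(1, device=device)
        dist.all_reduce(t)  # default reduce op is SUM
        return t.item() == expected_sum(dist.get_world_size())
    finally:
        dist.destroy_process_group()


if __name__ == "__main__":
    # CHECK_BACKEND is a hypothetical variable for this sketch: "nccl" or "gloo".
    backend = os.environ.get("CHECK_BACKEND", "nccl")
    ok = check_allreduce(backend)
    print(f"rank {os.environ.get('RANK', '?')}: {backend} all_reduce ok = {ok}")
```

To try it, save the script under any name (for example `check_allreduce.py`) and start it on each machine with `torchrun --nnodes=2 --nproc_per_node=1 --node_rank=<0 or 1> --master_addr=<IP of node 0> --master_port=29500 check_allreduce.py`. If the run hangs or fails with `nccl` but succeeds with `CHECK_BACKEND=gloo`, that points to the situation described above, where the weight sync would need a gloo path.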