yangzhipeng1108 closed 3 weeks ago
Is there an NCCL connection between the two machines? vLLM weight sync requires one. If not, you will need to hack the code here to sync weights using gloo instead: https://github.com/OpenLLMAI/OpenRLHF/blob/main/openrlhf/trainer/ray/ppo_actor.py#L85 Finally, please use vLLM v0.4.2, because there is a bug in vLLM 0.4.3.
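
For reference, a rough sketch of what a gloo fallback could look like. This is not the OpenRLHF code at the link above, just an illustration of the idea. The `WEIGHT_SYNC_BACKEND` variable and both function names are made up for this example, and it assumes `torch.distributed.init_process_group` has already been called with the chosen backend on every rank.

```python
import os


def choose_sync_backend(env=None):
    """Return the torch.distributed backend name to use for weight sync.

    WEIGHT_SYNC_BACKEND is a hypothetical variable for this sketch,
    not an existing OpenRLHF or vLLM option. Defaults to "nccl".
    """
    env = os.environ if env is None else env
    backend = env.get("WEIGHT_SYNC_BACKEND", "nccl").lower()
    if backend not in ("nccl", "gloo"):
        raise ValueError(f"unsupported backend: {backend!r}")
    return backend


def broadcast_weights(named_params, backend, src=0):
    """Broadcast each tensor from rank `src` over the default process group.

    Assumes the process group was already initialised with `backend`
    on every participating rank.
    """
    # Imported lazily so choose_sync_backend works without torch installed.
    import torch.distributed as dist

    for _name, tensor in named_params:
        if backend == "gloo" and tensor.is_cuda:
            # Stage the tensor through host memory explicitly, then copy
            # the received values back into the GPU tensor.
            cpu_tensor = tensor.detach().cpu()
            dist.broadcast(cpu_tensor, src=src)
            tensor.data.copy_(cpu_tensor)
        else:
            dist.broadcast(tensor.data, src=src)
```

You would call `broadcast_weights(model.named_parameters(), choose_sync_backend())` on every rank. Expect gloo to be noticeably slower than NCCL for large models, since the weights go through CPU memory.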