OpenLLMAI / OpenRLHF

An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & Mixtral)
https://openrlhf.readthedocs.io/
Apache License 2.0
1.71k stars, 160 forks

Status message: Unexpected error occurred: The actor 2c5251641e72297b4e3f4d7f01000000 is unavailable #339

Open lusongshuo-mt opened 2 days ago

lusongshuo-mt commented 2 days ago

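When running train_ppo_ray.sh on multiple nodes (3 machines, each with 8 × 80 GB A100 GPUs), I frequently run into the following problem. The environment configuration is identical on every node. [screenshot: 1719838437456]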

hijkzzz commented 1 day ago

I haven't run into this problem; you could try debugging it.