PaddlePaddle / PARL

A high-performance distributed training framework for Reinforcement Learning
https://parl.readthedocs.io/
Apache License 2.0
3.22k stars 816 forks

ES single-machine multi-GPU training fails with ERR [xparl] lost connection with a job #1111

Open hadoop2xu opened 11 months ago

hadoop2xu commented 11 months ago

ES model, single-machine multi-GPU training.

1. Run xparl start --port 8837 --cpu_num 48
2. Run fleetrun train.py

The following errors are reported:

[07-13 15:41:57 Thread-12 @client.py:301] ERR [xparl] lost connection with a job, current actor num: 19
[07-13 15:41:57 Thread-52 @client.py:301] ERR [xparl] lost connection with a job, current actor num: 18
[07-13 15:41:57 Thread-50 @client.py:301] ERR [xparl] lost connection with a job, current actor num: 17
[07-13 15:41:58 Thread-60 @client.py:301] ERR [xparl] lost connection with a job, current actor num: 16