Closed yangjing14 closed 1 year ago
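In MPI jobs, the ports used for communication between pserver and trainer are allocated by PaddleCloud. Please file a PaddleCloud iCafe card and we will follow up on it. Card submission address: http://newicafe.baidu.com/issues/space/paddle-cloud-user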
Hello, the iCafe card has been submitted. Please follow up, thanks: http://newicafe.baidu.com/issue/paddle-cloud-user-875/show?from=page
1) PaddlePaddle version: Fluid 1.3
2) CPU:
3) GPU: none
4) System environment: MPI cluster
Training info:
1) MPI cluster
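Problem description: When a job is launched on the MPI cluster from Paddle Cloud, the pserver always detects port conflicts during startup. Even after every candidate port has been retried, the conflict is still not resolved, which causes gRPC errors to be reported during the subsequent training.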
Job link: http://10.90.104.41:8900/fileview.html?path=/home/disk1/normandy/maybach/app-user-20190710110816-2163/
Relevant logs:
job.err.log
++ echo '[pserver port] Port 62004 conflict!'
++ echo '[pserver port] Port 62000 conflict!'
++ echo '[pserver port] Port 62001 conflict!'
++ echo '[pserver port] Port 62002 conflict!'
++ echo '[pserver port] Port 62003 conflict!'
++ echo '[INFO] There is no port left can be bind in [sys_pserver_alter_ports_list]'
++ echo '[pserver port] All alternative ports have been used! Port conflict has not been solved!'
trainer.log
F0710 13:11:24.494246 34492 grpc_client.cc:408] GetRPC name:[emb_query.block0], ep:[10.90.104.41:62003], status:[-1] meets grpc error, error_code:4 error_message:Deadline Exceeded error_details:
Check failure stack trace:
@ 0x7f672a10dc0d google::LogMessage::Fail()
@ 0x7f672a1116bc google::LogMessage::SendToLog()
@ 0x7f672a10d733 google::LogMessage::Flush()
@ 0x7f672a112bce google::LogMessageFatal::~LogMessageFatal()
@ 0x7f672ac1530a paddle::operators::distributed::GRPCClient::Proceed()
@ 0x7f6791a518a0 execute_native_thread_routine
@ 0x7f679d5441c3 start_thread
@ 0x7f679cb6c12d __clone
@ (nil) (unknown)
.//paddle/start_trainer.sh: line 112: 31533 Aborted (core dumped) python -u train.py
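For anyone trying to reproduce this on the affected node, a minimal sketch of a manual check is shown here. It tries to bind each candidate port and reports which ones are still free. This is not the Paddle Cloud startup script, whose detection logic is not shown in this issue; the function name `find_free_ports` is hypothetical, and the range 62000-62004 is simply taken from the ports printed in job.err.log above.

```python
import socket


def find_free_ports(candidates, host="0.0.0.0"):
    """Return the candidate ports that can currently be bound on `host`.

    Hypothetical helper for manual diagnosis only; it is an assumption about
    how one might check for conflicts, not the check Paddle Cloud performs.
    """
    free = []
    for port in candidates:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                # bind() raises OSError (EADDRINUSE) if the port is taken.
                sock.bind((host, port))
            except OSError:
                continue
            free.append(port)
    return free


if __name__ == "__main__":
    # Ports 62000-62004 are the ones reported as conflicting in job.err.log.
    candidates = range(62000, 62005)
    free = find_free_ports(candidates)
    busy = [p for p in candidates if p not in free]
    print("free ports:", free)
    print("busy ports:", busy)
```

If every port in the range shows up as busy when no job is running, something else on the machine is probably holding them (for example, leftover pserver processes from an earlier job), which would be worth mentioning in the iCafe card.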