PaddlePaddle / Paddle

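PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (core framework of 『飞桨』 (PaddlePaddle): high-performance single-machine and distributed training and cross-platform deployment for deep learning and machine learning)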
http://www.paddlepaddle.org/
Apache License 2.0

Pserver: Port conflict has not been solved! #18575

Closed yangjing14 closed 1 year ago

yangjing14 commented 5 years ago

1) PaddlePaddle version: Fluid 1.3
2) CPU:
3) GPU: none
4) System environment: MPI cluster

trainer.log:

F0710 13:11:24.494246 34492 grpc_client.cc:408] GetRPC name:[emb_query.block0], ep:[10.90.104.41:62003], status:[-1] meets grpc error, error_code:4 error_message:Deadline Exceeded error_details:
Check failure stack trace:
    @ 0x7f672a10dc0d google::LogMessage::Fail()
    @ 0x7f672a1116bc google::LogMessage::SendToLog()
    @ 0x7f672a10d733 google::LogMessage::Flush()
    @ 0x7f672a112bce google::LogMessageFatal::~LogMessageFatal()
    @ 0x7f672ac1530a paddle::operators::distributed::GRPCClient::Proceed()
    @ 0x7f6791a518a0 execute_native_thread_routine
    @ 0x7f679d5441c3 start_thread
    @ 0x7f679cb6c12d __clone
    @ (nil) (unknown)
.//paddle/start_trainer.sh: line 112: 31533 Aborted (core dumped) python -u train.py

yxfGrace commented 5 years ago

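In MPI jobs, the ports used for communication between the pserver and the trainer are allocated by paddlecloud. Please file a paddlecloud icafe card and we will follow up on it. Cards can be submitted at http://newicafe.baidu.com/issues/space/paddle-cloud-user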
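
While waiting for the card to be handled, a generic socket check can help tell a port that is already taken on a node apart from a pserver endpoint that simply is not answering. The following is only a minimal sketch using the Python standard library. It does not use any Paddle or paddlecloud API, and the helper names `port_is_free` and `endpoint_is_reachable` are hypothetical, introduced here for illustration.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can bind to (host, port) right now.

    Hypothetical helper: run it on the node where the pserver is meant to
    start, before the pserver process itself binds the port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically "address already in use": another process holds the port.
            return False
        return True


def endpoint_is_reachable(endpoint, timeout=3.0):
    """Return True if a TCP connection to "host:port" succeeds within timeout.

    Hypothetical helper: run it on a trainer node against the endpoint that
    appears in the trainer log.
    """
    host, port = endpoint.rsplit(":", 1)
    try:
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False
```

For example, calling `endpoint_is_reachable("10.90.104.41:62003")` from the trainer node tests the endpoint shown in the log above. A successful TCP connection only shows that something is listening on that port; it does not confirm that the listener is the expected pserver, so the result is a hint for the paddlecloud follow-up rather than a diagnosis.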

yangjing14 commented 5 years ago

Hello, the icafe card has been submitted. Please follow up, thanks: http://newicafe.baidu.com/issue/paddle-cloud-user-875/show?from=page