PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (the 『飞桨』 (PaddlePaddle) core framework: high-performance single-machine and distributed training, and cross-platform deployment, for deep learning and machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

mpi fleet distributed training error: PaddleCheckError: internal error in RPCClient #21089

Closed: maosengshulei closed this issue 1 year ago

maosengshulei commented 4 years ago
guru4elephant commented 4 years ago

Please provide complete information to reproduce the issue so that we can investigate.

maosengshulei commented 4 years ago

> Please provide complete information to reproduce the issue so that we can investigate.

```
Fri Nov 8 14:58:45 2019
[1,1] ----------------------
[1,1] Error Message Summary:
[1,1] ----------------------
[1,1] PaddleCheckError: internal error in RPCClient at [/paddle/paddle/fluid/operators/distributed/parameter_prefetch.cc:129]
[1,1]   [operator < distributed_lookup_table > error]
[1,1] W1108 14:58:45.668053 7834 init.cc:212] *** Aborted at 1573196325 (unix time) try "date -d @1573196325" if you are using GNU date ***
[1,9] W1108 14:58:45.676427 7798 init.cc:212] *** SIGABRT (@0x1f800001dd5) received by PID 7637 (TID 0x7f121d7fb700) from PID 7637; stack trace: ***
[1,9]     @ 0x7f147e6af160 (unknown)
[1,9]     @ 0x7f147dc1d3f7 __GI_raise
[1,9]     @ 0x7f147dc1e7d8 __GI_abort
[1,9]     @ 0x7f1423173c65 __gnu_cxx::__verbose_terminate_handler()
[1,9]     @ 0x7f1423171e06 __cxxabiv1::__terminate()
[1,9]     @ 0x7f1423171e33 std::terminate()
[1,9]     @ 0x7f14231c4935 execute_native_thread_routine
[1,9]     @ 0x7f147e6a71c3 start_thread
[1,9]     @ 0x7f147dccf12d __clone
[1,9]     @ 0x0 (unknown)
[... ranks 1, 8, 11, 15 and 18 print the same raise/abort/std::terminate stack trace ...]
[1,2] terminate called recursively
[1,2] terminate called after throwing an instance of 'paddle::platform::EnforceNotMet'
[1,2]   what():
[1,2] --------------------------------------------
[1,2] C++ Call Stacks (More useful to developers):
[1,2] --------------------------------------------
[1,2] 0  std::string paddle::platform::GetTraceBackString<char const*>(char const*&&, char const*, int)
[1,2] 1  paddle::platform::EnforceNotMet::EnforceNotMet(std::__exception_ptr::exception_ptr, char const*, int)
[1,2] 2  paddle::operators::distributed::prefetch_core(std::vector<long> const&, std::vector<std::pair<std::string, std::string> > const&, std::vector<long> const&, paddle::framework::ExecutionContext const&, paddle::framework::Scope const&, std::unordered_map<long, std::vector<float> >*)
[1,2] 3  paddle::operators::distributed::prefetchs(std::vector<std::string> const&, std::vector<std::string> const&, std::string const&, bool, std::vector<std::string> const&, std::vector<std::string> const&, std::vector<long> const&, paddle::framework::ExecutionContext const&, paddle::framework::Scope const&)
[1,2] 4  paddle::operators::DistributedLookupTableKernel<float>::Compute(paddle::framework::ExecutionContext const&) const
[1,2] 5  std::_Function_handler<void (paddle::framework::ExecutionContext const&), paddle::framework::OpKernelRegistrarFunctor<paddle::platform::CPUPlace, false, 0ul, paddle::operators::DistributedLookupTableKernel<float> >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&)
[1,2] 6  paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, ...> const&, paddle::framework::RuntimeContext*) const
[1,2] 7  paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, ...> const&) const
[1,2] 8  paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, boost::variant<paddle::platform::CUDAPlace, paddle::platform::CPUPlace, paddle::platform::CUDAPinnedPlace, ...> const&)
[1,2] 9  paddle::framework::HogwildWorker::TrainFiles()
[1,2] ------------------------------------------
[1,2] Python Call Stacks (More useful to users):
[1,2] ------------------------------------------
[1,2]   File "/home/disk1/task_data/history/20191108/10.app-user-20191108132340-44623--shulei_msd_mmoe_dnn_v1_20191108_paddlecloud/logs/workspace/python27-gcc482/lib/python2.7/site-packages/paddle/fluid/framework.py", line 2444, in _insert_op
[1,2]     op = Operator(block=self, desc=op_desc, *args, **kwargs)
[1,2]   File "/home/disk1/task_data/history/20191108/10.app-user-20191108132340-44623--shulei_msd_mmoe_dnn_v1_20191108_paddlecloud/logs/workspace/python27-gcc482/lib/python2.7/site-packages/paddle/fluid/transpiler/distribute_transpiler.py", line 478, in _update_remote_sparse_update_op
[1,2]     "trainer_id": self.trainer_id
[1,2]   File "/home/disk1/task_data/history/20191108/10.app-user-20191108132340-44623--shulei_msd_mmoe_dnn_v1_20191108_paddlecloud/logs/workspace/python27-gcc482/lib/python2.7/site-packages/paddle/fluid/transpiler/distribute_transpiler.py", line 796, in transpile
[1,2]     self._update_remote_sparse_update_op(program, need_sparse_update_params)
[1,2]   File "/home/disk1/task_data/history/20191108/10.app-user-20191108132340-44623--shulei_msd_mmoe_dnn_v1_20191108_paddlecloud/logs/workspace/python27-gcc482/lib/python2.7/site-packages/paddle/fluid/incubate/fleet/parameter_server/distribute_transpiler/__init__.py", line 259, in _transpile
[1,2]     sync_mode=config.sync_mode)
[1,2]   File "/home/disk1/task_data/history/20191108/10.app-user-20191108132340-44623--shulei_msd_mmoe_dnn_v1_20191108_paddlecloud/logs/workspace/python27-gcc482/lib/python2.7/site-packages/paddle/fluid/incubate/fleet/parameter_server/distribute_transpiler/__init__.py", line 402, in minimize
[1,2]     fleet._transpile(config=self._strategy)
[1,2]   File "fleet_cluster_dnn_train.py", line 134, in train
[1,2]     optimizer.minimize(total_loss)
[1,2]   File "fleet_cluster_dnn_train.py", line 186, in <module>
[1,2]     train()
[1,2] ----------------------
[1,2] Error Message Summary:
[1,2] ----------------------
[1,2] PaddleCheckError: internal error in RPCClient at [/paddle/paddle/fluid/operators/distributed/parameter_prefetch.cc:129]
[1,2]   [operator < distributed_lookup_table > error]
[1,2]   what():
```

The MPI logs are available here: http://10.76.127.44:8910/fileview.html?type=logsdir&path=/&instance=0.app-user-20191108132340-44623--shulei_msd_mmoe_dnn_v1_20191108_paddlecloud