skyssj closed this issue 4 years ago
This may be caused by an old tracker directory that was not cleaned up. I fixed this in https://github.com/alibaba/graph-learn/pull/5 ; please try again. Checking python2.7.log should also help pinpoint the specific cause.
I applied the fix, but the problem persists. It looks like the --tracker directory cannot simply be removed; it has to be re-created, mkdir -p style.
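As a workaround on the user side, the tracker directory can be emptied and re-created before each run. The following is only a minimal sketch, not the fix from the pull request and not GL code: `reset_tracker` is a hypothetical helper name, and it assumes the tracker lives on a local filesystem that the launching user can write to. It avoids `exist_ok` so it also runs on Python 2.7.

```python
import errno
import os
import shutil


def reset_tracker(tracker_dir):
    """Empty tracker_dir but keep the directory itself.

    Roughly `mkdir -p tracker_dir && rm -rf tracker_dir/*`, except that
    hidden entries are removed as well.
    """
    try:
        os.makedirs(tracker_dir)  # creates missing parents, like mkdir -p
    except OSError as e:
        # An already existing directory is fine; anything else is a real error.
        if e.errno != errno.EEXIST or not os.path.isdir(tracker_dir):
            raise
    for name in os.listdir(tracker_dir):
        path = os.path.join(tracker_dir, name)
        if os.path.isdir(path) and not os.path.islink(path):
            shutil.rmtree(path)  # stale state subdirectory
        else:
            os.remove(path)      # stale file or symlink
    return tracker_dir
```

Calling `reset_tracker("./distributed")` once, before any ps or worker process is started, leaves an empty directory in place for the processes to populate.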
WARNING: Logging before InitGoogleLogging() is written to STDERR
E0402 16:29:28.791065 3999 local_file_system.cc:340] Create local directory failed: ./distributed/endpoints/
F0402 16:29:28.791113 3999 naming_engine.cc:58] Connect naming engine failed: ./distributed/endpoints/
*** Check failure stack trace: ***
############# Server init start #############
E0402 16:29:28.791867 3996 local_file_system.cc:340] Create local directory failed: ./distributed/endpoints/
F0402 16:29:28.792384 3996 naming_engine.cc:58] Connect naming engine failed: ./distributed/endpoints/
*** Check failure stack trace: ***
@ 0x7f08d89f619a google::LogMessage::Fail()
@ 0x7f08d89f60de google::LogMessage::SendToLog()
@ 0x7f08d89f59fc google::LogMessage::Flush()
@ 0x7f08d89f9549 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f08d89cfeaf graphlearn::NamingEngine::NamingEngine()
@ 0x7f08d89d02c4 graphlearn::NamingEngine::GetInstance()
@ 0x7f08d89d3917 graphlearn::DistributeService::DistributeService()
@ 0x7f08d89b4a55 graphlearn::ServerImpl::RegisterDistributeService()
@ 0x7f08d89b4b5e graphlearn::ServerImpl::Start()
@ 0x7f08d8ed3075 _ZZN8pybind1112cpp_function10initializeIZNS0_C4IvN10graphlearn6ServerEIEINS_4nameENS_9is_methodENS_7siblingEEEEMT0_FT_DpT1_EDpRKT2_EUlPS4_E_vISI_EIS5_S6_S7_EEEvOS9_PFS8_SB_ESH_ENUlRNS_6detail13function_callEE1_4_FUNESP_
@ 0x7f08d8ece039 pybind11::cpp_function::dispatcher()
@ 0x7f08e04e0577 PyEval_EvalFrameEx
@ 0x7f08e04e2a99 PyEval_EvalCodeEx
@ 0x7f08e04dff68 PyEval_EvalFrameEx
@ 0x7f08e04e2a99 PyEval_EvalCodeEx
@ 0x7f08e04dff68 PyEval_EvalFrameEx
@ 0x7f08e04e2a99 PyEval_EvalCodeEx
main
@ 0x7f08e04dff68 PyEval_EvalFrameEx
@ 0x7f08e04e2a99 PyEval_EvalCodeEx
WARNING: Logging before InitGoogleLogging() is written to STDERR
E0402 16:29:28.841421 3997 local_file_system.cc:340] Create local directory failed: ./distributed/endpoints/
F0402 16:29:28.841472 3997 naming_engine.cc:58] Connect naming engine failed: ./distributed/endpoints/
*** Check failure stack trace: ***
@ 0x7f08e04e2cba PyEval_EvalCode
@ 0x7f08e04fc01d run_mod
@ 0x7f08e04fd1c8 PyRun_FileExFlags
@ 0x7f08e04fe3e8 PyRun_SimpleFileExFlags
@ 0x7f08e051067c Py_Main
@ 0x7f08df733c05 __libc_start_main
@ 0x40071e (unknown)
main
############# Server init start #############
E0402 16:29:28.859108 3998 local_file_system.cc:340] Create local directory failed: ./distributed/endpoints/
F0402 16:29:28.859401 3998 naming_engine.cc:58] Connect naming engine failed: ./distributed/endpoints/
*** Check failure stack trace: ***
@ 0x7f368b96619a google::LogMessage::Fail()
@ 0x7f368b9660de google::LogMessage::SendToLog()
./run.sh: line 34: 3996 Aborted python dist_train.py --tracker=./distributed --ps_hosts=${PS_HOSTS} --worker_hosts=${WK_HOSTS} --job_name=ps --task_index=0
./run.sh: line 34: 3997 Aborted python dist_train.py --tracker=./distributed --ps_hosts=${PS_HOSTS} --worker_hosts=${WK_HOSTS} --job_name=worker --task_index=0
./run.sh: line 34: 3999 Aborted python dist_train.py --tracker=./distributed --ps_hosts=${PS_HOSTS} --worker_hosts=${WK_HOSTS} --job_name=worker --task_index=1
@ 0x7f368b9659fc google::LogMessage::Flush()
@ 0x7f368b969549 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f368b93feaf graphlearn::NamingEngine::NamingEngine()
@ 0x7f368b9402c4 graphlearn::NamingEngine::GetInstance()
@ 0x7f368b943917 graphlearn::DistributeService::DistributeService()
@ 0x7f368b924a55 graphlearn::ServerImpl::RegisterDistributeService()
@ 0x7f368b924b5e graphlearn::ServerImpl::Start()
@ 0x7f368be43075 _ZZN8pybind1112cpp_function10initializeIZNS0_C4IvN10graphlearn6ServerEIEINS_4nameENS_9is_methodENS_7siblingEEEEMT0_FT_DpT1_EDpRKT2_EUlPS4_E_vISI_EIS5_S6_S7_EEEvOS9_PFS8_SB_ESH_ENUlRNS_6detail13function_callEE1_4_FUNESP_
@ 0x7f368be3e039 pybind11::cpp_function::dispatcher()
@ 0x7f3693450577 PyEval_EvalFrameEx
@ 0x7f3693452a99 PyEval_EvalCodeEx
@ 0x7f369344ff68 PyEval_EvalFrameEx
@ 0x7f3693452a99 PyEval_EvalCodeEx
@ 0x7f369344ff68 PyEval_EvalFrameEx
@ 0x7f3693452a99 PyEval_EvalCodeEx
@ 0x7f369344ff68 PyEval_EvalFrameEx
@ 0x7f3693452a99 PyEval_EvalCodeEx
@ 0x7f3693452cba PyEval_EvalCode
@ 0x7f369346c01d run_mod
@ 0x7f369346d1c8 PyRun_FileExFlags
@ 0x7f369346e3e8 PyRun_SimpleFileExFlags
@ 0x7f369348067c Py_Main
@ 0x7f36926a3c05 __libc_start_main
@ 0x40071e (unknown)
BTW, I cleaned up the --tracker directory manually and got some interesting logs. Could this be caused by using a local filesystem instead of NFS?
graphlearn.xxxx.INFO.20200402-163642.5092
...
W0402 16:36:42.347764 5181 coordinator.cc:177] Counting states failed: start/, Internal:./distributed/start/ open failed
W0402 16:36:42.348039 5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:42.348053 5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
I0402 16:36:42.348098 5182 naming_engine.cc:159] Refresh endpoints count: 0
W0402 16:36:43.348107 5181 coordinator.cc:177] Counting states failed: start/, Internal:./distributed/start/ open failed
W0402 16:36:43.348150 5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:43.348160 5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
I0402 16:36:43.348214 5182 naming_engine.cc:159] Refresh endpoints count: 0
I0402 16:36:43.348330 5092 naming_engine.cc:100] Update endpoint id: 0, address: , filepath: ./distributed/endpoints/0
I0402 16:36:43.348413 5092 coordinator.cc:190] Coordinator sink start/
I0402 16:36:43.348430 5092 coordinator.cc:216] Sink ./distributed/start/0OK
I0402 16:36:44.348294 5181 coordinator.cc:190] Coordinator sink
I0402 16:36:44.348353 5181 coordinator.cc:216] Sink ./distributed/startedOK
I0402 16:36:44.348357 5181 coordinator.cc:106] Master sync started.
W0402 16:36:44.348363 5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:44.348378 5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:44.348367 5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:44.348412 5182 naming_engine.cc:154] Invalid endpoint file: 1
I0402 16:36:44.348418 5182 naming_engine.cc:159] Refresh endpoints count: 0
I0402 16:36:44.363948 5312 notification.cc:126] RpcNotification:Start req_type:UpdateEdges size:2
I0402 16:36:44.363970 5319 notification.cc:126] RpcNotification:Start req_type:UpdateEdges size:2
I0402 16:36:44.364365 5315 notification.cc:126] RpcNotification:Start req_type:UpdateEdges size:2
I0402 16:36:44.364580 5320 notification.cc:126] RpcNotification:Start req_type:UpdateEdges size:2
I0402 16:36:44.364948 5313 notification.cc:126] RpcNotification:Start req_type:UpdateEdges size:2
I0402 16:36:44.364956 5314 notification.cc:126] RpcNotification:Start req_type:UpdateEdges size:2
I0402 16:36:44.366432 5316 notification.cc:126] RpcNotification:Start req_type:UpdateEdges size:2
I0402 16:36:44.366991 5318 notification.cc:126] RpcNotification:Start req_type:UpdateEdges size:2
I0402 16:36:44.367408 5317 notification.cc:126] RpcNotification:Start req_type:UpdateEdges size:2
I0402 16:36:44.369257 5328 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges remote_id:0 total:2
I0402 16:36:44.369416 5322 notification.cc:126] RpcNotification:Start req_type:UpdateEdges size:2
W0402 16:36:44.369658 5332 channel_manager.cc:100] Waiting for all servers started: 0/2
I0402 16:36:44.369700 5330 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges remote_id:0 total:2
I0402 16:36:44.371842 5321 notification.cc:126] RpcNotification:Start req_type:UpdateEdges size:2
I0402 16:36:44.371852 5324 notification.cc:126] RpcNotification:Start req_type:UpdateEdges size:2
I0402 16:36:44.371871 5335 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges remote_id:0 total:2
I0402 16:36:44.371891 5323 notification.cc:126] RpcNotification:Start req_type:UpdateEdges size:2
I0402 16:36:44.371901 5325 notification.cc:126] RpcNotification:Start req_type:UpdateEdges size:2
I0402 16:36:44.371906 5326 notification.cc:126] RpcNotification:Start req_type:UpdateEdges size:2
I0402 16:36:44.371908 5340 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges remote_id:0 total:2
I0402 16:36:44.371932 5329 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges remote_id:0 total:2
I0402 16:36:44.371938 5335 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges remote_id:0 total:2
I0402 16:36:44.371976 5329 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges remote_id:0 total:2
I0402 16:36:44.372058 5341 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges remote_id:0 total:2
I0402 16:36:44.372062 5338 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges remote_id:0 total:2
I0402 16:36:44.372076 5334 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges remote_id:0 total:2
I0402 16:36:44.372140 5329 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges remote_id:0 total:2
I0402 16:36:44.372221 5338 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges remote_id:0 total:2
I0402 16:36:44.372298 5342 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges remote_id:0 total:2
I0402 16:36:44.372383 5329 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges remote_id:0 total:2
I0402 16:36:44.372404 5327 notification.cc:126] RpcNotification:Start req_type:UpdateEdges size:2
I0402 16:36:44.372439 5342 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges remote_id:0 total:2
I0402 16:36:44.372475 5342 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges remote_id:0 total:2
W0402 16:36:45.348466 5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:45.348598 5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:45.348598 5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:45.350410 5182 naming_engine.cc:154] Invalid endpoint file: 1
I0402 16:36:45.350420 5182 naming_engine.cc:159] Refresh endpoints count: 0
W0402 16:36:46.350499 5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:46.350539 5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:46.350548 5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:46.350564 5182 naming_engine.cc:154] Invalid endpoint file: 1
I0402 16:36:46.350572 5182 naming_engine.cc:159] Refresh endpoints count: 0
W0402 16:36:47.350625 5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:47.350672 5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:47.350684 5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:47.350706 5182 naming_engine.cc:154] Invalid endpoint file: 1
I0402 16:36:47.350713 5182 naming_engine.cc:159] Refresh endpoints count: 0
W0402 16:36:48.350780 5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:48.350838 5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:48.350852 5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:48.350885 5182 naming_engine.cc:154] Invalid endpoint file: 1
I0402 16:36:48.350893 5182 naming_engine.cc:159] Refresh endpoints count: 0
W0402 16:36:49.351042 5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:49.351099 5182 naming_engine.cc:154] Invalid endpoint file: 1
I0402 16:36:49.351107 5182 naming_engine.cc:159] Refresh endpoints count: 0
W0402 16:36:49.351315 5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:49.351348 5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:50.351255 5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:50.351312 5182 naming_engine.cc:154] Invalid endpoint file: 1
I0402 16:36:50.352686 5182 naming_engine.cc:159] Refresh endpoints count: 0
W0402 16:36:50.351442 5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:50.352722 5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:51.352807 5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:51.352880 5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:51.352929 5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:51.352946 5182 naming_engine.cc:154] Invalid endpoint file: 1
I0402 16:36:51.352957 5182 naming_engine.cc:159] Refresh endpoints count: 0
W0402 16:36:52.352988 5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:52.353049 5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
...
graphlearn.xxxx.WARNING.20200402-163642.5092
W0402 16:36:42.347764 5181 coordinator.cc:177] Counting states failed: start/, Internal:./distributed/start/ open failed
W0402 16:36:42.348039 5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:42.348053 5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:43.348107 5181 coordinator.cc:177] Counting states failed: start/, Internal:./distributed/start/ open failed
W0402 16:36:43.348150 5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:43.348160 5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:44.348363 5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:44.348378 5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:44.348367 5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:44.348412 5182 naming_engine.cc:154] Invalid endpoint file: 1
W0402 16:36:44.369658 5332 channel_manager.cc:100] Waiting for all servers started: 0/2
W0402 16:36:45.348466 5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:45.348598 5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:45.348598 5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:45.350410 5182 naming_engine.cc:154] Invalid endpoint file: 1
W0402 16:36:46.350499 5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:46.350539 5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:46.350548 5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:46.350564 5182 naming_engine.cc:154] Invalid endpoint file: 1
W0402 16:36:47.350625 5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:47.350672 5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:47.350684 5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:47.350706 5182 naming_engine.cc:154] Invalid endpoint file: 1
W0402 16:36:48.350780 5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:48.350838 5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:48.350852 5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:48.350885 5182 naming_engine.cc:154] Invalid endpoint file: 1
W0402 16:36:49.351042 5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:49.351099 5182 naming_engine.cc:154] Invalid endpoint file: 1
W0402 16:36:49.351315 5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:49.351348 5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:50.351255 5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:50.351312 5182 naming_engine.cc:154] Invalid endpoint file: 1
W0402 16:36:50.351442 5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:50.352722 5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:51.352807 5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:51.352880 5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:51.352929 5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:51.352946 5182 naming_engine.cc:154] Invalid endpoint file: 1
W0402 16:36:52.352988 5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:52.353049 5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:52.353111 5182 naming_engine.cc:154] Invalid endpoint file: 0
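Because the same few warnings repeat every second, it can help to tally them rather than read the log line by line. The sketch below parses lines in the format shown above (severity letter, date, time, thread id, `file:line]`, message) and counts the distinct warnings and errors. `tally_glog` and `GLOG_LINE` are illustrative names, and the regular expression is an assumption based only on the lines in this thread.

```python
import re
from collections import Counter

# Matches lines such as:
# W0402 16:36:44.348367 5182 naming_engine.cc:154] Invalid endpoint file: 0
GLOG_LINE = re.compile(
    r"^(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+"
    r"(?P<tid>\d+) (?P<src>[^\s:]+:\d+)\] (?P<msg>.*)$"
)


def tally_glog(lines, severities="WEF"):
    """Count log lines per (severity, source location, message)."""
    counts = Counter()
    for line in lines:
        m = GLOG_LINE.match(line.strip())
        # Lines that do not look like glog records (stack frames, banners)
        # are skipped silently.
        if m and m.group("sev") in severities:
            counts[(m.group("sev"), m.group("src"), m.group("msg"))] += 1
    return counts


sample = [
    "W0402 16:36:44.348367 5182 naming_engine.cc:154] Invalid endpoint file: 0",
    "W0402 16:36:45.348598 5182 naming_engine.cc:154] Invalid endpoint file: 0",
    "I0402 16:36:44.348418 5182 naming_engine.cc:159] Refresh endpoints count: 0",
    "W0402 16:36:44.348363 5181 coordinator.cc:177] Counting states failed: "
    "prepare/, Internal:./distributed/prepare/ open failed",
]
sample_counts = tally_glog(sample)
```

To run it on a real log file, pass the open file object: `tally_glog(open(path))`, then inspect `most_common()` on the result.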
The log shows that the GL tracker directory still has not been cleaned up. In your case, run rm -rf ./distributed/* to clean up the tracker. You may also need to add sleep 1 after python dist_train.py --index 0, so that TF and GL start and exit in the correct order.
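The staggered start suggested above could also be scripted in Python instead of run.sh. This is a sketch under assumptions: the flags are the ones visible in the aborted commands earlier in the thread, the host strings are placeholders, `build_commands` and `launch` are hypothetical helpers, and pausing after every process (not only after the first) is a simplification of the advice.

```python
import subprocess
import sys
import time


def build_commands(tracker, ps_hosts, worker_hosts, script="dist_train.py"):
    """Return one argv list per process: all ps tasks first, then all workers."""
    commands = []
    for job_name, hosts in (("ps", ps_hosts), ("worker", worker_hosts)):
        for task_index in range(len(hosts.split(","))):
            commands.append([
                sys.executable, script,
                "--tracker=" + tracker,
                "--ps_hosts=" + ps_hosts,
                "--worker_hosts=" + worker_hosts,
                "--job_name=" + job_name,
                "--task_index=%d" % task_index,
            ])
    return commands


def launch(commands, delay_seconds=1.0):
    """Start each process in order, pausing between launches."""
    procs = []
    for argv in commands:
        procs.append(subprocess.Popen(argv))
        time.sleep(delay_seconds)  # let the previous process register first
    return procs
```

For example, `launch(build_commands("./distributed", "127.0.0.1:2222", "127.0.0.1:2223,127.0.0.1:2224"))` would start one ps and two workers about one second apart; the addresses here are made up for illustration.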
First of all, thank you for open-sourcing such an amazing project.
I tried to follow THIS manual to run distributed training on a single machine, but the training process fails to start.
Here is my script to start the ps and worker processes.
I also added some logging to the Graph.init() function ( https://github.com/alibaba/graph-learn/blob/master/graphlearn/python/graph.py ), but "############# Server init done #############" is never printed.
Everything I could collect is listed below. It keeps printing
Invalid endpoint file: 0
indefinitely. Any clue? Thank you!