alibaba / graph-learn

An Industrial Graph Neural Network Framework
Apache License 2.0

Questions about graphsage dist_train.py #4

Closed: skyssj closed this issue 4 years ago

skyssj commented 4 years ago

First of all, thank you for open-sourcing such an amazing project.

I tried to follow this manual to play with distributed training on a single machine, but failed to start the training process.

Here is my script to start the ps and worker processes.

PS_HOSTS="127.0.0.1:2300,127.0.0.1:2311"
WK_HOSTS="127.0.0.1:2200,127.0.0.1:2222"

python dist_train.py \
  --tracker=./distributed \
  --ps_hosts=${PS_HOSTS} \
  --worker_hosts=${WK_HOSTS} \
  --job_name=ps \
  --task_index=0 &

python dist_train.py \
  --tracker=./distributed \
  --ps_hosts=${PS_HOSTS} \
  --worker_hosts=${WK_HOSTS} \
  --job_name=worker \
  --task_index=0 &

python dist_train.py \
  --tracker=./distributed \
  --ps_hosts=${PS_HOSTS} \
  --worker_hosts=${WK_HOSTS} \
  --job_name=ps \
  --task_index=1 &

python dist_train.py \
  --tracker=./distributed \
  --ps_hosts=${PS_HOSTS} \
  --worker_hosts=${WK_HOSTS} \
  --job_name=worker \
  --task_index=1 &

wait

I also added some logging to the Graph.init() function ( https://github.com/alibaba/graph-learn/blob/master/graphlearn/python/graph.py ), but "############# Server init done #############" is never printed.


    if job_name == "client":
      pywrap.set_client_id(task_index)
      self._client = pywrap.rpc_client()
      self._server = None
    else:
      print("############# Server init start #############")
      if job_name == "server":
        self._client = None
      if not tracker and kwargs.get("tracker"):
        tracker = kwargs["tracker"]
      if tracker:
        self._server = Server(task_index, server_count, tracker)
      else:
        self._server = Server(task_index, server_count)
      self._server.start()
      print("############# Server start done #############")
      self._server.init(self._edge_sources, self._node_sources)
      print("############# Server init done #############")
    return self

Everything I get is listed below; it keeps printing Invalid endpoint file: 0 till the end of the world.

main                                                                                                                                        
WARNING: Logging before InitGoogleLogging() is written to STDERR                                                                            
I0402 13:10:49.755939 10816 naming_engine.cc:56] Connect naming engine ok: ./distributed/endpoints/
I0402 13:10:49.756223 10816 channel_manager.cc:94] Auto select server: 1  
W0402 13:10:49.756240 10816 channel_manager.cc:100] Waiting for all servers started: 0/2
W0402 13:10:49.756494 10904 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:49.756530 10904 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:49.756541 10904 naming_engine.cc:159] Refresh endpoints count: 0
2020-04-02 13:10:49.771019: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA                                                                                                            
2020-04-02 13:10:49.777325: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:2300, 1 -> 127.0.0.1:2311}                                                                                                          
2020-04-02 13:10:49.777366: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> 127.0.0.1:2200, 1 -> localhost:2222}                                                                                                      
2020-04-02 13:10:49.784454: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc://localhost:2222                                                                                                                                          
main                                                                                                                                        
WARNING: Logging before InitGoogleLogging() is written to STDERR                                                                            
I0402 13:10:49.878661 10814 naming_engine.cc:56] Connect naming engine ok: ./distributed/endpoints/
I0402 13:10:49.878902 10814 channel_manager.cc:94] Auto select server: 0  
W0402 13:10:49.878921 10814 channel_manager.cc:100] Waiting for all servers started: 0/2
W0402 13:10:49.880380 10951 naming_engine.cc:154] Invalid endpoint file: 0
main                                                                                                                                        
W0402 13:10:49.880429 10951 naming_engine.cc:154] Invalid endpoint file: 1  
I0402 13:10:49.880441 10951 naming_engine.cc:159] Refresh endpoints count: 0
############# Server init start #############                                                                                               
2020-04-02 13:10:49.894944: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA                                                                                                            
2020-04-02 13:10:49.900562: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> 127.0.0.1:2300, 1 -> 127.0.0.1:2311}                                                                                                          
2020-04-02 13:10:49.900591: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2200, 1 -> 127.0.0.1:2222}                                                                                                      
2020-04-02 13:10:49.901519: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc://localhost:2200                                                                                                                                          
main                                                                                                                                        
############# Server init start #############                                                                                               
W0402 13:10:50.756636 10904 naming_engine.cc:154] Invalid endpoint file: 0  
W0402 13:10:50.756687 10904 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:50.756696 10904 naming_engine.cc:159] Refresh endpoints count: 0
W0402 13:10:50.880582 10951 naming_engine.cc:154] Invalid endpoint file: 0  
W0402 13:10:50.880635 10951 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:50.880697 10951 naming_engine.cc:159] Refresh endpoints count: 0
[2020-04-02 13:10:50.888773] Server started.                                                                                                
############# Server start done #############                                                                                                                                                                                                                                            
[2020-04-02 13:10:50.985136] Server started.                                                                                                                                                                                                                                             
############# Server start done #############                                                                                                                                                                                                                                            
W0402 13:10:51.756803 10904 naming_engine.cc:154] Invalid endpoint file: 0                                                                                                                                                                                                               
W0402 13:10:51.756860 10904 naming_engine.cc:154] Invalid endpoint file: 1                                                                  
I0402 13:10:51.756868 10904 naming_engine.cc:159] Refresh endpoints count: 0
W0402 13:10:51.880851 10951 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:51.880900 10951 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:51.880908 10951 naming_engine.cc:159] Refresh endpoints count: 0
W0402 13:10:52.756978 10904 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:52.757043 10904 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:52.757053 10904 naming_engine.cc:159] Refresh endpoints count: 0
W0402 13:10:52.881058 10951 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:52.881108 10951 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:52.881115 10951 naming_engine.cc:159] Refresh endpoints count: 0
W0402 13:10:53.757174 10904 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:53.757233 10904 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:53.757244 10904 naming_engine.cc:159] Refresh endpoints count: 0
W0402 13:10:53.881242 10951 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:53.881289 10951 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:53.881297 10951 naming_engine.cc:159] Refresh endpoints count: 0
W0402 13:10:54.757366 10904 naming_engine.cc:154] Invalid endpoint file: 0
W0402 13:10:54.757421 10904 naming_engine.cc:154] Invalid endpoint file: 1
I0402 13:10:54.757429 10904 naming_engine.cc:159] Refresh endpoints count: 0

Any clue? Thank you!

baoleai commented 4 years ago

This may be caused by the old tracker not being cleaned up. I fixed this in https://github.com/alibaba/graph-learn/pull/5 , so you can try again. It should also help to check the specific cause in python2.7.log.

skyssj commented 4 years ago

I applied the fix, but the problem remains. It looks like you cannot simply remove the --tracker directory; a mkdir -p-like creation is needed.

WARNING: Logging before InitGoogleLogging() is written to STDERR
E0402 16:29:28.791065  3999 local_file_system.cc:340] Create local directory failed: ./distributed/endpoints/
F0402 16:29:28.791113  3999 naming_engine.cc:58] Connect naming engine failed: ./distributed/endpoints/
*** Check failure stack trace: ***
############# Server init start #############
E0402 16:29:28.791867  3996 local_file_system.cc:340] Create local directory failed: ./distributed/endpoints/
F0402 16:29:28.792384  3996 naming_engine.cc:58] Connect naming engine failed: ./distributed/endpoints/
*** Check failure stack trace: ***
    @     0x7f08d89f619a  google::LogMessage::Fail()
    @     0x7f08d89f60de  google::LogMessage::SendToLog()
    @     0x7f08d89f59fc  google::LogMessage::Flush()
    @     0x7f08d89f9549  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f08d89cfeaf  graphlearn::NamingEngine::NamingEngine()
    @     0x7f08d89d02c4  graphlearn::NamingEngine::GetInstance()
    @     0x7f08d89d3917  graphlearn::DistributeService::DistributeService()
    @     0x7f08d89b4a55  graphlearn::ServerImpl::RegisterDistributeService()
    @     0x7f08d89b4b5e  graphlearn::ServerImpl::Start()
    @     0x7f08d8ed3075  _ZZN8pybind1112cpp_function10initializeIZNS0_C4IvN10graphlearn6ServerEIEINS_4nameENS_9is_methodENS_7siblingEEEEMT0_FT_DpT1_EDpRKT2_EUlPS4_E_vISI_EIS5_S6_S7_EEEvOS9_PFS8_SB_ESH_ENUlRNS_6detail13function_callEE1_4_FUNESP_
    @     0x7f08d8ece039  pybind11::cpp_function::dispatcher()
    @     0x7f08e04e0577  PyEval_EvalFrameEx
    @     0x7f08e04e2a99  PyEval_EvalCodeEx
    @     0x7f08e04dff68  PyEval_EvalFrameEx
    @     0x7f08e04e2a99  PyEval_EvalCodeEx
    @     0x7f08e04dff68  PyEval_EvalFrameEx
    @     0x7f08e04e2a99  PyEval_EvalCodeEx
main
    @     0x7f08e04dff68  PyEval_EvalFrameEx
    @     0x7f08e04e2a99  PyEval_EvalCodeEx
WARNING: Logging before InitGoogleLogging() is written to STDERR
E0402 16:29:28.841421  3997 local_file_system.cc:340] Create local directory failed: ./distributed/endpoints/
F0402 16:29:28.841472  3997 naming_engine.cc:58] Connect naming engine failed: ./distributed/endpoints/
*** Check failure stack trace: ***
    @     0x7f08e04e2cba  PyEval_EvalCode
    @     0x7f08e04fc01d  run_mod
    @     0x7f08e04fd1c8  PyRun_FileExFlags
    @     0x7f08e04fe3e8  PyRun_SimpleFileExFlags
    @     0x7f08e051067c  Py_Main
    @     0x7f08df733c05  __libc_start_main
    @           0x40071e  (unknown)
main
############# Server init start #############
E0402 16:29:28.859108  3998 local_file_system.cc:340] Create local directory failed: ./distributed/endpoints/
F0402 16:29:28.859401  3998 naming_engine.cc:58] Connect naming engine failed: ./distributed/endpoints/
*** Check failure stack trace: ***
    @     0x7f368b96619a  google::LogMessage::Fail()
    @     0x7f368b9660de  google::LogMessage::SendToLog()
./run.sh: line 34:  3996 Aborted                 python dist_train.py --tracker=./distributed --ps_hosts=${PS_HOSTS} --worker_hosts=${WK_HOSTS} --job_name=ps --task_index=0
./run.sh: line 34:  3997 Aborted                 python dist_train.py --tracker=./distributed --ps_hosts=${PS_HOSTS} --worker_hosts=${WK_HOSTS} --job_name=worker --task_index=0
./run.sh: line 34:  3999 Aborted                 python dist_train.py --tracker=./distributed --ps_hosts=${PS_HOSTS} --worker_hosts=${WK_HOSTS} --job_name=worker --task_index=1
    @     0x7f368b9659fc  google::LogMessage::Flush()
    @     0x7f368b969549  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f368b93feaf  graphlearn::NamingEngine::NamingEngine()
    @     0x7f368b9402c4  graphlearn::NamingEngine::GetInstance()
    @     0x7f368b943917  graphlearn::DistributeService::DistributeService()
    @     0x7f368b924a55  graphlearn::ServerImpl::RegisterDistributeService()
    @     0x7f368b924b5e  graphlearn::ServerImpl::Start()
    @     0x7f368be43075  _ZZN8pybind1112cpp_function10initializeIZNS0_C4IvN10graphlearn6ServerEIEINS_4nameENS_9is_methodENS_7siblingEEEEMT0_FT_DpT1_EDpRKT2_EUlPS4_E_vISI_EIS5_S6_S7_EEEvOS9_PFS8_SB_ESH_ENUlRNS_6detail13function_callEE1_4_FUNESP_
    @     0x7f368be3e039  pybind11::cpp_function::dispatcher()
    @     0x7f3693450577  PyEval_EvalFrameEx
    @     0x7f3693452a99  PyEval_EvalCodeEx
    @     0x7f369344ff68  PyEval_EvalFrameEx
    @     0x7f3693452a99  PyEval_EvalCodeEx
    @     0x7f369344ff68  PyEval_EvalFrameEx
    @     0x7f3693452a99  PyEval_EvalCodeEx
    @     0x7f369344ff68  PyEval_EvalFrameEx
    @     0x7f3693452a99  PyEval_EvalCodeEx
    @     0x7f3693452cba  PyEval_EvalCode
    @     0x7f369346c01d  run_mod
    @     0x7f369346d1c8  PyRun_FileExFlags
    @     0x7f369346e3e8  PyRun_SimpleFileExFlags
    @     0x7f369348067c  Py_Main
    @     0x7f36926a3c05  __libc_start_main
    @           0x40071e  (unknown)

BTW, I cleaned up the --tracker directory manually and got some interesting logs. Is this caused by using a local filesystem instead of NFS?

graphlearn.xxxx.INFO.20200402-163642.5092

...

W0402 16:36:42.347764  5181 coordinator.cc:177] Counting states failed: start/, Internal:./distributed/start/ open failed
W0402 16:36:42.348039  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:42.348053  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
I0402 16:36:42.348098  5182 naming_engine.cc:159] Refresh endpoints count: 0
W0402 16:36:43.348107  5181 coordinator.cc:177] Counting states failed: start/, Internal:./distributed/start/ open failed
W0402 16:36:43.348150  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:43.348160  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
I0402 16:36:43.348214  5182 naming_engine.cc:159] Refresh endpoints count: 0
I0402 16:36:43.348330  5092 naming_engine.cc:100] Update endpoint id: 0, address: , filepath: ./distributed/endpoints/0
I0402 16:36:43.348413  5092 coordinator.cc:190] Coordinator sink start/
I0402 16:36:43.348430  5092 coordinator.cc:216] Sink ./distributed/start/0OK
I0402 16:36:44.348294  5181 coordinator.cc:190] Coordinator sink 
I0402 16:36:44.348353  5181 coordinator.cc:216] Sink ./distributed/startedOK
I0402 16:36:44.348357  5181 coordinator.cc:106] Master sync started.
W0402 16:36:44.348363  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:44.348378  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:44.348367  5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:44.348412  5182 naming_engine.cc:154] Invalid endpoint file: 1
I0402 16:36:44.348418  5182 naming_engine.cc:159] Refresh endpoints count: 0
I0402 16:36:44.363948  5312 notification.cc:126] RpcNotification:Start  req_type:UpdateEdges    size:2
I0402 16:36:44.363970  5319 notification.cc:126] RpcNotification:Start  req_type:UpdateEdges    size:2
I0402 16:36:44.364365  5315 notification.cc:126] RpcNotification:Start  req_type:UpdateEdges    size:2
I0402 16:36:44.364580  5320 notification.cc:126] RpcNotification:Start  req_type:UpdateEdges    size:2
I0402 16:36:44.364948  5313 notification.cc:126] RpcNotification:Start  req_type:UpdateEdges    size:2
I0402 16:36:44.364956  5314 notification.cc:126] RpcNotification:Start  req_type:UpdateEdges    size:2
I0402 16:36:44.366432  5316 notification.cc:126] RpcNotification:Start  req_type:UpdateEdges    size:2
I0402 16:36:44.366991  5318 notification.cc:126] RpcNotification:Start  req_type:UpdateEdges    size:2
I0402 16:36:44.367408  5317 notification.cc:126] RpcNotification:Start  req_type:UpdateEdges    size:2
I0402 16:36:44.369257  5328 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges    remote_id:0     total:2
I0402 16:36:44.369416  5322 notification.cc:126] RpcNotification:Start  req_type:UpdateEdges    size:2
W0402 16:36:44.369658  5332 channel_manager.cc:100] Waiting for all servers started: 0/2
I0402 16:36:44.369700  5330 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges    remote_id:0     total:2
I0402 16:36:44.371842  5321 notification.cc:126] RpcNotification:Start  req_type:UpdateEdges    size:2
I0402 16:36:44.371852  5324 notification.cc:126] RpcNotification:Start  req_type:UpdateEdges    size:2
I0402 16:36:44.371871  5335 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges    remote_id:0     total:2
I0402 16:36:44.371891  5323 notification.cc:126] RpcNotification:Start  req_type:UpdateEdges    size:2
I0402 16:36:44.371901  5325 notification.cc:126] RpcNotification:Start  req_type:UpdateEdges    size:2
I0402 16:36:44.371906  5326 notification.cc:126] RpcNotification:Start  req_type:UpdateEdges    size:2
I0402 16:36:44.371908  5340 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges    remote_id:0     total:2
I0402 16:36:44.371932  5329 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges    remote_id:0     total:2
I0402 16:36:44.371938  5335 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges    remote_id:0     total:2
I0402 16:36:44.371976  5329 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges    remote_id:0     total:2
I0402 16:36:44.372058  5341 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges    remote_id:0     total:2
I0402 16:36:44.372062  5338 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges    remote_id:0     total:2
I0402 16:36:44.372076  5334 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges    remote_id:0     total:2
I0402 16:36:44.372140  5329 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges    remote_id:0     total:2
I0402 16:36:44.372221  5338 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges    remote_id:0     total:2
I0402 16:36:44.372298  5342 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges    remote_id:0     total:2
I0402 16:36:44.372383  5329 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges    remote_id:0     total:2
I0402 16:36:44.372404  5327 notification.cc:126] RpcNotification:Start  req_type:UpdateEdges    size:2
I0402 16:36:44.372439  5342 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges    remote_id:0     total:2
I0402 16:36:44.372475  5342 notification.cc:149] RpcNotification:Notify req_type:UpdateEdges    remote_id:0     total:2
W0402 16:36:45.348466  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:45.348598  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:45.348598  5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:45.350410  5182 naming_engine.cc:154] Invalid endpoint file: 1
I0402 16:36:45.350420  5182 naming_engine.cc:159] Refresh endpoints count: 0
W0402 16:36:46.350499  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:46.350539  5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:46.350548  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:46.350564  5182 naming_engine.cc:154] Invalid endpoint file: 1
I0402 16:36:46.350572  5182 naming_engine.cc:159] Refresh endpoints count: 0
W0402 16:36:47.350625  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:47.350672  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:47.350684  5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:47.350706  5182 naming_engine.cc:154] Invalid endpoint file: 1
I0402 16:36:47.350713  5182 naming_engine.cc:159] Refresh endpoints count: 0
W0402 16:36:48.350780  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:48.350838  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:48.350852  5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:48.350885  5182 naming_engine.cc:154] Invalid endpoint file: 1
I0402 16:36:48.350893  5182 naming_engine.cc:159] Refresh endpoints count: 0
W0402 16:36:49.351042  5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:49.351099  5182 naming_engine.cc:154] Invalid endpoint file: 1
I0402 16:36:49.351107  5182 naming_engine.cc:159] Refresh endpoints count: 0
W0402 16:36:49.351315  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:49.351348  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:50.351255  5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:50.351312  5182 naming_engine.cc:154] Invalid endpoint file: 1
I0402 16:36:50.352686  5182 naming_engine.cc:159] Refresh endpoints count: 0
W0402 16:36:50.351442  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:50.352722  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:51.352807  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:51.352880  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:51.352929  5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:51.352946  5182 naming_engine.cc:154] Invalid endpoint file: 1
I0402 16:36:51.352957  5182 naming_engine.cc:159] Refresh endpoints count: 0
W0402 16:36:52.352988  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:52.353049  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed

...

graphlearn.xxxx.WARNING.20200402-163642.5092

W0402 16:36:42.347764  5181 coordinator.cc:177] Counting states failed: start/, Internal:./distributed/start/ open failed    
W0402 16:36:42.348039  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:42.348053  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed      
W0402 16:36:43.348107  5181 coordinator.cc:177] Counting states failed: start/, Internal:./distributed/start/ open failed
W0402 16:36:43.348150  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:43.348160  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:44.348363  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:44.348378  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:44.348367  5182 naming_engine.cc:154] Invalid endpoint file: 0                                                   
W0402 16:36:44.348412  5182 naming_engine.cc:154] Invalid endpoint file: 1                                             
W0402 16:36:44.369658  5332 channel_manager.cc:100] Waiting for all servers started: 0/2                                     
W0402 16:36:45.348466  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:45.348598  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed      
W0402 16:36:45.348598  5182 naming_engine.cc:154] Invalid endpoint file: 0                                             
W0402 16:36:45.350410  5182 naming_engine.cc:154] Invalid endpoint file: 1                                                   
W0402 16:36:46.350499  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:46.350539  5182 naming_engine.cc:154] Invalid endpoint file: 0                                                   
W0402 16:36:46.350548  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed
W0402 16:36:46.350564  5182 naming_engine.cc:154] Invalid endpoint file: 1                                                   
W0402 16:36:47.350625  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:47.350672  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed      
W0402 16:36:47.350684  5182 naming_engine.cc:154] Invalid endpoint file: 0                                             
W0402 16:36:47.350706  5182 naming_engine.cc:154] Invalid endpoint file: 1
W0402 16:36:48.350780  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:48.350838  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed      
W0402 16:36:48.350852  5182 naming_engine.cc:154] Invalid endpoint file: 0                                             
W0402 16:36:48.350885  5182 naming_engine.cc:154] Invalid endpoint file: 1                                                   
W0402 16:36:49.351042  5182 naming_engine.cc:154] Invalid endpoint file: 0
W0402 16:36:49.351099  5182 naming_engine.cc:154] Invalid endpoint file: 1                                                   
W0402 16:36:49.351315  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:49.351348  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed      
W0402 16:36:50.351255  5182 naming_engine.cc:154] Invalid endpoint file: 0                                             
W0402 16:36:50.351312  5182 naming_engine.cc:154] Invalid endpoint file: 1                                                   
W0402 16:36:50.351442  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:50.352722  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed                     
W0402 16:36:51.352807  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:51.352880  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed                     
W0402 16:36:51.352929  5182 naming_engine.cc:154] Invalid endpoint file: 0                                                                  
W0402 16:36:51.352946  5182 naming_engine.cc:154] Invalid endpoint file: 1                                                                  
W0402 16:36:52.352988  5181 coordinator.cc:177] Counting states failed: prepare/, Internal:./distributed/prepare/ open failed
W0402 16:36:52.353049  5181 coordinator.cc:177] Counting states failed: stop/, Internal:./distributed/stop/ open failed      
W0402 16:36:52.353111  5182 naming_engine.cc:154] Invalid endpoint file: 0                                           

baoleai commented 4 years ago

The log shows that the GL tracker directory still has not been cleaned up. In your case, run rm -rf ./distributed/* to clean up the tracker. You may also need to add sleep 1 after python dist_train.py --index 0, so that TF and GL can start and exit in the correct order.
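
For reference, the two suggestions can be folded into the launch script from the first post. This is only a sketch, not a verified fix: `reset_tracker` and `launch` are hypothetical helper names, the `mkdir -p` step follows skyssj's observation that the directory has to exist, and "sleep 1 after index 0" is read here as pausing once both index-0 processes have been started.

```shell
#!/usr/bin/env bash
# Sketch only: clean the tracker directory, then start index 0 before index 1.
TRACKER=./distributed
PS_HOSTS="127.0.0.1:2300,127.0.0.1:2311"
WK_HOSTS="127.0.0.1:2200,127.0.0.1:2222"

# Hypothetical helper: create the tracker directory if it is missing,
# then remove whatever an earlier run left inside it.
reset_tracker() {
  mkdir -p "$1"
  rm -rf "${1:?}"/*
}

# Hypothetical helper: launch <job_name> <task_index> in the background,
# using the same flags as the original script.
launch() {
  python dist_train.py \
    --tracker="${TRACKER}" \
    --ps_hosts="${PS_HOSTS}" \
    --worker_hosts="${WK_HOSTS}" \
    --job_name="$1" \
    --task_index="$2" &
}

reset_tracker "${TRACKER}"

launch ps 0
launch worker 0
sleep 1   # assumption: the pause goes after the index-0 processes
launch ps 1
launch worker 1

wait
```

Note that `rm -rf ./distributed/*` leaves hidden files in place, and the one-second pause is the value suggested above rather than a tuned number.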