Closed skyssj closed 4 years ago
The INFO log shows that the IP address needed for graph-learn is empty:
I20200405 11:27:22.089519 21023 naming_engine.cc:100] Update endpoint id: 0, address: , filepath: /tmp/graphlearn/endpoints/0
This indicates that GetLocalEndpoint returns "". You can check whether this function returns the correct result in your environment.
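One way to run that check outside graph-learn is to ask the operating system which local address it would advertise. The sketch below is not how GetLocalEndpoint is implemented; `resolve_local_ip` and the probe address are hypothetical names and values chosen for illustration. It only shows whether the machine can resolve a usable non-loopback IPv4 address at all. An empty or loopback result would suggest a hostname or /etc/hosts problem on that machine.

```python
import socket


def resolve_local_ip(probe_addr=("8.8.8.8", 80)):
    """Return a best-guess local IPv4 address, or "" if none can be found.

    Illustration only: this is an assumption about how one might look up
    the local address, not graph-learn's own logic.
    """
    # First attempt: resolve this machine's hostname.
    try:
        ip = socket.gethostbyname(socket.gethostname())
        if ip and not ip.startswith("127."):
            return ip
    except OSError:
        pass  # hostname is not resolvable; fall through to the next attempt

    # Second attempt: connect() on a UDP socket sends no packets, but it makes
    # the OS choose the outgoing interface, whose address we can then read.
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.connect(probe_addr)
            return s.getsockname()[0]
    except OSError:
        return ""


if __name__ == "__main__":
    ip = resolve_local_ip()
    print("local address:", ip if ip else "<empty>")
```

If this prints an empty value or a 127.x.x.x address on the machine running the server, the empty `address:` field in the log above is likely an environment issue rather than a graph-learn one.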
@baoleai When I try to set this up in a distributed setting, for example on two machines with different IPs, the log keeps repeating:
I0720 12:13:39.899207 17530 naming_engine.cc:159] Refresh endpoints count: 2
I0720 12:13:40.899502 17530 naming_engine.cc:159] Refresh endpoints count: 2
I0720 12:13:41.899794 17530 naming_engine.cc:159] Refresh endpoints count: 2
I0720 12:13:42.900085 17530 naming_engine.cc:159] Refresh endpoints count: 2
2020-07-20 12:13:43.369141: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2020-07-20 12:13:43.369198: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:1
I have checked that my task_idx values match the ps_hosts and worker_hosts, and I have also turned off the firewall on both of my computers.
I have also tried running two ps and two workers on the same physical machine with different ports. It runs, but I get messages like:
Epoch 39, Iteration 0, Time(s) 0.0530, Loss 0.88920
Epoch 39, Iteration 1, Time(s) 0.0529, Loss 0.59737
Epoch 39, Iteration 2, Time(s) 0.0523, Loss 0.77411
Epoch 39, Iteration 3, Time(s) 0.0497, Loss 0.79809
Epoch 39, Iteration 4, Time(s) 0.0540, Loss 0.45329
Epoch 39, Iteration 5, Time(s) 0.0521, Loss 0.98397
Epoch 39, Iteration 6, Time(s) 0.0504, Loss 0.71765
E0720 11:56:27.316502 13822 notification.cc:194] RpcNotification:Failed req_type:GetNodes status:Out of range:No more nodes exist.
E0720 11:56:27.316629 13822 distribute_runner.h:125] Rpc failed:Out of range:No more nodes exist.name:GetNodes
Could you please help me figure it out? Thanks.
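For the two-machine case, a quick way to rule out networking is to confirm from each machine that every address in ps_hosts and worker_hosts accepts TCP connections once the processes are started. This is a minimal sketch under the assumption that the host lists are comma-separated "host:port" strings; `parse_hosts` and `is_reachable` are hypothetical helpers, not part of graph-learn or TensorFlow.

```python
import socket


def parse_hosts(hosts):
    """Split a comma-separated "host:port" string into (host, port) pairs."""
    pairs = []
    for item in hosts.split(","):
        item = item.strip()
        if not item:
            continue
        # rpartition splits on the last ":" so the port is always the tail.
        host, _, port = item.rpartition(":")
        pairs.append((host, int(port)))
    return pairs


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage (the addresses are placeholders, substitute your own lists):
# for host, port in parse_hosts("10.0.0.1:2222,10.0.0.2:2222"):
#     print(host, port, "reachable" if is_reachable(host, port) else "NOT reachable")
```

If a ps address is not reachable from the worker machine, that would be consistent with the "CreateSession still waiting for response from worker" messages above, although other causes are possible.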
I have the same problem. Do you have a solution?
After applying the fixes (#4, #11), I rebuilt and reinstalled graph-learn and used the commands below to start dist_train.py, but the problem persists.
Stdout & Stderr: stdout&stderr.txt
Server logs:
graphlearn.VM_10_224_centos.ced.log.WARNING.20200405-112725.21131.log
graphlearn.VM_10_224_centos.ced.log.WARNING.20200405-112721.21023.log
graphlearn.VM_10_224_centos.ced.log.INFO.20200405-112721.21023.log
graphlearn.VM_10_224_centos.ced.log.INFO.20200405-112725.21131.log