alibaba / graph-learn

An Industrial Graph Neural Network Framework
Apache License 2.0

Still cannot run graphsage dist_train locally (#4) #14

Closed skyssj closed 4 years ago

skyssj commented 4 years ago

After applying the fixes from #4 and #11, I rebuilt and reinstalled graph-learn and used the commands below to start dist_train.py, but the problem persists.

# Two parameter servers and two workers, all on localhost.
PS_HOSTS="127.0.0.1:2300,127.0.0.1:2311"
WK_HOSTS="127.0.0.1:2200,127.0.0.1:2222"

# Tracker directory used for endpoint discovery; start from a clean state.
TRACK_DIR="/tmp/graphlearn/"
rm -rf ${TRACK_DIR}
mkdir -p ${TRACK_DIR}

# Parameter server 0
python dist_train.py \
  --tracker=${TRACK_DIR} \
  --ps_hosts=${PS_HOSTS} \
  --worker_hosts=${WK_HOSTS} \
  --job_name=ps \
  --task_index=0 &

sleep 2

# Worker 0
python dist_train.py \
  --tracker=${TRACK_DIR} \
  --ps_hosts=${PS_HOSTS} \
  --worker_hosts=${WK_HOSTS} \
  --job_name=worker \
  --task_index=0 &

sleep 2

# Parameter server 1
python dist_train.py \
  --tracker=${TRACK_DIR} \
  --ps_hosts=${PS_HOSTS} \
  --worker_hosts=${WK_HOSTS} \
  --job_name=ps \
  --task_index=1 &

sleep 2

# Worker 1
python dist_train.py \
  --tracker=${TRACK_DIR} \
  --ps_hosts=${PS_HOSTS} \
  --worker_hosts=${WK_HOSTS} \
  --job_name=worker \
  --task_index=1 &

wait

Stdout & stderr: stdout&stderr.txt

Server logs:
graphlearn.VM_10_224_centos.ced.log.WARNING.20200405-112725.21131.log
graphlearn.VM_10_224_centos.ced.log.WARNING.20200405-112721.21023.log
graphlearn.VM_10_224_centos.ced.log.INFO.20200405-112721.21023.log
graphlearn.VM_10_224_centos.ced.log.INFO.20200405-112725.21131.log

baoleai commented 4 years ago

The INFO log shows that the IP address needed by graph-learn is empty:

I20200405 11:27:22.089519 21023 naming_engine.cc:100] Update endpoint id: 0, address: , filepath: /tmp/graphlearn/endpoints/0

This indicates that GetLocalEndpoint returned an empty string. Please check whether this function returns the correct result in your environment.
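For reference, a minimal Python sketch of that check (this mimics the usual resolve-my-own-hostname step such a helper performs; the actual C++ logic of GetLocalEndpoint may differ):

import socket

# Minimal sketch: resolve the machine's own hostname, the typical step a
# GetLocalEndpoint-style helper performs. An empty or unexpected loopback
# result here means the endpoint file would be written without an address.
hostname = socket.gethostname()
try:
    ip = socket.gethostbyname(hostname)
except socket.gaierror:
    ip = ""
print("hostname: %s, resolved ip: %s" % (hostname, ip))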

YukeWang96 commented 4 years ago

@baoleai when I try to set this up in a distributed setting (for example, two machines with different IPs), it always says:

I0720 12:13:39.899207 17530 naming_engine.cc:159] Refresh endpoints count: 2
I0720 12:13:40.899502 17530 naming_engine.cc:159] Refresh endpoints count: 2
I0720 12:13:41.899794 17530 naming_engine.cc:159] Refresh endpoints count: 2
I0720 12:13:42.900085 17530 naming_engine.cc:159] Refresh endpoints count: 2
2020-07-20 12:13:43.369141: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
2020-07-20 12:13:43.369198: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:1

And I have checked my task_index values, which match ps_hosts and worker_hosts, and I have also turned off the firewall on both of my machines.
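One way to rule out a remaining connectivity problem, independently of TensorFlow, is to probe every ps/worker endpoint from each machine. A minimal sketch, with placeholder host lists (substitute the real --ps_hosts/--worker_hosts values):

import socket

# Hypothetical endpoint lists; replace with the actual --ps_hosts/--worker_hosts.
PS_HOSTS = ["10.0.0.1:2300", "10.0.0.2:2311"]
WORKER_HOSTS = ["10.0.0.1:2200", "10.0.0.2:2222"]

for endpoint in PS_HOSTS + WORKER_HOSTS:
    host, port = endpoint.rsplit(":", 1)
    try:
        # A successful connect means the port is reachable from this machine.
        sock = socket.create_connection((host, int(port)), timeout=3)
        sock.close()
        print("OK      " + endpoint)
    except OSError as e:
        print("FAILED  %s (%s)" % (endpoint, e))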

Also, I have tried running two ps and two workers on the same physical machine with different ports; it runs, but returns messages like:

Epoch 39, Iteration 0, Time(s) 0.0530, Loss 0.88920
Epoch 39, Iteration 1, Time(s) 0.0529, Loss 0.59737
Epoch 39, Iteration 2, Time(s) 0.0523, Loss 0.77411
Epoch 39, Iteration 3, Time(s) 0.0497, Loss 0.79809
Epoch 39, Iteration 4, Time(s) 0.0540, Loss 0.45329
Epoch 39, Iteration 5, Time(s) 0.0521, Loss 0.98397
Epoch 39, Iteration 6, Time(s) 0.0504, Loss 0.71765
E0720 11:56:27.316502 13822 notification.cc:194] RpcNotification:Failed req_type:GetNodes   status:Out of range:No more nodes exist.
E0720 11:56:27.316629 13822 distribute_runner.h:125] Rpc failed:Out of range:No more nodes exist.name:GetNodes

Could you please help me figure it out? Thanks.
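A note on the last two log lines: the "Out of range: No more nodes exist" status appears to be the normal end-of-data signal from a node iterator rather than a failure. The bundled examples catch it in Python to end an epoch, roughly as in the sketch below (the sampler object and batch-fetch call here are placeholders, not the exact example code):

import graphlearn as gl

# Sketch of an epoch loop: a GetNodes request returning "Out of range:
# No more nodes exist" surfaces in Python as gl.OutOfRangeError once the
# node generator is exhausted; catching it is how an epoch normally ends.
def run_epoch(node_sampler, train_step):
    while True:
        try:
            nodes = node_sampler.get()  # hypothetical batch fetch
            train_step(nodes)
        except gl.OutOfRangeError:
            break  # normal termination: iterator exhausted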

zhxchnl commented 3 years ago

(Quoting @YukeWang96's report above.)

I have the same problem. Do you have any solution?