alibaba / graph-learn

An Industrial Graph Neural Network Framework
Apache License 2.0
1.28k stars 267 forks

Error when using RPC mode to sync system states #91

Closed amznero closed 2 years ago

amznero commented 3 years ago

Hi there, something went wrong when I used RPC mode to sync system states, which was mentioned in PR #65.


Dataset: Cora
Code: graph-learn/examples/tf/graphsage
Config: 1 ps + 1 worker

CODE SNIPPET


  gl.set_tracker_mode(0)  # use RPC mode to sync system states
  ...
  graph_cluster = {"client": FLAGS.worker_hosts, "server": FLAGS.ps_hosts}
  # graph_cluster = {"client_count": len(FLAGS.worker_hosts.split(",")), "tracker": FLAGS.tracker, "server_count": len(FLAGS.ps_hosts.split(","))}
  g.init(cluster=graph_cluster, job_name=g_role, task_index=FLAGS.task_index)
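For context, the RPC-mode cluster dict above is built straight from the comma-separated host flags. A minimal sketch (the flag values are examples and the helper name is hypothetical, not part of graph-learn's API):

```python
# Minimal sketch: build the graph-learn RPC-mode cluster spec from the
# comma-separated host strings used in the snippet above.
# build_graph_cluster is an illustrative helper, not a graph-learn API.
def build_graph_cluster(worker_hosts, ps_hosts):
    # In RPC mode the dict takes raw comma-separated strings:
    # clients are the TF workers, servers host the graph-learn server.
    return {"client": worker_hosts, "server": ps_hosts}

cluster = build_graph_cluster("worker-0:2222", "ps-0:2222")
```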

ERROR LOG

PS

('run graph-learn command:', u'python /app/code-19811a6d/dist_train.py --ps_hosts=graph-learn-test5wl55-ps-0.ai-test.svc:2222 --worker_hosts=graph-learn-test5wl55-worker-0.ai-test.svc:2222 --job_name=ps --task_index=0')

E1020 08:04:33.558741500      81 server_chttp2.cc:40]        {"created":"@1603181073.558718452","description":"Name or service not known","errno":-2,"file":"src/core/lib/iomgr/resolve_address_posix.cc","file_line":108,"os_error":"Name or service not known","syscall":"getaddrinfo","target_address":"graph-learn-test5wl55-ps-0.ai-test.svc:2222"}

Segmentation fault (core dumped)

WORKER

2020-10-20 08:04:43.987375: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0

2020-10-20 08:04:53.987584: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0

2020-10-20 08:05:03.987794: I tensorflow/core/distributed_runtime/master.cc:267] CreateSession still waiting for response from worker: /job:ps/replica:0/task:0
...
Seventeen17 commented 3 years ago

The graph-learn server and the TensorFlow ps can't listen on the same address, so assign a separate address (e.g. another port) for the graph-learn server. For example:

python dist_train.py --ps_hosts=ip1:2222 --worker_hosts=ip2:2222 --gl_hosts=ip1:2223 --job_name=ps --task_index=0

Then use gl_hosts to construct the graph-learn cluster:

graph_cluster = {"client": FLAGS.worker_hosts, "server": FLAGS.gl_hosts}
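Putting the two pieces together, a hedged sketch of the fixed flag wiring (the flag names `ps_hosts`, `worker_hosts`, and `gl_hosts` follow the command above; the `Flags` tuple and helper are illustrative stand-ins for the script's real flag parsing):

```python
# Sketch of the fix: TF ps and the graph-learn server listen on
# different ports of the same machine, so their clusters are built
# from separate flags. Flags/make_clusters are illustrative only.
from collections import namedtuple

Flags = namedtuple("Flags", ["ps_hosts", "worker_hosts", "gl_hosts"])

def make_clusters(flags):
    # The TensorFlow cluster still uses ps_hosts/worker_hosts.
    tf_cluster = {"ps": flags.ps_hosts.split(","),
                  "worker": flags.worker_hosts.split(",")}
    # The graph-learn cluster uses the dedicated gl_hosts addresses.
    graph_cluster = {"client": flags.worker_hosts, "server": flags.gl_hosts}
    return tf_cluster, graph_cluster

flags = Flags(ps_hosts="ip1:2222", worker_hosts="ip2:2222", gl_hosts="ip1:2223")
tf_cluster, graph_cluster = make_clusters(flags)
```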
amznero commented 3 years ago

Thank you for the response.

It works for me when I choose two different addresses for the TF-PS and GL-server processes.

But if I use a K8s TFJob to start the program, each pod only gets one IP/host.

Are there any solutions to this problem?

Seventeen17 commented 3 years ago

In Kubeflow, you can add Evaluator replicas to serve as the GL server.
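To make that concrete, one hedged way to wire it up in the training script is to dispatch on the TFJob replica type, treating Evaluator pods as graph-learn server hosts so every role gets its own pod and IP (the role names and helper below are assumptions for illustration, not a graph-learn or Kubeflow API):

```python
# Hypothetical role dispatch: in a Kubeflow TFJob, each replica type
# (PS / Worker / Evaluator) runs in its own pod with its own IP, so
# Evaluator pods can be repurposed to host the graph-learn server.
def resolve_role(tfjob_replica_type):
    mapping = {
        "ps": "tf_ps",             # TensorFlow parameter server
        "worker": "tf_worker",     # TensorFlow worker / graph-learn client
        "evaluator": "gl_server",  # repurposed to run the graph-learn server
    }
    return mapping.get(tfjob_replica_type.lower(), "unknown")
```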