alibaba / euler

A distributed graph deep learning framework.
Apache License 2.0

Save Embedding fails during distributed training on K8S #145

Open da-liii opened 5 years ago

da-liii commented 5 years ago
# ps0
python -m tf_euler --ps_hosts=tf-ps0:8080,tf-ps1:8080 --worker_hosts=tf-worker0:8080,tf-worker1:8080,tf-worker2:8080 --job_name=ps --task_index=0 --model_dir=hdfs://a.b.c.d:8020/user/root/model_ppi
# ps1
python -m tf_euler --ps_hosts=tf-ps0:8080,tf-ps1:8080 --worker_hosts=tf-worker0:8080,tf-worker1:8080,tf-worker2:8080 --job_name=ps --task_index=1 --model_dir=hdfs://a.b.c.d:8020/user/root/model_ppi

# worker0
python -m tf_euler --ps_hosts=tf-ps0:8080,tf-ps1:8080 --worker_hosts=tf-worker0:8080,tf-worker1:8080,tf-worker2:8080 --job_name=worker --task_index=0 --data_dir hdfs://a.b.c.d:8020/user/root/ppi_data_3/ --euler_zk_addr a.b.c.zk:2181 --euler_zk_path /euler --max_id 56944 --feature_idx 1 --feature_dim 50 --label_idx 0 --label_dim 121 --model graphsage_supervised --mode save_embedding --model_dir=hdfs://a.b.c.d:8020/user/root/model_ppi
# worker1
python -m tf_euler --ps_hosts=tf-ps0:8080,tf-ps1:8080 --worker_hosts=tf-worker0:8080,tf-worker1:8080,tf-worker2:8080 --job_name=worker --task_index=1 --data_dir hdfs://a.b.c.d:8020/user/root/ppi_data_3/ --euler_zk_addr a.b.c.zk:2181 --euler_zk_path /euler --max_id 56944 --feature_idx 1 --feature_dim 50 --label_idx 0 --label_dim 121 --model graphsage_supervised --mode save_embedding --model_dir=hdfs://a.b.c.d:8020/user/root/model_ppi
# worker2
python -m tf_euler --ps_hosts=tf-ps0:8080,tf-ps1:8080 --worker_hosts=tf-worker0:8080,tf-worker1:8080,tf-worker2:8080 --job_name=worker --task_index=2 --data_dir hdfs://a.b.c.d:8020/user/root/ppi_data_3/ --euler_zk_addr a.b.c.zk:2181 --euler_zk_path /euler --max_id 56944 --feature_idx 1 --feature_dim 50 --label_idx 0 --label_dim 121 --model graphsage_supervised --mode save_embedding --model_dir=hdfs://a.b.c.d:8020/user/root/model_ppi
# kubectl logs tf-worker0-79998df4b7-st2xk
hdfs file io factory register
local file io factory register
hdfs file io factory register
local file io factory register
2019-08-21 15:24:51.667278: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-08-21 15:24:51.678608: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job ps -> {0 -> tf-ps0:8080, 1 -> tf-ps1:8080}
2019-08-21 15:24:51.678654: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:222] Initialize GrpcChannelCache for job worker -> {0 -> localhost:8080, 1 -> tf-worker1:8080, 2 -> tf-worker2:8080}
2019-08-21 15:24:51.681684: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:381] Started server with target: grpc://localhost:8080
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0821 15:24:51.683377    34 remote_graph.cc:91] Initialize RemoteGraph, connect to server monitor: [10.1.170.3:2181, /euler]
I0821 15:24:51.697079   182 zk_server_monitor.cc:238] Online node: 2#10.0.68.131:42462.
I0821 15:24:51.697660    34 remote_graph.cc:106] Retrieve meta info success, shard number: 3
I0821 15:24:51.697679    34 remote_graph.cc:119] Retrieve meta info success, partition number: 3
19/08/21 07:24:53 WARN hdfs.DFSClient: zero
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0821 15:24:53.785856   200 graph_builder.cc:81] Load Done: hdfs://10.1.170.1:8020/user/root/ppi_data_3/part_0.dat
I0821 15:24:53.786360   178 graph_builder.cc:127] Each Thread Load Finish! Node Count:18982 Edge Count:544746
I0821 15:24:53.786396   178 graph_builder.cc:135] Graph Loading Finish!
I0821 15:24:53.929447   178 graph_builder.cc:147] Graph Load Finish! Node Count:18982 Edge Count:544746
I0821 15:24:53.932195   178 graph_builder.cc:152] Done: build node sampler
I0821 15:24:53.932214   178 graph_builder.cc:162] Graph build finish
I0821 15:24:53.932658   178 graph_service.cc:179] service init finish
I0821 15:24:53.934512   178 graph_service.cc:131] bound port: 10.0.69.132:36734
W0821 15:24:53.947504   178 graph.h:198] global sampler is not ok
I0821 15:24:53.953155   178 graph_service.cc:146] service start
I0821 15:24:53.954351   182 zk_server_monitor.cc:238] Online node: 0#10.0.69.132:36734.
I0821 15:24:53.955094    34 remote_graph.cc:190] Retrieve Shard Meta Info successfully, shard: 0, Key: node_sum_weight, Meta Info: 14969.000000,2171.000000,1842.000000
I0821 15:24:53.955127    34 remote_graph.cc:190] Retrieve Shard Meta Info successfully, shard: 0, Key: edge_sum_weight, Meta Info:
I0821 15:24:54.050494   182 zk_server_monitor.cc:238] Online node: 1#10.0.69.133:40330.
I0821 15:24:54.050705    34 remote_graph.cc:190] Retrieve Shard Meta Info successfully, shard: 1, Key: node_sum_weight, Meta Info: 14969.000000,2171.000000,1841.000000
I0821 15:24:54.050724    34 remote_graph.cc:190] Retrieve Shard Meta Info successfully, shard: 1, Key: edge_sum_weight, Meta Info:
I0821 15:24:54.050904    34 remote_graph.cc:190] Retrieve Shard Meta Info successfully, shard: 2, Key: node_sum_weight, Meta Info: 14968.000000,2172.000000,1841.000000
I0821 15:24:54.050923    34 remote_graph.cc:190] Retrieve Shard Meta Info successfully, shard: 2, Key: edge_sum_weight, Meta Info:
WARNING:tensorflow:use_feature is deprecated and would not have any effect.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/euler_gl-0.1.2-py2.7.egg/tf_euler/python/base_layers.py:78: __init__ (from tensorflow.python.ops.init_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.initializers.variance_scaling instead with distribution=uniform to get equivalent behavior.
INFO:tensorflow:Graph was finalized.
19/08/21 07:24:55 WARN hdfs.DFSClient: zero
19/08/21 07:24:55 WARN hdfs.DFSClient: zero
INFO:tensorflow:Restoring parameters from hdfs://10.1.170.1:8020/user/root/model_ppi/model.ckpt-2223
2019-08-21 15:24:55.449866: I tensorflow/core/distributed_runtime/master_session.cc:1161] Start master session 9f629c8b20b8ba57 with config: gpu_options { allow_growth: true }
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:1 workers have finished ...

The logs for worker0, worker1, and worker2 are similar; after this point they keep printing:

INFO:tensorflow:2 workers have finished ...
JestinyJie commented 5 years ago

Have you solved this problem? I ran into it as well, during both evaluate and save.

JestinyJie commented 5 years ago

It seems the first assigned worker prints "1 workers have finished ...", then prints "0 workers have finished", and then loops forever printing "2 workers have finished".

alinamimi commented 5 years ago

Nothing looks wrong based on the logs so far. You can refer to the code at https://github.com/alibaba/euler/blob/ff40594cfebfa55ada4a1142acbc020dab368d81/tf_euler/python/run_loop.py#L181 and debug whether the resulting source is correct.
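
For example, one minimal (purely illustrative) way to dump the sampled source ids while debugging; the real tensor in run_loop.py is built differently, a constant stands in for it here:

import tensorflow as tf

# Hypothetical debugging sketch: wrap the `source` tensor with tf.Print so its
# shape and values are logged to stderr every time it is evaluated.
source = tf.constant([0, 1, 2], dtype=tf.int64)  # stand-in for the sampled ids
source = tf.Print(source, [tf.shape(source), source],
                  message="sampled source ids: ", summarize=32)

with tf.Session() as sess:
  sess.run(source)  # prints the shape and ids to stderr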

JestinyJie commented 5 years ago

目前根据log没发现什么问题,你可以参考下代码 https://github.com/alibaba/euler/blob/ff40594cfebfa55ada4a1142acbc020dab368d81/tf_euler/python/run_loop.py#L181

debug whether the resulting source is correct

The chief worker reliably reproduces this every time: the log goes 1 first, then 0, then 2 2 2 2 2... What could the problem be?

alinamimi commented 5 years ago

Please post some of the debug logs related to source; from the current symptoms alone it's hard to pinpoint the problem.

yangsiran commented 5 years ago

Try removing SyncExitHook and instead sleeping for two minutes at the end.

YanZhangN commented 5 years ago

https://github.com/alibaba/euler/blob/ff40594cfebfa55ada4a1142acbc020dab368d81/tf_euler/python/run_loop.py#L187

The SyncExitHook code is here. As long as all the other workers can finish their own tasks within the sleep window, this works.
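
A rough sketch of this workaround (the function and argument names here are illustrative, not the actual run_loop.py code):

import time

import tensorflow as tf


def run_without_sync_exit_hook(train_op, hooks, master="", is_chief=True,
                               linger_seconds=120):
  # Illustrative only: drop SyncExitHook from the hook list and rely on a
  # fixed sleep at the end instead of the counter-based barrier.
  hooks = [h for h in hooks if type(h).__name__ != "SyncExitHook"]
  with tf.train.MonitoredTrainingSession(master=master, is_chief=is_chief,
                                         hooks=hooks) as sess:
    while not sess.should_stop():
      sess.run(train_op)
  # Crude barrier: give slower workers time to finish before this process
  # (and the graph service it hosts) goes away.
  time.sleep(linger_seconds)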

JestinyJie commented 5 years ago

Try removing SyncExitHook and instead sleeping for two minutes at the end.

With that change it exits normally now, but:

  1. Isn't this a bit too brute-force?

  2. The problem never shows up during train, only during eval and save. Isn't the 0 that appears a bit strange? The hook is supposed to run the +1 first, so 0 should never show up. I also printed the variable's name, and it is the same on every worker.

  3. Also, does it cause any problem if the workers don't exit at the same time?

da-liii commented 5 years ago

Try removing SyncExitHook and instead sleeping for two minutes at the end.

Thanks! @yangsiran

xuetf commented 4 years ago

Try removing SyncExitHook and instead sleeping for two minutes at the end.

Hi, I built the latest master code and ran it in distributed mode with 5 shards, using the example dist_tf_euler.sh (2 ps, 2 workers). train succeeds, but evaluate still hits the cannot-exit problem described in this issue: worker0 and worker1 keep printing "INFO:tensorflow:1 workers have finished ...". How can this be resolved?

xuetf commented 4 years ago

Try removing SyncExitHook and instead sleeping for two minutes at the end.

Hi, I built the latest master code and ran it in distributed mode with 5 shards, using the example dist_tf_euler.sh (2 ps, 2 workers). train succeeds, but evaluate still hits the cannot-exit problem described in this issue: worker0 and worker1 keep printing "INFO:tensorflow:1 workers have finished ...". How can this be resolved?

self._num_finished_workers = tf.Variable(
    0, name="num_finished_workers",
    collections=[tf.GraphKeys.LOCAL_VARIABLES])

def end(self, session):
  # Mark this worker as finished, then poll until every worker has done so.
  session.run(self._finish_self)
  num_finished_workers = session.run(self._num_finished_workers)
  while num_finished_workers < self._num_workers:
    tf.logging.info("%d workers have finished ...", num_finished_workers)
    time.sleep(1)
    num_finished_workers = session.run(self._num_finished_workers)

How do the other workers make _num_finished_workers change? If it doesn't change, then num_finished_workers should always stay at the 1 produced by running self._finish_self, right? Wouldn't that make end() an infinite loop?
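
For comparison, this is how such a cross-worker counter is normally expected to work in TF 1.x between-graph replication (a minimal sketch with made-up hostnames, not the actual tf_euler code): the variable has to live on a device every worker can reach, typically a ps task via replica_device_setter, so that each worker's assign_add updates the same storage. If each worker instead ends up with its own copy of the counter, or another worker re-initializes it, the count can never reach the number of workers and the end() loop above spins forever, which would match the behavior I'm seeing.

import tensorflow as tf

# Minimal sketch of a shared "finished workers" counter (illustrative only;
# hostnames and variable names are made up).
cluster = tf.train.ClusterSpec({"ps": ["ps0:2222"],
                                "worker": ["worker0:2222", "worker1:2222"]})

with tf.device(tf.train.replica_device_setter(cluster=cluster)):
  # The device setter places the variable on the ps task, so every worker's
  # assign_add below increments the same underlying value.
  num_finished_workers = tf.get_variable(
      "num_finished_workers", shape=[], dtype=tf.int32,
      initializer=tf.zeros_initializer(), trainable=False)
  finish_self = tf.assign_add(num_finished_workers, 1)

# Each worker runs `finish_self` once when it is done and then polls
# `num_finished_workers` until it equals the total number of workers --
# exactly the loop in SyncExitHook.end() quoted above.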

WChCh commented 4 years ago

Try removing SyncExitHook and instead sleeping for two minutes at the end.

Hi, I built the latest master code and ran it in distributed mode with 5 shards, using the example dist_tf_euler.sh (2 ps, 2 workers). train succeeds, but evaluate still hits the cannot-exit problem described in this issue: worker0 and worker1 keep printing "INFO:tensorflow:1 workers have finished ...". How can this be resolved?

I'm running into this problem too. Has it been solved?