ShifuML / shifu

An end-to-end machine learning and data mining framework on Hadoop
https://github.com/ShifuML/shifu/wiki
Apache License 2.0

Worker can fail in D-Tensorflow #676

Open zhangpengshan opened 5 years ago

zhangpengshan commented 5 years ago

If a non-chief worker loads its data first and then waits for the chief worker for some time, the failure below occurs:

19-09-11 01:46:31 tensorflow WARNING From /hadoop03/yarn/local/usercache/pengzhang/appcache/application_1567117635652_359198/container_e278_1567117635652_359198_01_000007/x/home/website/python2.7/lib/python2.7/site-packages/tensorflow/python/util/tf_should_use.py:118: initialize_all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02. Instructions for updating: Use tf.global_variables_initializer instead.
19-09-11 01:46:31 root INFO ---Variables initialized---
2019-09-11 01:46:31.434298: I tensorflow/core/distributed_runtime/master_session.cc:1017] Start master session 41037549e32365cb with config: allow_soft_placement: true
INFO:tensorflow:Waiting for model to be ready. Ready_for_local_init_op: Variables not initialized: weight_hidden_layer_0, biases_hidden_layer_0, weight_shifu_output_0, biases_shifu_output_0, global_step, ready: None
19-09-11 01:46:31 tensorflow INFO Waiting for model to be ready. Ready_for_local_init_op: Variables not initialized: weight_hidden_layer_0, biases_hidden_layer_0, weight_shifu_output_0, biases_shifu_output_0, global_step, ready: None
2019-09-11 01:47:01.571978: I tensorflow/core/distributed_runtime/master_session.cc:1017] Start master session f82543dc61f323a9 with config: allow_soft_placement: true
19-09-11 01:47:01 root INFO Starting training on worker 1
19-09-11 01:47:09 root INFO About to execute sync_clean_up_op!
19-09-11 01:47:09 root INFO Done1
19-09-11 01:47:09 root INFO Session from worker 1 closed cleanly
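To make the timing in the log above easier to read, here is a minimal sketch that computes the gaps between the `root` logger entries. It assumes the `yy-mm-dd HH:MM:SS logger LEVEL message` layout seen in those entries; `parse_line` and `gaps_in_seconds` are hypothetical helper names and not part of Shifu or TensorFlow.

```python
from datetime import datetime

# Entries copied from the "root" logger lines in the log above.
LOG_LINES = [
    "19-09-11 01:46:31 root INFO ---Variables initialized---",
    "19-09-11 01:47:01 root INFO Starting training on worker 1",
    "19-09-11 01:47:09 root INFO About to execute sync_clean_up_op!",
    "19-09-11 01:47:09 root INFO Session from worker 1 closed cleanly",
]


def parse_line(line):
    """Split one 'yy-mm-dd HH:MM:SS logger LEVEL message' line (hypothetical helper)."""
    date, time, logger, level, message = line.split(" ", 4)
    stamp = datetime.strptime(date + " " + time, "%y-%m-%d %H:%M:%S")
    return stamp, logger, level, message


def gaps_in_seconds(lines):
    """Return (seconds since the previous entry, message) for each entry after the first."""
    events = [parse_line(line) for line in lines]
    return [
        (int((cur[0] - prev[0]).total_seconds()), cur[3])
        for prev, cur in zip(events, events[1:])
    ]


if __name__ == "__main__":
    for seconds, message in gaps_in_seconds(LOG_LINES):
        print(f"+{seconds:>3}s  {message}")
```

On these entries the sketch reports a 30-second wait before worker 1 starts training, and the session is closed 8 seconds after that. It only summarizes the timeline and does not by itself identify the cause of the early shutdown.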

Thanks, Zhang David