Qihoo360 / hbox

AI on Hadoop
Apache License 2.0

TensorFlow demo occasionally fails #23

Closed FANNG1 closed 6 years ago

FANNG1 commented 6 years ago

18/03/12 20:50:33 INFO XLearningContainer: WARNING:tensorflow:From demo.py:75: __init__ (from tensorflow.python.training.supervisor) is deprecated and will be removed in a future version.
18/03/12 20:50:33 INFO XLearningContainer: Instructions for updating:
18/03/12 20:50:33 INFO XLearningContainer: Please switch to tf.train.MonitoredTrainingSession
18/03/12 20:50:33 INFO XLearningContainer: 2018-03-12 20:50:33.961848: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error
18/03/12 20:50:33 INFO XLearningContainer: Traceback (most recent call last):
18/03/12 20:50:33 INFO XLearningContainer:   File "demo.py", line 173, in <module>
18/03/12 20:50:33 INFO XLearningContainer:     tf.app.run(main=main)
18/03/12 20:50:33 INFO XLearningContainer:   File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
18/03/12 20:50:33 INFO XLearningContainer:     _sys.exit(main(argv))
18/03/12 20:50:33 INFO XLearningContainer:   File "demo.py", line 76, in main
18/03/12 20:50:33 INFO XLearningContainer:     with sv.prepare_or_wait_for_session(server.target, config = tf.ConfigProto(gpu_options=gpu_options, allow_soft_placement = True, log_device_placement = True)) as sess:
18/03/12 20:50:33 INFO XLearningContainer:   File "/usr/lib/python2.7/site-packages/tensorflow/python/training/supervisor.py", line 726, in prepare_or_wait_for_session
18/03/12 20:50:33 INFO XLearningContainer:     init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
18/03/12 20:50:33 INFO XLearningContainer:   File "/usr/lib/python2.7/site-packages/tensorflow/python/training/session_manager.py", line 281, in prepare_session
18/03/12 20:50:33 INFO XLearningContainer:     sess.run(init_op, feed_dict=init_feed_dict)
18/03/12 20:50:33 INFO XLearningContainer:   File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 905, in run
18/03/12 20:50:33 INFO XLearningContainer:     run_metadata_ptr)
18/03/12 20:50:33 INFO XLearningContainer:   File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1137, in _run
18/03/12 20:50:33 INFO XLearningContainer:     feed_dict_tensor, options, run_metadata)
18/03/12 20:50:33 INFO XLearningContainer:   File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1355, in _do_run
18/03/12 20:50:33 INFO XLearningContainer:     options, run_metadata)
18/03/12 20:50:33 INFO XLearningContainer:   File "/usr/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1374, in _do_call
18/03/12 20:50:33 INFO XLearningContainer:     raise type(e)(node_def, op, message)
18/03/12 20:50:33 INFO XLearningContainer: tensorflow.python.framework.errors_impl.UnavailableError: OS Error

shoukna commented 6 years ago

Hi, I ran into the same problem. Have you solved it?

FANNG1 commented 6 years ago

Not yet. cc @liyuance, have you run into this?

fengzanfeng commented 6 years ago

18/03/15 14:48:13 INFO XLearningContainer: 2018-03-15 14:48:13.840737: E tensorflow/core/distributed_runtime/master.cc:269] Master init: Unavailable: OS Error

I've hit the same problem. The error shows up occasionally, but not on every run. I haven't found the cause yet.

liyuance commented 6 years ago

This looks like an error raised by TF itself. Can you reproduce it by running TF directly, without XLearning?

fengzanfeng commented 6 years ago

@liyuance Yes, it reproduces. It is indeed not an XLearning problem.

fengzanfeng commented 6 years ago
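
  1. Also, I noticed something odd: if the ps has not fully started when the worker calls prepare_or_wait_for_session, the error above is raised;
  2. Adding time.sleep(10) before starting the worker basically avoids the error;
  3. The root cause still needs to be tracked down.
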
shoukna commented 6 years ago

But I added time.sleep(10) before starting the worker, and it still didn't solve the problem @fengzanfeng

fengzanfeng commented 6 years ago

Did you add it at this position? @shoukna

if FLAGS.job_name == "ps":
    server.join()
elif FLAGS.job_name == "worker":
    time.sleep(15)

shoukna commented 6 years ago

Yes.

fengzanfeng commented 6 years ago

I tested on two machines: if the worker is started before the ps is up, the error reproduces every time.
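
Since the failure depends on start order, a fixed time.sleep is only a guess at how long the ps needs. A minimal sketch of an alternative, assuming Python 3 and that a plain TCP connect to each "host:port" is an acceptable readiness probe: poll every address until it accepts a connection or a deadline passes. The helper name wait_for_cluster and its parameters are hypothetical and not part of demo.py, XLearning, or TensorFlow. A successful connect only shows that something is listening on the port, not that the TensorFlow server has finished initializing.

```python
import socket
import time


def wait_for_cluster(addresses, timeout_secs=60.0, poll_interval_secs=1.0):
    """Block until every "host:port" in addresses accepts a TCP connection.

    Returns True once all addresses are reachable, or False if the deadline
    passes first. Hypothetical helper, not part of any library.
    """
    deadline = time.monotonic() + timeout_secs
    pending = list(addresses)
    while pending:
        still_down = []
        for address in pending:
            host, port = address.rsplit(":", 1)
            try:
                # A successful connect only proves that something is listening.
                with socket.create_connection((host, int(port)),
                                              timeout=poll_interval_secs):
                    pass
            except OSError:
                still_down.append(address)
        pending = still_down
        if pending:
            if time.monotonic() >= deadline:
                return False
            time.sleep(poll_interval_secs)
    return True
```

In the worker branch this could stand in for the time.sleep(15) call, passing the same ps (and, if desired, worker) addresses that the script uses to build its cluster definition. How demo.py names those addresses is not shown in this thread, so that wiring is left open.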

shoukna commented 6 years ago

It runs successfully now, thanks @fengzanfeng

FANNG1 commented 6 years ago

I don't know TensorFlow well. Could someone file an issue with the TensorFlow community to find out what is going on?

FANNG1 commented 6 years ago

With two workers, starting only one of them also triggers the error above, every time. This should count as a TensorFlow bug, right?

FANNG1 commented 6 years ago

I filed an issue on GitHub: https://github.com/tensorflow/tensorflow/issues/17736

hanmq commented 6 years ago

Hi, we ran into the same problem too. Have you solved it? @sandflee

chengdianxuezi commented 6 years ago

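I hit this recently while testing 1.8 as well. The way TF works underneath now, each worker connects to every ps during initial setup, and if a ps has not started, an exception is thrown. When the Supervisor interface catches this exception, it simply exits. After I changed tf.train.Supervisor to tf.train.MonitoredTrainingSession, it runs normally, because MonitoredTrainingSession creates a new session when initialization fails.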
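
The difference described in the comment above can be modeled without TensorFlow. The sketch below is a toy illustration only, not TensorFlow code: Unavailable, create_once, and create_with_retry are hypothetical names, and the attempt limit and delay are arbitrary. create_once mirrors the fail-fast behavior attributed to Supervisor, and create_with_retry mirrors the behavior attributed to MonitoredTrainingSession, which builds a fresh session after a failed initialization.

```python
import time


class Unavailable(Exception):
    """Stand-in for a transient 'ps not reachable yet' error (hypothetical)."""


def create_once(session_factory):
    # Fail-fast: the first error propagates to the caller and the job exits.
    return session_factory()


def create_with_retry(session_factory, max_attempts=5, delay_secs=1.0):
    # Recreate-on-failure: call the factory again after each transient error.
    last_error = None
    for _ in range(max_attempts):
        try:
            return session_factory()
        except Unavailable as err:
            last_error = err
            time.sleep(delay_secs)
    # Every attempt failed, so surface the most recent error.
    raise last_error
```

With a factory that fails while the ps is still starting and succeeds afterwards, create_once raises on the first call, while create_with_retry returns a session as soon as the ps is reachable.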

FANNG1 commented 6 years ago

Yes, after switching to MonitoredTrainingSession the problem is gone. @han1057578619 @chengdianxuezi