davidhershey / feudal_networks

An implementation of FeUdal Networks for Hierarchical Reinforcement Learning as published: https://arxiv.org/abs/1703.01161
MIT License

trouble in "--policy feudal" #6

Open huoliangyu opened 6 years ago

huoliangyu commented 6 years ago

Hi, I would like to use your project, but I ran into trouble when setting "--policy feudal". Running python train.py directly works fine with the default "--policy lstm", but when I add the flag and run python train.py --policy feudal, I get the following output:

[2018-04-19 22:01:28,989] Events directory: /tmp/pong/train_0
[2018-04-19 22:01:29,342] Starting session. If this hangs, we're mostly likely waiting to connect to the parameter server. One common cause is that the parameter server DNS name isn't resolving yet, or is misspecified.
2018-04-19 22:01:29.431565: I tensorflow/core/distributed_runtime/master_session.cc:998] Start master session 0f5becf7698cbfb7 with config: intra_op_parallelism_threads: 1 device_filters: "/job:ps" device_filters: "/job:worker/task:0/cpu:0" inter_op_parallelism_threads: 2
Traceback (most recent call last):
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
    return fn(*args)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1306, in _run_fn
    status, run_metadata)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.NotFoundError: Key global/FeUdal/worker/rnn/basic_lstm_cell/bias/Adam_1 not found in checkpoint
   [[Node: save/RestoreV2_55 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/replica:0/task:0/cpu:0"](_recv_save/Const_0_S1, save/RestoreV2_55/tensor_names, save/RestoreV2_55/shape_and_slices)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "worker.py", line 174, in <module>
    tf.app.run()
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "worker.py", line 166, in main
    run(args, server)
  File "worker.py", line 94, in run
    with sv.managed_session(server.target, config=config) as sess, sess.as_default():
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py", line 964, in managed_session
    self.stop(close_summary_writer=close_summary_writer)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py", line 792, in stop
    stop_grace_period_secs=self._stop_grace_secs)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/training/coordinator.py", line 389, in join
    six.reraise(*self._exc_info_to_raise)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/six.py", line 686, in reraise
    raise value
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py", line 953, in managed_session
    start_standard_services=start_standard_services)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/training/supervisor.py", line 708, in prepare_or_wait_for_session
    init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/training/session_manager.py", line 273, in prepare_session
    config=config)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/training/session_manager.py", line 205, in _restore_checkpoint
    saver.restore(sess, ckpt.model_checkpoint_path)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1560, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1124, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    options, run_metadata)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key global/FeUdal/worker/rnn/basic_lstm_cell/bias/Adam_1 not found in checkpoint
   [[Node: save/RestoreV2_55 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/replica:0/task:0/cpu:0"](_recv_save/Const_0_S1, save/RestoreV2_55/tensor_names, save/RestoreV2_55/shape_and_slices)]]

Caused by op 'save/RestoreV2_55', defined at:
  File "worker.py", line 174, in <module>
    tf.app.run()
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "worker.py", line 166, in main
    run(args, server)
  File "worker.py", line 50, in run
    saver = FastSaver(variables_to_save)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1140, in __init__
    self.build()
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 1172, in build
    filename=self._filename)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 688, in build
    restore_sequentially, reshape)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 407, in _AddRestoreOps
    tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/training/saver.py", line 247, in restore_op
    [spec.tensor.dtype])[0])
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/ops/gen_io_ops.py", line 663, in restore_v2
    dtypes=dtypes, name=name)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1204, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

NotFoundError (see above for traceback): Key global/FeUdal/worker/rnn/basic_lstm_cell/bias/Adam_1 not found in checkpoint
   [[Node: save/RestoreV2_55 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:ps/replica:0/task:0/cpu:0"](_recv_save/Const_0_S1, save/RestoreV2_55/tensor_names, save/RestoreV2_55/shape_and_slices)]]

ERROR:tensorflow:==================================
Object was never used (type <class 'tensorflow.python.framework.ops.Tensor'>):
<tf.Tensor 'report_uninitialized_variables/boolean_mask/Gather:0' shape=(?,) dtype=string>
If you want to mark it as used call its "mark_used()" method.
It was originally created here:
  File "worker.py", line 174, in <module>
    tf.app.run()
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "worker.py", line 166, in main
    run(args, server)
  File "worker.py", line 77, in run
    ready_op=tf.report_uninitialized_variables(variables_to_save),
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/util/tf_should_use.py", line 175, in wrapped
    return _add_should_use_warning(fn(*args, **kwargs))
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/util/tf_should_use.py", line 144, in _add_should_use_warning
    wrapped = TFShouldUseWarningWrapper(x)
  File "/home/xuntian2/anaconda2/envs/fedal_tf16/lib/python3.6/site-packages/tensorflow/python/util/tf_should_use.py", line 101, in __init__
    stack = [s.strip() for s in traceback.format_stack()]


Could you please tell me what the problem is? Thanks a lot.
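For context, the NotFoundError says the Saver is trying to restore global/FeUdal/worker/rnn/basic_lstm_cell/bias/Adam_1 from an existing checkpoint that does not contain that key. A likely cause (an inference from the traceback, not something the repo documents) is that the Supervisor is picking up a checkpoint left over from the earlier default --policy lstm run, since both runs appear to write to the same default log directory (/tmp/pong in the log above). Below is a minimal diagnostic sketch, assuming TensorFlow 1.x and assuming the checkpoints live under /tmp/pong/train (that path is a guess based on the "Events directory: /tmp/pong/train_0" line; adjust it to wherever your checkpoints actually are):

    # Diagnostic sketch (not part of the repo): list the variable names stored
    # in the checkpoint the Supervisor is trying to restore.
    import tensorflow as tf

    ckpt_dir = "/tmp/pong/train"  # assumed checkpoint directory for this run
    ckpt_path = tf.train.latest_checkpoint(ckpt_dir)

    if ckpt_path is None:
        print("No checkpoint found in", ckpt_dir)
    else:
        reader = tf.train.NewCheckpointReader(ckpt_path)
        for name in sorted(reader.get_variable_to_shape_map()):
            print(name)
        # If no "global/FeUdal/..." keys are listed, the checkpoint was written
        # by a run with a different graph (e.g. the lstm policy), and restoring
        # it into the feudal graph raises exactly the NotFoundError above.

If the FeUdal keys are indeed missing, deleting the stale /tmp/pong directory (or pointing the feudal run at a fresh log directory) before switching policies should let the Supervisor initialize from scratch instead of restoring the mismatched checkpoint.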

lucasliunju commented 6 years ago

I use "python train.py -p feudal", and the code is running.