JianGoForIt / YellowFin

auto-tuning momentum SGD optimizer
Apache License 2.0

Add YellowFin to tensor2tensor #11

Closed ReDeiPirati closed 7 years ago

ReDeiPirati commented 7 years ago

I am trying to adapt YellowFin so it can be used as an optimizer in tensor2tensor (which uses tensorflow>=1.2.0rc1), but unfortunately I cannot debug this error:

Steps to reproduce

  1. Clone this repo.
  2. Launch the starter.sh script (inside a Docker container is better).
  3. (Optional) Docker container command: nvidia-docker run -it -v $(pwd):/t2t -p 6006:6006 -w /t2t tensorflow/tensorflow:latest-devel-gpu

Error

Using YellowFin
INFO:tensorflow:Computing gradients for global model_fn.
ERROR:tensorflow:==================================
Object was never used (type <class 'tensorflow.python.framework.ops.Operation'>):
<tf.Operation 'training/update_hyper/cond/assert_equal/Assert/Assert' type=Assert>
If you want to mark it as used call its "mark_used()" method.
It was originally created here:
['File "/usr/local/bin/t2t-trainer", line 6, in <module>\n    exec(compile(open(__file__).read(), __file__, \'exec\'))', 'File "/t2t/tensor2tensor/bin/t2t-trainer", line 83, in <module>\n    tf.app.run()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run\n    _sys.exit(main(_sys.argv[:1] + flags_passthrough))', 'File "/t2t/tensor2tensor/bin/t2t-trainer", line 79, in main\n    schedule=FLAGS.schedule)', 'File "/t2t/tensor2tensor/utils/trainer_utils.py", line 247, in run\n    run_locally(exp_fn(output_dir))', 'File "/t2t/tensor2tensor/utils/trainer_utils.py", line 537, in run_locally\n    exp.train_and_evaluate()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 495, in train_and_evaluate\n    self.train(delay_secs=0)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train\n    hooks=self._train_monitors + extra_hooks)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 665, in _call_train\n    monitors=hooks)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 289, in new_func\n    return func(*args, **kwargs)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit\n    loss = self._train_model(input_fn=input_fn, hooks=hooks)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 955, in _train_model\n    model_fn_ops = self._get_train_ops(features, labels)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1162, in _get_train_ops\n    return self._call_model_fn(features, labels, model_fn_lib.ModeKeys.TRAIN)', 'File 
"/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1133, in _call_model_fn\n    model_fn_results = self._model_fn(features, labels, **kwargs)', 'File "/t2t/tensor2tensor/utils/trainer_utils.py", line 520, in model_fn\n    colocate_gradients_with_ops=True)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/layers/python/layers/optimizers.py", line 293, in optimize_loss\n    name="train")', 'File "/t2t/tensor2tensor/utils/trainer_utils.py", line 1154, in apply_gradients\n    gradients, global_step=global_step, name=name)', 'File "/t2t/tensor2tensor/utils/yellowfin.py", line 222, in apply_gradients\n    update_hyper_op = self.update_hyper_param()', 'File "/t2t/tensor2tensor/utils/yellowfin.py", line 190, in update_hyper_param\n    lambda: self._mu_var) )', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 289, in new_func\n    return func(*args, **kwargs)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 1814, in cond\n    orig_res_t, res_t = context_t.BuildCondBranch(true_fn)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/control_flow_ops.py", line 1689, in BuildCondBranch\n    original_result = fn()', 'File "/t2t/tensor2tensor/utils/yellowfin.py", line 189, in <lambda>\n    self._mu = tf.identity(tf.cond(self._do_tune, lambda: self.get_mu_tensor(),', 'File "/t2t/tensor2tensor/utils/yellowfin.py", line 180, in get_mu_tensor\n    tf.assert_equal(tf.size(root), tf.constant(1) )', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/check_ops.py", line 318, in assert_equal\n    return control_flow_ops.Assert(condition, data, summarize=summarize)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 170, in wrapped\n    return _add_should_use_warning(fn(*args, **kwargs))', 'File 
"/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 139, in _add_should_use_warning\n    wrapped = TFShouldUseWarningWrapper(x)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 96, in __init__\n    stack = [s.strip() for s in traceback.format_stack()]']
==================================
INFO:tensorflow:Global model_fn finished.
INFO:tensorflow:Create CheckpointSaverHook.
2017-07-06 14:31:31.807218: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-06 14:31:31.807260: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-06 14:31:31.807285: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-07-06 14:31:31.855132: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:893] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-07-06 14:31:31.855471: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties: 
name: GeForce GTX 670MX
major: 3 minor: 0 memoryClockRate (GHz) 0.601
pciBusID 0000:01:00.0
Total memory: 2.94GiB
Free memory: 2.60GiB
2017-07-06 14:31:31.855541: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0 
2017-07-06 14:31:31.855567: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   Y 
2017-07-06 14:31:31.855606: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 670MX, pci bus id: 0000:01:00.0)
2017-07-06 14:31:32.895272: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
     [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895276: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
     [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895446: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
     [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895327: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
     [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895466: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
     [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895573: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
     [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895625: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
     [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895675: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
     [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895693: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
     [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.895545: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
     [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.897115: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
     [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.901863: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
     [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.902270: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
     [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.902804: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
     [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.903010: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
     [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.903597: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
     [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.904450: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
     [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.904735: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
     [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:32.907982: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
     [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
2017-07-06 14:31:33.041912: W tensorflow/core/framework/op_kernel.cc:1158] Failed precondition: Attempting to use uninitialized value global_step
     [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]
Traceback (most recent call last):
  File "/usr/local/bin/t2t-trainer", line 6, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/t2t/tensor2tensor/bin/t2t-trainer", line 83, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/t2t/tensor2tensor/bin/t2t-trainer", line 79, in main
    schedule=FLAGS.schedule)
  File "/t2t/tensor2tensor/utils/trainer_utils.py", line 247, in run
    run_locally(exp_fn(output_dir))
  File "/t2t/tensor2tensor/utils/trainer_utils.py", line 537, in run_locally
    exp.train_and_evaluate()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 495, in train_and_evaluate
    self.train(delay_secs=0)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train
    hooks=self._train_monitors + extra_hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 665, in _call_train
    monitors=hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit
    loss = self._train_model(input_fn=input_fn, hooks=hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1003, in _train_model
    config=self._session_config
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 352, in MonitoredTrainingSession
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 648, in __init__
    stop_grace_period_secs=stop_grace_period_secs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 477, in __init__
    self._sess = _RecoverableSession(self._coordinated_creator)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 822, in __init__
    _WrappedSession.__init__(self, self._create_session())
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 827, in _create_session
    return self._sess_creator.create_session()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 538, in create_session
    self.tf_sess = self._session_creator.create_session()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 412, in create_session
    init_fn=self._scaffold.init_fn)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 279, in prepare_session
    sess.run(init_op, feed_dict=init_feed_dict)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 789, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 997, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1132, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.FailedPreconditionError: Attempting to use uninitialized value global_step
     [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]

Caused by op u'global_step/read', defined at:
  File "/usr/local/bin/t2t-trainer", line 6, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/t2t/tensor2tensor/bin/t2t-trainer", line 83, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/t2t/tensor2tensor/bin/t2t-trainer", line 79, in main
    schedule=FLAGS.schedule)
  File "/t2t/tensor2tensor/utils/trainer_utils.py", line 247, in run
    run_locally(exp_fn(output_dir))
  File "/t2t/tensor2tensor/utils/trainer_utils.py", line 537, in run_locally
    exp.train_and_evaluate()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 495, in train_and_evaluate
    self.train(delay_secs=0)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train
    hooks=self._train_monitors + extra_hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 665, in _call_train
    monitors=hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 289, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit
    loss = self._train_model(input_fn=input_fn, hooks=hooks)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 952, in _train_model
    global_step = contrib_framework.create_global_step(g)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/framework/python/ops/variables.py", line 133, in create_global_step
    return training_util.create_global_step(graph)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/training_util.py", line 119, in create_global_step
    collections=[ops.GraphKeys.GLOBAL_VARIABLES, ops.GraphKeys.GLOBAL_STEP])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 1065, in get_variable
    use_resource=use_resource, custom_getter=custom_getter)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 962, in get_variable
    use_resource=use_resource, custom_getter=custom_getter)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 367, in get_variable
    validate_shape=validate_shape, use_resource=use_resource)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 352, in _true_getter
    use_resource=use_resource)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variable_scope.py", line 725, in _get_single_variable
    validate_shape=validate_shape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 200, in __init__
    expected_shape=expected_shape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/variables.py", line 319, in _init_from_args
    self._snapshot = array_ops.identity(self._variable, name="read")
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 1303, in identity
    result = _op_def_lib.apply_op("Identity", input=input, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

FailedPreconditionError (see above for traceback): Attempting to use uninitialized value global_step
     [[Node: global_step/read = Identity[T=DT_INT64, _class=["loc:@global_step"], _device="/job:localhost/replica:0/task:0/cpu:0"](global_step)]]

ERROR:tensorflow:==================================
Object was never used (type <class 'tensorflow.python.framework.ops.Tensor'>):
<tf.Tensor 'report_uninitialized_variables_1/boolean_mask/Gather:0' shape=(?,) dtype=string>
If you want to mark it as used call its "mark_used()" method.
It was originally created here:
['File "/usr/local/bin/t2t-trainer", line 6, in <module>\n    exec(compile(open(__file__).read(), __file__, \'exec\'))', 'File "/t2t/tensor2tensor/bin/t2t-trainer", line 83, in <module>\n    tf.app.run()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run\n    _sys.exit(main(_sys.argv[:1] + flags_passthrough))', 'File "/t2t/tensor2tensor/bin/t2t-trainer", line 79, in main\n    schedule=FLAGS.schedule)', 'File "/t2t/tensor2tensor/utils/trainer_utils.py", line 247, in run\n    run_locally(exp_fn(output_dir))', 'File "/t2t/tensor2tensor/utils/trainer_utils.py", line 537, in run_locally\n    exp.train_and_evaluate()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 495, in train_and_evaluate\n    self.train(delay_secs=0)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 275, in train\n    hooks=self._train_monitors + extra_hooks)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/experiment.py", line 665, in _call_train\n    monitors=hooks)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 289, in new_func\n    return func(*args, **kwargs)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 455, in fit\n    loss = self._train_model(input_fn=input_fn, hooks=hooks)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/learn/python/learn/estimators/estimator.py", line 1003, in _train_model\n    config=self._session_config', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 352, in MonitoredTrainingSession\n    stop_grace_period_secs=stop_grace_period_secs)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 648, in __init__\n    
stop_grace_period_secs=stop_grace_period_secs)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 477, in __init__\n    self._sess = _RecoverableSession(self._coordinated_creator)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 822, in __init__\n    _WrappedSession.__init__(self, self._create_session())', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 827, in _create_session\n    return self._sess_creator.create_session()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 538, in create_session\n    self.tf_sess = self._session_creator.create_session()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 403, in create_session\n    self._scaffold.finalize()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 192, in finalize\n    default_ready_for_local_init_op)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 254, in get_or_default\n    op = default_constructor()', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 189, in default_ready_for_local_init_op\n    variables.global_variables())', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 170, in wrapped\n    return _add_should_use_warning(fn(*args, **kwargs))', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 139, in _add_should_use_warning\n    wrapped = TFShouldUseWarningWrapper(x)', 'File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/tf_should_use.py", line 96, in __init__\n    stack = [s.strip() for s in traceback.format_stack()]']
==================================

If you do not want to help or contribute, please close the issue and forgive me. Otherwise, I will appreciate any help :)

I've also tried to write YellowFin as a tf.train.Optimizer, but going down to the C++ level seems to be beyond my skills at the moment...

jmhessel commented 7 years ago

(This seems to be the same error I mentioned in #6.)

JianGoForIt commented 7 years ago

Agree with @jmhessel. Did you use an external global_step?

The global_step argument in this line https://github.com/JianGoForIt/YellowFin/blob/master/tuner_utils/yellowfin.py#L204 is a dummy argument.

ReDeiPirati commented 7 years ago

Yes, there is another global_step variable, but it is correctly initialized. Unfortunately, the last merge on tensor2tensor introduced some bugs in the models on which I tested YellowFin. I need to investigate further, so I am closing this for the moment.

jmhessel commented 7 years ago

Something should probably be done with that dummy argument, but I didn't want to mess anything up (i.e., I wasn't 100% sure the global step tracked by YF was the same as the one passed by Keras).
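
To make the "dummy argument" point concrete, here is a minimal sketch in plain Python. It is not YellowFin's or TensorFlow's actual code: `StepCounter`, `DummyStepOptimizer`, `ForwardingStepOptimizer` and `run` are hypothetical names invented for illustration. It only shows the contract a caller relies on when it passes `global_step` to `apply_gradients` (in `tf.train.Optimizer`, that variable is incremented by one after each update), and what the caller sees if the argument is accepted but ignored. Whether this is related to the uninitialized `global_step` error above is not established in this thread.

```python
class StepCounter:
    """Hypothetical stand-in for an external global_step variable."""

    def __init__(self):
        self.value = 0


class DummyStepOptimizer:
    """Accepts global_step but never touches it (a dummy argument)."""

    def __init__(self):
        self.internal_step = 0  # the optimizer's own private step count

    def apply_gradients(self, grads_and_vars, global_step=None):
        # Only the private counter advances; the caller's counter is ignored.
        self.internal_step += 1
        return self.internal_step


class ForwardingStepOptimizer(DummyStepOptimizer):
    """Same optimizer, but it also advances the caller's global_step."""

    def apply_gradients(self, grads_and_vars, global_step=None):
        step = super().apply_gradients(grads_and_vars)
        if global_step is not None:
            global_step.value += 1  # keep the external counter in sync
        return step


def run(optimizer, n_steps):
    """Simulate a training loop that passes its own step counter."""
    counter = StepCounter()
    for _ in range(n_steps):
        optimizer.apply_gradients([], global_step=counter)
    return counter.value, optimizer.internal_step


print(run(DummyStepOptimizer(), 3))       # external counter stays behind
print(run(ForwardingStepOptimizer(), 3))  # both counters advance together
```

With the dummy version, anything in the training framework that reads the external step (checkpoint naming, learning-rate schedules, stop conditions) would see a counter that never moves, which is one reason to either forward the argument or document clearly that it is ignored.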