JianGoForIt / YellowFin

auto-tuning momentum SGD optimizer
Apache License 2.0
422 stars 93 forks source link

Global Step is not updating? #13

Closed kwotsin closed 7 years ago

kwotsin commented 7 years ago

As seen in https://github.com/Zehaos/MobileNet/issues/27 -- the global step does not update after each training step has been taken. Is there a fix to this coming up soon? I have tried both the older version of yellowfin.py in that issue and also the latest one available. In both instances, the global step doesn't update.

I believe the issue comes from the global variable existing only within the optimizer but not globally. As a quick fix, I moved the definition of the global step (at https://github.com/JianGoForIt/YellowFin/blob/master/tuner_utils/yellowfin.py#L60) out of the optimizer and directly in the graph, before feeding in this variable back to the optimizer.

Is there a cleaner solution to this?

ReDeiPirati commented 7 years ago

I believe the issue comes from the global variable existing only within the optimizer but not globally.

You get the point!! I have solved keeping 2 var: an external global_step and the local one(renamed as YF_step). I think that using only one variable is a more elegant solution, but we have to know in advance that the Optimizer needs a global_step variable. This is my solution (i'm trying to adapt Yellowfin to be usable in T2T). Can i see your code?

JianGoForIt commented 7 years ago

Hi @ReDeiPirati Thanks for the solution. But I got a bit confused about the issue. @ReDeiPirati @kwotsin As in this https://github.com/JianGoForIt/YellowFin/blob/master/tuner_utils/yellowfin.py#L224, the self._global_step is incremented every iteration. From with in the optimizer, the self._global_step is updated, otherwise this line can not function properly. Do you mean it is not incremented if you seen it from the outside the optimizer?

Cheers,

kwotsin commented 7 years ago

@JianGoForIt Yes I think when using some tools like a Supervisor to train, having some control of the global step will be very important in tracking the summaries. In that case, is there a way to extract the global step from the optimizer? From what I see, the TF default optimizers have an implicit way of incrementing the global step instead of initializing one within the optimizer.

@ReDeiPirati The fix I used was quite simple, in this line: https://github.com/JianGoForIt/YellowFin/blob/master/tuner_utils/yellowfin.py#L60

I moved the definition of the global step out of the optimizer and in the default graph I'm using, and added an extra argument global_step in the optimizer. Then I fed the global step back to the optimizer in the graph. As the global step was defined outside of the optimizer, it makes it easier to use an exponential decaying learning rate (that could be fed as an initial learning rate to the optimizer) and also provided to a training supervisor for monitoring/summary updates.

ReDeiPirati commented 7 years ago

@JianGoForIt Yes, if we declare an external global_step variable, in your code there isn't an update(is updated only the local one), but the fix is very simple: In minimize pass the global_step,

267    return self.apply_gradients(grads_and_vars, global_step=global_step)

let the underling momentum to update it in apply_gradients

211    self._optimizer.apply_gradients(zip(self._grads_clip, self._tvars), global_step=global_step)
...
214    self._optimizer.apply_gradients(zip(self._grads, self._tvars) global_step=global_step))

As an alternative you can update the external global_step as describe in tf.train.Optimizer. Or as suggested above you can move the global_step outside(but you need to do a sanity check to run the otpimizer).

@kwotsin Thanks. I like your solution but it's not tf.train.Optimizer compliant, because no Optimizer requires global_step(it could be None), while YellowFin needs this argument. For this reason i keep 2 variables.

kwotsin commented 7 years ago

@ReDeiPirati thanks for the nice fix! Is it possible for you to make a pull request? I think this will help other people facing similar issues.

ReDeiPirati commented 7 years ago

@kwotsin PR done!

JianGoForIt commented 7 years ago

@kwotsin @ReDeiPirati , we have updated the code and the global_step issue is gone now. Please check it out.