JianGoForIt / YellowFin

auto-tuning momentum SGD optimizer
Apache License 2.0

Issue comparing to default optimizer setting in cifar10 in tensorflow tutorials #7

Closed jinxin0924 closed 7 years ago

jinxin0924 commented 7 years ago

I have tried to replace the optimizer with YellowFin in the cifar10 model from the TensorFlow tutorials, but it did not perform well; it was much worse than the original SGD with learning rate decay.

The original code is:

  with tf.control_dependencies([loss_averages_op]):
    opt = tf.train.GradientDescentOptimizer(lr)
    grads = opt.compute_gradients(total_loss)

  # Apply gradients.
  apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)

My code is:

  with tf.control_dependencies([loss_averages_op]):
    opt = YFOptimizer(lr=1.0, mu=0.0)
    # opt = tf.train.GradientDescentOptimizer(learning_rate=0.01)
    grads = opt.compute_gradients(total_loss)

  # Apply gradients.
  apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)

I simply copied yellowfin.py from Zehaos's fork, which adds a compute_gradients function.
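(For reference, if yellowfin.py did not have compute_gradients, I believe the (gradient, variable) pairs could also be built manually with tf.gradients. A minimal sketch, assuming total_loss, loss_averages_op, and global_step are defined as in the tutorial:)

  import tensorflow as tf

  with tf.control_dependencies([loss_averages_op]):
    opt = YFOptimizer(lr=1.0, mu=0.0)
    # Build the (gradient, variable) pairs by hand instead of calling
    # opt.compute_gradients, which the upstream yellowfin.py may lack.
    tvars = tf.trainable_variables()
    grads = list(zip(tf.gradients(total_loss, tvars), tvars))

  apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)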

Did I miss something?

JianGoForIt commented 7 years ago

Hi @jinxin0924,

Thanks for trying out the optimizer. Could you be more specific about the phenomenon you observe?

Did you use multiple GPUs? I noticed that you have with tf.control_dependencies([loss_averages_op]). Could you run a single-GPU version, as we have only tested single GPU/CPU training?

How many iterations had it run before you observed the undesired loss level? Since YF's lr and momentum change gradually, it is normal for it to be slow in the first few thousand iterations, but in the long run it can be better, as we have shown in our paper.

jinxin0924 commented 7 years ago

Hi @JianGoForIt, thanks for your reply. Actually I used a single GPU and 1,000,000 steps to train the model. The result is shown in the plot below (training loss recorded every 10 steps):

[training loss plot]

PS: I am new to deep learning, so maybe I did something wrong in my code...

JianGoForIt commented 7 years ago

Hi @jinxin0924,

This plot is very informative. If you smooth the curve a bit, you should find that YF is better than the default before the learning rate drop.

The default starts to win after it drops the learning rate, while YF does not drop it. We have already left an interface for dropping the learning rate in YF too. Please check out the detailed guideline section in the README here for more information.

https://github.com/JianGoForIt/YellowFin

If you also drop the learning rate of YellowFin at some point, the performance should improve. I'm not sure how you should set the drop factor for the learning rate in your model, but I suggest starting with larger factors like 0.5 or 0.1.
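Roughly, the idea is something like the sketch below; the attribute name opt.lr_factor and the drop point are only for illustration, so please check the README and yellowfin.py for the exact interface:

# Sketch only: scale YF's effective learning rate down by 10x at step 100000.
# opt.lr_factor as a tf.Variable is an assumption here; see the README for
# the exact name of the learning-rate-factor interface in your version.
drop_lr_op = tf.assign(opt.lr_factor, 0.1)

for step in range(max_steps):
  if step == 100000:   # illustrative drop point, tune it for your schedule
    sess.run(drop_lr_op)
  sess.run(apply_gradient_op)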

More generally speaking, YF aims to be a no-tuning optimizer. We wouldn't expect to always beat a carefully hand-tuned optimizer, especially one with a highly optimized non-constant learning rate schedule. When comparing to such a hand-tuned lr schedule for the default optimizer, you might need to tweak YF a bit with the provided arguments. Sometimes you can also monitor the test accuracy, as the training loss is not always the best proxy for performance.

Please let me know if it helps or not.

JianGoForIt commented 7 years ago

Close due to lack of activity. Please reopen if necessary.

jinxin0924 commented 7 years ago

Sorry for the delay. Over the last few days I used cifar-10 to test several optimizers on a few models (ResNet18, ResNet34, a small CNN) and have several findings:

  1. YellowFin is not sensitive to the initial learning rate after some epochs.
  2. YellowFin always performed well, but not well enough to beat Nesterov momentum with its default values; their performance is almost the same.
  3. YellowFin had more fluctuations, so adding gradient clipping may help (see the sketch after this list).
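One way to add such clipping on top of the compute/apply pattern from my code above is a global-norm clip; a rough sketch, where the clip norm of 5.0 is only an illustrative value:

grads_and_vars = opt.compute_gradients(total_loss)
grads = [g for g, _ in grads_and_vars]
tvars = [v for _, v in grads_and_vars]
# Clip all gradients jointly by their global norm before applying them.
clipped_grads, _ = tf.clip_by_global_norm(grads, clip_norm=5.0)
apply_gradient_op = opt.apply_gradients(list(zip(clipped_grads, tvars)),
                                        global_step=global_step)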

Next I plan to test Yellowfin in asynchronous parallelism. Hope it will perform well!

JianGoForIt commented 7 years ago

@jinxin0924 Thanks for the feedback.

Regarding 2, I have a quick comment: you may want to try an initial momentum of 0.9 or an initial learning rate of 1.0. On CNNs, YF typically accelerates with this setting compared to the default. We use the default values as a safe setting that works for all the experiments in our paper.
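Concretely, with the constructor from your snippet above, that would be:

opt = YFOptimizer(lr=1.0, mu=0.9)  # try lr 1.0 / momentum 0.9 instead of the defaults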

Just to be clear, you are comparing YF to hand-tuned Nesterov momentum, right?

Cheers,

jinxin0924 commented 7 years ago

@JianGoForIt Sorry, I made a mistake. I used plain momentum rather than Nesterov momentum. The momentum optimizer used a learning rate of 0.01, momentum 0.9, and no learning rate decay.