JianGoForIt / YellowFin

auto-tuning momentum SGD optimizer
Apache License 2.0

Bad performance with multiple GPUs #14

Closed: jinxin0924 closed this issue 7 years ago

jinxin0924 commented 7 years ago

I used YellowFin to train ResNet-50 on ImageNet using 4 K80 GPUs and got bad performance. After 50k steps the training loss was about 6, while plain SGD without momentum or learning rate decay reached about 4.7. Any idea what might cause this?

JianGoForIt commented 7 years ago

@jinxin0924 could you replicate the same setting on 1 GPU? Does it perform well there? We have not tested the optimizer on multiple GPUs, so there might be something we need to change in the implementation.

jinxin0924 commented 7 years ago

@JianGoForIt I replicated the same setting on 1 GPU and got a training loss of 5.1, which is better than the multi-GPU result. Have you tried YF on the ImageNet dataset?

JianGoForIt commented 7 years ago

Hey @jinxin0924

Thanks for the feedback. We haven't run any ImageNet experiments yet.

Regarding the performance, could you please provide visualizations of the smoothed/unsmoothed training loss? If possible, could you also send curves of the tuned learning rate and momentum?

If you really want to match the performance of vanilla SGD with learning rate decay, my suggestion is to give the same learning rate decay scheme to YF and see how it works. Here is an example of using lr_factor, which enables a decaying learning rate.
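
A minimal sketch of what that could look like in TF 1.x (not taken from the repo): it assumes the optimizer exposes an lr_factor tf.Variable that multiplies the auto-tuned learning rate and a minimize method as in the README, so check yellowfin.py in your checkout for the exact names. The toy loss and step counts are placeholders.

```python
import tensorflow as tf
from yellowfin import YFOptimizer  # adjust the import path to your checkout

# toy loss just to make the sketch self-contained; substitute your ResNet loss
x = tf.Variable(5.0)
loss = tf.square(x)

opt = YFOptimizer(learning_rate=1.0, momentum=0.0)  # constructor args as in the README
train_op = opt.minimize(loss)

# build the factor-update op once, outside the training loop
factor_ph = tf.placeholder(tf.float32, shape=[])
set_factor_op = tf.assign(opt.lr_factor, factor_ph)  # `lr_factor` name is an assumption

steps_per_epoch, total_steps = 1000, 30000  # placeholders for your setup
factor = 1.0
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1, total_steps + 1):
        sess.run(train_op)
        # the same staircase schedule you would give vanilla SGD, e.g. x0.1 every 10 epochs
        if step % (10 * steps_per_epoch) == 0:
            factor *= 0.1
            sess.run(set_factor_op, feed_dict={factor_ph: factor})
```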

Regarding the multi-GPU issue, my suspicion is that the statistics are per-worker in the current implementation.

Please let us know whether my suggestion works or not.

Cheers,

jinxin0924 commented 7 years ago

@JianGoForIt Hi, here are some visualizations of unsmoothed training loss:

  1. SGD with 4 K80 GPUs, learning rate 0.05, no decay, no momentum (plot: sgd_1gpu)

  2. YF with 4 K80 GPUs, learning rate 1.0, no decay, no clipping (plot: yf_4gpu)

  3. YF with 1 K80 GPU, learning rate 1.0, no decay, no clipping (plot: yf_1gpu)

Actually, I have not tried learning rate decay on ImageNet. If I want to try it, should I use the same learning rate decay scheme as for vanilla SGD?

Besides, do you mean each GPU has its own statistics, so the result is not good?

JianGoForIt commented 7 years ago

Hi @jinxin0924

Thanks for providing the detailed information.

Regarding your question "do you mean each GPU has its own statistics, so the result is not good?": yes, in the current implementation each GPU might be working with its own statistics, which is not good.

Regarding the phenomenon, it is very interesting that there is such a large spike in the learning rate at the beginning (maybe there are some exploding gradients there). Could you please provide txt files containing the following information, each in a separate file? That would make it easier for me to play with the data and look for potential issues (a rough logging sketch follows the list below).

  1. Per-step learning rate
  2. Per-step momentum
  3. Per-step values of self._h_min and self._h_max
  4. Per-step value of self._grad_var
  5. Per-step value of self._dist_to_opt_avg
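
As a starting point, something like the following could dump each quantity to its own file. This is only a rough sketch, not part of the repo: _h_min, _h_max, _grad_var and _dist_to_opt_avg come from the list above, while _lr_var and _mu_var are guesses for the tuned learning rate and momentum variables, so adjust the names to whatever yellowfin.py actually uses in your checkout.

```python
import tensorflow as tf
from yellowfin import YFOptimizer  # adjust the import path to your checkout

# toy loss so the sketch runs standalone; replace with your ResNet-50 loss
x = tf.Variable(5.0)
loss = tf.square(x)

opt = YFOptimizer(learning_rate=1.0, momentum=0.0)
train_op = opt.minimize(loss)

# attribute names below are assumptions; verify them against yellowfin.py
monitored = [
    ("lr.txt", opt._lr_var),
    ("mu.txt", opt._mu_var),
    ("h_min.txt", opt._h_min),
    ("h_max.txt", opt._h_max),
    ("grad_var.txt", opt._grad_var),
    ("dist_to_opt_avg.txt", opt._dist_to_opt_avg),
]
files = {name: open(name, "w") for name, _ in monitored}

total_steps = 50000  # placeholder for your setup
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(total_steps):
        # fetch the monitored quantities alongside the training step
        _, values = sess.run([train_op, [t for _, t in monitored]])
        for (name, _), val in zip(monitored, values):
            files[name].write("%g\n" % val)

for f in files.values():
    f.close()
```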

Maybe we can figure out the phenomenon first before you move to a decaying scheme.

Cheers

kwotsin commented 7 years ago

@JianGoForIt I am currently trying out training with YellowFin vs regular RMSProp on the ImageNet dataset. However, it seems that YellowFin is not performing as well as RMSProp - have you faced a similar issue? One thing, though, is that the YellowFin learning rate is 1.0, while my RMSProp one starts from 0.1. In your case it is mentioned that 1.0 was found to work well, but if I had known a better learning rate from the start, could that learning rate (say, 0.1) be used for YellowFin to get better results?

For example, the light blue line is RMSProp and the dark blue line is YellowFin. I stopped the training after seeing that it wasn't performing as well as I expected, since training on ImageNet is very resource-intensive and time-consuming.

(screenshot from 2017-07-17 12-01-53)

JianGoForIt commented 7 years ago

Hi @kwotsin

Thanks for trying out the optimizer.

Regarding the learning rate, you can definitely change the initial learning rate. However, the effect of different initial learning rates diminishes after a few thousand iterations, although it can sometimes accelerate the early phase of training. Comparing the variance of RMSProp and YF in your plot above, I think you should probably increase the learning rate of YF a bit. I don't think there is a direct correspondence between the learning rates of RMSProp and YF (which is SGD+momentum in nature), but if the 0.1 comes from vanilla SGD or SGD with momentum, it might be a good idea to try 0.1 as the initial learning rate.
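
For instance, a minimal sketch of that change (the constructor form YFOptimizer(learning_rate=..., momentum=...) is assumed from the README, and the toy loss is only there to make the snippet self-contained):

```python
import tensorflow as tf
from yellowfin import YFOptimizer  # adjust the import path to your checkout

x = tf.Variable(5.0)
loss = tf.square(x)  # stand-in for your MobileNet training loss

# 0.1 borrowed from the SGD/momentum setting; YellowFin adapts it from there
opt = YFOptimizer(learning_rate=0.1, momentum=0.0)
train_op = opt.minimize(loss)
```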

Additionally, I think you probably want to use the lr_factor parameter to fine-tune the learning rate. It has an instant effect on the learning rate at each iteration. Here is an example use of lr_factor.

In your plot above, I think RMSProp is going to flatten out rather quickly after 25k, which might mean the RMSProp lr is not that good for the long term. Typically people use a smaller lr, so that training is slower in the beginning but reaches a lower loss in the long term.

Btw, just out of curiosity, what is the model you are using? Is it a ResNet-like model?

Could you try out the lr_factor and keep us updated?

kwotsin commented 7 years ago

@JianGoForIt Thanks for your reply. I'm testing on MobileNet, which I think is a much simpler model than ResNet.

So far on my end, it seems that RMSProp is still decreasing the loss past the 25k mark, although the decrease is indeed very slow - I guess this is characteristic of training on ImageNet. Also, for the lr_factor, what is a typical value you would suggest? I would use 0.94 every 2 epochs, but I'm not sure whether YellowFin handles learning rate decay differently.

Also, is there a way to integrate lr_factor with tf.train.exponential_decay? I think it might be cleaner that way, since you would only pass the learning rate into the optimizer and would not have to actively control the lr value within the training code.

JianGoForIt commented 7 years ago

@kwotsin Thanks for the effort.

Regarding the decay, I do not have a general recommendation; 0.94/epoch sounds good. You can use lr_factor to decay the learning rate.

Integrating with exponential_decay seems reasonable in some specific cases. We actually followed the style demonstrated in the TensorFlow examples, which control decay-related values with tf.assign and similar ops outside of the graph construction phase.
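
If you want the exponential_decay semantics specifically, one option is to compute the factor in Python and push it in with tf.assign, in that same style. This is only a rough sketch with the same assumed lr_factor variable as in the earlier sketch; the 0.94 per 2 epochs numbers mirror the schedule you mentioned, and the toy loss and step counts are placeholders.

```python
import tensorflow as tf
from yellowfin import YFOptimizer  # adjust the import path to your checkout

x = tf.Variable(5.0)
loss = tf.square(x)  # stand-in for the MobileNet training loss

opt = YFOptimizer(learning_rate=1.0, momentum=0.0)
train_op = opt.minimize(loss)

factor_ph = tf.placeholder(tf.float32, shape=[])
set_factor_op = tf.assign(opt.lr_factor, factor_ph)  # `lr_factor` name assumed

steps_per_epoch, total_steps = 1000, 50000           # placeholders for your setup
decay_rate, decay_steps = 0.94, 2 * steps_per_epoch  # 0.94 every 2 epochs
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(total_steps):
        sess.run(train_op)
        # staircase exponential decay, computed outside the graph
        sess.run(set_factor_op,
                 feed_dict={factor_ph: decay_rate ** (step // decay_steps)})
```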

JianGoForIt commented 7 years ago

Hi @kwotsin,

If you are working on MobileNet, there is an expert here @Zehaos https://github.com/JianGoForIt/YellowFin/issues/1

He has done experiments using MNIST. Maybe he can provide some help?

Cheers

kwotsin commented 7 years ago

@JianGoForIt Currently I've tried using YF with the same learning rate as RMSProp, just as an experiment, although the convergence seems to be slower. However, I suspect that is because I didn't train the model from scratch. I'll try to retrain the model from scratch with YF if there's more time (I'm a little tight on deadlines).

JianGoForIt commented 7 years ago

Hi @jinxin0924, we have now added multi-GPU support. Please check it out.

Cheers

jinxin0924 commented 7 years ago

Hi @JianGoForIt, sorry for the late response. I have tested the multi-GPU version; it now behaves like the single-GPU one.

Besides, these days I have read the paper "Asynchrony begets Momentum" and now understand how closed-loop YellowFin works. Have you implemented it in TensorFlow? Do you compare it with the sync algorithm in terms of (time, loss) rather than (iterations, loss)? I think async is usually much faster than sync, so comparing them in time may be more reasonable.

Cheers