Luolc / AdaBound

An optimizer that trains as fast as Adam and as good as SGD.
https://www.luolc.com/publications/adabound/
Apache License 2.0

What is up with Epoch 150 #7

Closed kootenpv closed 5 years ago

kootenpv commented 5 years ago

I'm wondering what is happening at epoch 150 in all visualizations? I would like to introduce that into all my models ;-)

https://github.com/Luolc/AdaBound/blob/master/demos/cifar10/visualization.ipynb

Luolc commented 5 years ago

As stated in the notebook: we employ a fixed budget of 200 epochs and reduce the learning rates by a factor of 10 after 150 epochs.
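
For reference, that kind of step schedule is commonly written with PyTorch's `MultiStepLR`. Below is a minimal sketch, not the demo's actual code; the tiny linear model and the empty training step are placeholders:

```python
# Minimal sketch of "reduce the lr by a factor of 10 after epoch 150" over a 200-epoch budget.
# The tiny Linear model and the empty training step are placeholders, not the demo's code.
import torch
from torch.optim.lr_scheduler import MultiStepLR

model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = MultiStepLR(optimizer, milestones=[150], gamma=0.1)  # lr *= 0.1 once epoch 150 is reached

for epoch in range(200):
    # ... one full pass over the training set goes here ...
    scheduler.step()  # stepped once per epoch: lr stays 0.1 up to epoch 149, then drops to 0.01
```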

kootenpv commented 5 years ago

That doesn't explain to me how come ALL models make these incredibly huge improvements in a single epoch. To be honest... it just looks wrong to me.

Luolc commented 5 years ago

Well, no offense, but I think this is more a matter of basic knowledge in the field of machine learning, and we don't really need a discussion here.

You may refer to this video by Andrew Ng to gain quick insight into lr decay. Or just search for learning rate decay on Google; there are already many great posts introducing this technique.

It is widely used in many machine learning papers/projects nowadays.

kootenpv commented 5 years ago

@Luolc I am aware of learning rate decay. This is why I find it extremely weird that in all approaches you get a huge improvement at exactly epoch 150.

This seems to me to indicate a bad initial learning rate (converging to a local optimum)?

Usually, when a huge improvement is suddenly made, it indicates that the optimization before it was perhaps useless, or don't you agree?

I just wanted to warn you that it seems very odd to have such a huge jump relatively late in optimization and I was hoping there was an explanation for it other than a bad initial learning rate.

Thanks.

Luolc commented 5 years ago

Ok I get what you mean.

Regarding the initial lr: for each optimizer, we conducted a grid search to find the best hyperparameters. For each independent setting, we tested 3~5 times. Indeed, hundreds of runs were done before we arrived at the final visualization. I am sure that we've already set the best lr we could ever find (at least the best in the grid). More details can be found in the experiment section of the paper. As mentioned in the demo, the training code is heavily based on this widely used code base for testing deep CNNs on CIFAR-10. As our best result for SGD is even higher than the one reported in the original repo (by ~0.4%), I think we did achieve a successful training.
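
For concreteness, a grid search of that kind looks roughly like the sketch below; the grid values and the `train_and_eval` helper are hypothetical, not the paper's actual tuning script:

```python
# Illustrative grid search over initial learning rates (hypothetical values and helper,
# not the actual tuning script used for the paper).
import itertools

lr_grid = [1e-3, 1e-2, 1e-1, 1.0]     # hypothetical grid; the paper lists the real one
momentum_grid = [0.9]
repeats = 3                           # each setting was tested 3~5 times

best_acc, best_cfg = 0.0, None
for lr, momentum in itertools.product(lr_grid, momentum_grid):
    # train_and_eval is a hypothetical helper that trains once and returns test accuracy
    accs = [train_and_eval(lr=lr, momentum=momentum, seed=s) for s in range(repeats)]
    mean_acc = sum(accs) / len(accs)
    if mean_acc > best_acc:
        best_acc, best_cfg = mean_acc, (lr, momentum)
```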

> Usually, when a huge improvement is suddenly made, it indicates that the optimization before it was perhaps useless, or don't you agree?

I don't think it's appropriate to frame it as useful or useless. If you refer to Figure 6(a) in the paper, the learning curves of SGD with other initial lr values are much worse than what we see in the notebook. So could we say it is the least useless one we could find?

There might be a better decay strategy, such as making the decay happen earlier. I totally agree, but that's not what we are concerned about here. What we need is to guarantee that the same decay strategy is applied to all the optimizers so the comparison is fair, rather than to find the best decay strategy.
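
In code, that fairness constraint just means every optimizer is wrapped in the identical scheduler. A sketch, with placeholder hyperparameters rather than the tuned values from the paper:

```python
# Sketch: all optimizers share the identical decay schedule, so the comparison differs
# only in the update rule. Hyperparameters are placeholders, not the paper's tuned values.
import torch
import adabound  # pip install adabound
from torch.optim.lr_scheduler import MultiStepLR

def make_model():
    return torch.nn.Linear(10, 10)  # stand-in for the CIFAR-10 CNN; each optimizer trains its own copy

optimizers = {
    "SGD": torch.optim.SGD(make_model().parameters(), lr=0.1, momentum=0.9),
    "Adam": torch.optim.Adam(make_model().parameters(), lr=1e-3),
    "AdaBound": adabound.AdaBound(make_model().parameters(), lr=1e-3, final_lr=0.1),
}
# Same milestones and gamma for every optimizer: only the update rule differs.
schedulers = {name: MultiStepLR(opt, milestones=[150], gamma=0.1) for name, opt in optimizers.items()}
```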

Finally, I don't think it is a huge jump or odd behavior. I've seen many similar figures in plenty of papers, for example in SWATS.

siaimes commented 5 years ago

@kootenpv Perhaps you do not understand that the model parameters are updated many times in one epoch.
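
For a rough sense of scale, assuming the common CIFAR-10 batch size of 128 (a typical choice, not necessarily the demo's exact setting):

```python
# Back-of-the-envelope count of parameter updates per epoch on CIFAR-10,
# assuming a batch size of 128 (a typical choice, not necessarily the demo's).
train_set_size = 50_000                                # CIFAR-10 training images
batch_size = 128
updates_per_epoch = -(-train_set_size // batch_size)   # ceiling division -> 391
print(updates_per_epoch)  # so the decay at epoch 150 comes after roughly 150 * 391 parameter updates
```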

kootenpv commented 5 years ago

@siaimes Obviously I understand that, but why would there be such a steep change at exactly the 150th epoch? It looks to me like something is just wrong (bad parameters before the 150th). It does make more sense with @Luolc's explanation that these settings turn out to be "less than optimal" for this particular dataset.

Luolc commented 5 years ago

@kootenpv I've recently done some more toy experiments on CIFAR-10 and have gained deeper insights now.

FYI: we can apply the lr decay earlier, at ~epoch 75, and achieve similar results after ~epoch 100.

Decaying at epoch 150 is not the best setting considering the time cost, but it does not affect the final results. Since the purpose of the paper is not to find SoTA, it's OK as long as the different optimizers are compared fairly.

P.S. I'd like to close this issue if there are no further doubts.