Luolc / AdaBound

An optimizer that trains as fast as Adam and as good as SGD.
https://www.luolc.com/publications/adabound/
Apache License 2.0

lr_scheduler affects the actual learning rate #9

Open tsaizehua opened 5 years ago

tsaizehua commented 5 years ago
    # Applies bounds on actual learning rate
    # lr_scheduler cannot affect final_lr, this is a workaround to apply lr decay
    final_lr = group['final_lr'] * group['lr'] / base_lr

However, lr_scheduler may change param_group['lr'] during training, so final_lr, lower_bound, and upper_bound will also be affected.
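To make this concrete, here is a rough sketch of how a scheduler-modified group['lr'] propagates into the clipping bounds (this paraphrases the bound computation around the snippet above; the function name and exact formulas are my approximation, not the verbatim source):

    import torch

    # Rough paraphrase of how AdaBound derives its bounds inside step().
    # When lr_scheduler rescales group['lr'], final_lr -- and therefore both
    # bounds -- is rescaled by the same factor.
    def bounded_step_size(step_size, denom, group, base_lr, step):
        final_lr = group['final_lr'] * group['lr'] / base_lr
        lower_bound = final_lr * (1 - 1 / (group['gamma'] * step + 1))
        upper_bound = final_lr * (1 + 1 / (group['gamma'] * step))
        # Clip the element-wise Adam step size into [lower_bound, upper_bound].
        return torch.clamp(step_size / denom, lower_bound, upper_bound)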

Should I not use lr_scheduler and instead let AdaBound adapt the parameters to transform from Adam to SGD on its own?

Thank you very much!

Luolc commented 5 years ago

It's a feature. The bounds should be affected by the lr scheduler as well. You may refer to the CIFAR-10 demo for an example of using AdaBound together with an lr scheduler.

Update: Some intuitive reasons for this design: lr decay is an independent operation that we may apply with any optimizer. We shouldn't let a specific optimizer override the behavior of lr_scheduler.
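For illustration, here is a minimal sketch of pairing AdaBound with a standard scheduler (the model, schedule, and hyperparameters below are placeholders, not necessarily those of the CIFAR-10 demo):

    import torch
    import adabound

    # Placeholder model and hyperparameters.
    model = torch.nn.Linear(10, 2)
    optimizer = adabound.AdaBound(model.parameters(), lr=1e-3, final_lr=0.1)
    # The scheduler rescales group['lr']; AdaBound rescales its bounds with it.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=150, gamma=0.1)

    for epoch in range(200):
        # ... train one epoch with `optimizer` ...
        scheduler.step()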

tsaizehua commented 5 years ago

Thank you for the reply.

I have tried comparing Adam with AdaBound on an LSTM language model. I found that AdaBound indeed makes the learning process more stable; however, the learning speed and convergence rate are slower than Adam's. The initial params are shown in the screenshot. Should I try a higher lr (currently 0.008) next, or do you have any better suggestion? Thank you.

P.S. I also used lr_scheduler, and call scheduler.step(valid_loss) every 1/5 epoch.

    import torch as t

    # `optimizer` is the optimizer instance created earlier
    scheduler = t.optim.lr_scheduler.ReduceLROnPlateau(optimizer,
                                                       mode='min',
                                                       factor=0.5,
                                                       patience=3,
                                                       verbose=True)
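For completeness, here is a self-contained toy version of that cadence (the model, data, and loss are dummies standing in for my actual setup; only the scheduler interaction matters):

    import torch as t
    import adabound

    # Dummy model/data; only the 1/5-epoch scheduler cadence is the point.
    model = t.nn.Linear(8, 1)
    optimizer = adabound.AdaBound(model.parameters(), lr=0.008, final_lr=0.1)
    scheduler = t.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min',
                                                       factor=0.5, patience=3)

    for epoch in range(3):
        for part in range(5):                          # 5 validation checks per epoch
            x, y = t.randn(16, 8), t.randn(16, 1)
            loss = t.nn.functional.mse_loss(model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            valid_loss = loss.item()                   # dummy "validation" loss
            scheduler.step(valid_loss)                 # may halve lr, shrinking the bounds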

[Screenshots: training curves for AdaBound and Adam]

Luolc commented 5 years ago

however, the learning speed and convergence rate are slower than Adam's.

Indeed, there's no guarantee that AdaBound would be faster than Adam, and we never claimed that. The only thing I can say for sure is that, theoretically, AdaBound should be robustly faster than SGD with similar settings.

Regarding your specific situation, maybe you can try a lower transformation speed from Adam to SGD, viz. lower gamma value.
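Something like the following (the model and other values are placeholders; if I remember correctly the default gamma is 1e-3, so this lowers it by an order of magnitude):

    import torch
    import adabound

    # Placeholder model/values; the point is only the smaller gamma, which
    # slows the transition from Adam-like to SGD-like behaviour.
    model = torch.nn.Linear(10, 2)
    optimizer = adabound.AdaBound(model.parameters(), lr=0.008,
                                  final_lr=0.1, gamma=1e-4)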

Personally, I would regard this optimizer as more like SGD than Adam. Its strength is that it can quickly achieve a relatively small loss, after which we may fine-tune with plain SGD at the final stage. As we know, tuning SGD is not that easy, so we shouldn't expect a perfect result to come that easily either.
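As a rough sketch of the hand-off to SGD mentioned above (the epoch split and SGD hyperparameters are illustrative assumptions, not a recommendation):

    import torch
    import adabound

    # Illustrative only: train with AdaBound first, then hand over to plain
    # SGD for a final fine-tuning stage. The 150/50 split and lr are assumptions.
    model = torch.nn.Linear(10, 2)
    optimizer = adabound.AdaBound(model.parameters(), lr=1e-3, final_lr=0.1)

    for epoch in range(200):
        if epoch == 150:
            optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
        # ... train one epoch with `optimizer` ...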

Lastly, we did find that SGD performs worse than Adam in some NLP tasks, and we are still investigating why. In this case, I'm afraid AdaBound may not outperform Adam. :(