Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Learning Rate finder too strong loss smoothing #14167

Open hcgasser opened 2 years ago

hcgasser commented 2 years ago

Discussed in https://github.com/Lightning-AI/lightning/discussions/13404

Originally posted by **hcgasser** June 24, 2022

The learning rate finder slowly increases the learning rate during its search and records how the loss reacts to it. My understanding is that, in theory, the loss should stay roughly constant at first and then decrease, before a learning rate that is too high leads to divergence. However, in the callback method `_LRCallback.on_batch_end`, a smoothed loss is calculated (link below). The problem, in my opinion, is that the smoothing starts with an initial `self.avg_loss` of zero. This leads to the counterintuitive behavior that the smoothed loss at first increases with the learning rate. If the number of tested learning rates is low, this can affect a wide range of learning rate values, in particular because the default `beta` is set very high (high weight on past values). I think `self.avg_loss` should be initialized to the first un-smoothed loss value instead of zero. What do you think? Thank you for looking into this.

https://github.com/Lightning-AI/lightning/blob/b84b02400a312240a6429c186cc63514eeb45a82/pytorch_lightning/trainer/lr_finder.py#L374
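To make the effect concrete, here is a minimal, self-contained sketch (not the Lightning code itself; `beta` and the loss values are illustrative) that contrasts the two initializations of the running average. With `avg_loss` starting at zero, an exponential moving average of a perfectly flat loss ramps up from near zero toward the true value, which is the spurious increase described above; seeding it with the first observed loss keeps the curve flat.

```python
beta = 0.98
raw_losses = [2.3] * 20  # a perfectly flat loss, for clarity


def smooth(losses, beta, init):
    """Plain exponential moving average starting from `init`."""
    avg = init
    out = []
    for loss in losses:
        avg = beta * avg + (1 - beta) * loss
        out.append(avg)
    return out


zero_init = smooth(raw_losses, beta, init=0.0)                   # climbs from ~0 toward 2.3
first_loss_init = smooth(raw_losses, beta, init=raw_losses[0])   # stays flat at 2.3

print([round(x, 3) for x in zero_init[:3]])        # [0.046, 0.091, 0.135]
print([round(x, 3) for x in first_loss_init[:3]])  # [2.3, 2.3, 2.3]
```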

cc @borda @akihironitta @rohitgr7

awaelchli commented 2 years ago

@hcgasser I think your argument makes sense. From a quick look at the code, it looks like the smoothed loss is only used for the early-stopping condition, so this should only affect the decision of where to stop. Is that correct?

Have you tried making changes to the code to start with the initial loss instead of 0?

hcgasser commented 1 year ago

Thanks @awaelchli for your quick response, and sorry for taking so long to answer - I was drowning in things to do.

The `smoothed_loss` variable actually goes into the `losses` list: https://github.com/Lightning-AI/lightning/blob/a5b0f8bd5cd28fbd79fdafa5d9380b00258d7a76/src/pytorch_lightning/tuner/lr_finder.py#L379

which is then read out and stored in the `_LRFinder` by the `lr_find` method: https://github.com/Lightning-AI/lightning/blob/a5b0f8bd5cd28fbd79fdafa5d9380b00258d7a76/src/pytorch_lightning/tuner/lr_finder.py#L251

and then used by the `_LRFinder` to suggest the optimal learning rate: https://github.com/Lightning-AI/lightning/blob/a5b0f8bd5cd28fbd79fdafa5d9380b00258d7a76/src/pytorch_lightning/tuner/lr_finder.py#L201

I have tried the following changes:

  1. in the `__init__` of `_LRCallback`: set `self.avg_loss = None`
  2. in the `on_train_batch_end` of `_LRCallback`: set `self.avg_loss = self.beta * self.avg_loss + (1 - self.beta) * current_loss if self.avg_loss is not None else current_loss`

The result I found was that the loss curve used for selecting the optimal learning rate is then very strongly influenced by the first observed `smoothed_loss`, which is very high given that the network is only seeing its first batch. This compares to being very strongly influenced by zero as things stand now. Possible ways to deal with that might be a lower beta or a warm-up period. What do you think?
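For reference, a condensed, self-contained sketch of the modification described in the two points above (the class and hook names mirror `_LRCallback`, but this is an illustration rather than a patch against `lr_finder.py`):

```python
class _LRCallbackSketch:
    """Illustrative stand-in for the relevant parts of `_LRCallback`."""

    def __init__(self, beta: float = 0.98):
        self.beta = beta
        self.avg_loss = None  # proposed: start undefined instead of 0.0
        self.losses = []

    def on_train_batch_end(self, current_loss: float) -> None:
        # Proposed update: seed the running average with the first observed
        # loss, then apply the usual exponential moving average afterwards.
        self.avg_loss = (
            self.beta * self.avg_loss + (1 - self.beta) * current_loss
            if self.avg_loss is not None
            else current_loss
        )
        self.losses.append(self.avg_loss)
```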

awaelchli commented 1 year ago

@hcgasser Yes, I think that's a valid observation. How about we ignore the first N steps in the selection among smoothed values (I think you called it warmup above)? We could choose N=1 or similar as the default value.
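For illustration, such a warm-up could look roughly like the following in the suggestion step (a hypothetical helper, not Lightning's actual `_LRFinder.suggestion` implementation; the function and parameter names are assumptions):

```python
import numpy as np


def suggest_lr(lrs, smoothed_losses, skip_begin: int = 1, skip_end: int = 1):
    """Pick the learning rate at the steepest loss descent, ignoring the
    first `skip_begin` and last `skip_end` points (the warm-up idea above)."""
    losses = np.asarray(smoothed_losses)[skip_begin:-skip_end]
    lrs = np.asarray(lrs)[skip_begin:-skip_end]
    # Index of the steepest negative slope of the (smoothed) loss curve.
    idx = int(np.argmin(np.gradient(losses)))
    return lrs[idx]
```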

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, PyTorch Lightning Team!

patrontheo commented 2 weeks ago

Any news about this issue @awaelchli ?