Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Support for Scheduler taking a value to the step function #2849

Closed Alicegaz closed 3 years ago

Alicegaz commented 4 years ago

Looking at the code in training_loop.py, it seems that the only scheduler that can receive a value in its step function is ReduceLROnPlateau. However, there is the CosineAnnealingWarmRestarts scheduler, as well as custom schedulers, which can take an epoch/step id or any other value in the step function.

https://github.com/PyTorchLightning/pytorch-lightning/blob/9ab071588bbe0e24441ae0f07a271bded0e4d9c3/pytorch_lightning/trainer/training_loop.py#L1148
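For context, a minimal illustration of the two step() conventions being contrasted (plain PyTorch, outside Lightning):

import torch
from torch import optim

model = torch.nn.Linear(4, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# ReduceLROnPlateau's step() expects a monitored metric:
plateau = optim.lr_scheduler.ReduceLROnPlateau(optimizer)
plateau.step(0.42)  # e.g. a validation loss

# CosineAnnealingWarmRestarts' step() accepts a (fractional) epoch:
cosine = optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10)
cosine.step(1.5)  # epoch + batch_idx / num_batches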

github-actions[bot] commented 4 years ago

Hi! thanks for your contribution!, great first issue!

awaelchli commented 4 years ago

@ananyahjha93 and I discussed this and we think that this change is not necessary, as it does not align with the PyTorch interface of the _LRScheduler base class. Passing the epoch to step is already deprecated, and _LRScheduler does not take metrics in step. The only exceptions are ReduceLROnPlateau and custom schedulers implemented by users. PL supports the former directly, and for custom ones I recommend that the user simply override the optimizer_step hook in the LightningModule and pass the metric into the scheduler there.
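A minimal sketch of that workaround, assuming a recent LightningModule API (the exact optimizer_step signature has changed across Lightning versions, so treat the hook arguments as illustrative); the scheduler is kept on the module rather than registered with the Trainer so we fully control what is passed to step():

import torch
import pytorch_lightning as pl
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(4, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.net(x), y)

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(), lr=0.1)
        # Not returned to Lightning, so Lightning never calls step() on it.
        self.custom_scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10)
        return optimizer

    def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_closure=None):
        # Run the usual optimizer update first.
        optimizer.step(closure=optimizer_closure)
        # Then pass whatever value the custom scheduler expects, here the
        # fractional epoch used by CosineAnnealingWarmRestarts.
        self.custom_scheduler.step(epoch + batch_idx / self.trainer.num_training_batches)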

awaelchli commented 4 years ago

@PyTorchLightning/core-contributors

YannDubs commented 4 years ago

I think it would be a mistake / a loss of generality not to let the user decide when a metric should be given to the scheduler. You only need to change one line of code to let the user decide for themselves:

Currently: https://github.com/PyTorchLightning/pytorch-lightning/blob/071e09fe38d6170d8b019101a3c1871aeb1b2e99/pytorch_lightning/trainer/optimizers.py#L97

def configure_schedulers(self, schedulers: list):
        # Convert each scheduler into dict structure with relevant information
        lr_schedulers = []
        default_config = {'interval': 'epoch',  # default every epoch
                          'frequency': 1,  # default every epoch/batch
                          'reduce_on_plateau': False,  # most often not ReduceLROnPlateau scheduler
                          'monitor': 'val_loss'}  # default value to monitor for ReduceLROnPlateau
        for scheduler in schedulers:
            if isinstance(scheduler, dict):
                if 'scheduler' not in scheduler:
                    raise ValueError('Lr scheduler should have key `scheduler`',
                                     ' with item being a lr scheduler')
                scheduler['reduce_on_plateau'] = isinstance(
                    scheduler['scheduler'], optim.lr_scheduler.ReduceLROnPlateau)

A one-line change lets the user decide (don't overwrite "reduce_on_plateau", so that the user can set it themselves to ensure a metric is given to step()):

[...]
            if isinstance(scheduler, dict):
                if 'scheduler' not in scheduler:
                    raise ValueError('...')
                if "reduce_on_plateau" not in scheduler:
                    scheduler['reduce_on_plateau'] = isinstance(
                        scheduler['scheduler'], optim.lr_scheduler.ReduceLROnPlateau)
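With that change, a user could opt in from configure_optimizers, e.g. (hypothetical usage under the proposal; MetricAwareScheduler stands for any custom scheduler whose step() takes the monitored metric):

import torch
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    ...
    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(), lr=0.1)
        # MetricAwareScheduler is a hypothetical custom scheduler whose
        # step() takes the monitored metric.
        scheduler = MetricAwareScheduler(optimizer)
        return [optimizer], [{
            'scheduler': scheduler,
            'reduce_on_plateau': True,  # set by the user, no longer overwritten
            'monitor': 'my_metric',
            'interval': 'epoch',
            'frequency': 1,
        }]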

Of course it would then make sense to rename 'reduce_on_plateau' to something like "access_metric".

Advantages:

ananyahjha93 commented 4 years ago

@YannDubs this is actually a good solution. Could you submit a PR with this (rename reduce_on_plateau to access_metric), make the change in the place you mentioned above, and also update https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/training_loop.py#L1252 where scheduler.step() is called?

Also, add proper documentation for the changes in the PR as they are going to be significant.

awaelchli commented 4 years ago

+1, good suggestion

Alicegaz commented 4 years ago

@YannDubs, @ananyahjha93, @awaelchli regarding your concerns: the presence of a 'monitor' value in a scheduler dictionary already carries the meaning of passing a value to the step function, so there is no mistake. Why should we use another key, 'access_metric'? It would add redundancy, since 'monitor' would only be referenced inside the check for 'access_metric' and is not used anywhere else outside that clause.

Also, YannDubs' solution does not resolve the problem discussed in this issue. Unlike the suggestion in my PR #2851, with this change we would still not be able to do something like this:

iters = len(dataloader)
for epoch in range(20):
    for i, sample in enumerate(dataloader):
        inputs, labels = sample['inputs'], sample['labels']
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        scheduler.step(epoch + i / iters)

Check out the official PyTorch code for CosineAnnealingWarmRestarts. I suggest reopening PR #2851! The only thing we could do to make that PR clearer is to add a check that registered schedulers only receive supported value types in step() ("ReduceLROnPlateau": "val_*", 'CosineAnnealingWarmRestarts': ['epoch', 'iter_frac'], ...) and to raise an exception for custom schedulers when they are given unsupported values.
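A hypothetical sketch of what such a check could look like (the mapping and names here are illustrative only, not part of the PR):

from torch.optim import lr_scheduler

# Which kinds of values each known scheduler's step() may receive.
SUPPORTED_STEP_ARGS = {
    lr_scheduler.ReduceLROnPlateau: {'monitor'},
    lr_scheduler.CosineAnnealingWarmRestarts: {'epoch', 'iter_frac'},
}

def validate_step_arg(scheduler, arg_kind):
    allowed = SUPPORTED_STEP_ARGS.get(type(scheduler))
    if allowed is None:
        raise ValueError(
            f'{type(scheduler).__name__} is not registered; custom schedulers '
            'must declare which values their step() accepts.')
    if arg_kind not in allowed:
        raise ValueError(
            f'{type(scheduler).__name__}.step() does not accept {arg_kind!r}; '
            f'supported: {sorted(allowed)}')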

williamFalcon commented 4 years ago

@ananyahjha93 let's look at both solutions and pick a general one.

I’d prefer to let the user have the flexibility on this

YannDubs commented 4 years ago

@Alicegaz using 'access_metric' allows you to keep a meaningful default for 'monitor' (i.e. val_loss), but if the number of keys in the dict is a concern, I agree it can be dropped.

For your second point, I agree my method doesn't help (I'm not sure how important that is given that PyTorch is removing those arguments, but it might be!). I was trying to say that even though PyTorch is deprecating arguments to the step function, PL should at least enable other schedulers to depend on the (val) metric, which is easily done with essentially no change to the code.

ananyahjha93 commented 4 years ago

@Alicegaz yes, the idea was to use only access_metric rather than have it duplicate monitor, since access_metric is more general than the specific case of ReduceLROnPlateau. But for the second case, PyTorch is deprecating passing epochs to the step() function, so I am not sure we should keep supporting that; it just requires users to add the extra logic themselves if they inherit from the _LRScheduler class.

But I would also agree that custom schedulers, not specifically epoch-based stepping, should have support for passing arguments to their step function.
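For illustration, a toy custom scheduler of the kind being discussed, whose step() consumes a metric (a sketch only; it deliberately skips inheriting _LRScheduler to stay short):

from torch import nn, optim

class HalveOnMetricRise:
    """Toy scheduler: halve the learning rate whenever the monitored metric worsens."""

    def __init__(self, optimizer):
        self.optimizer = optimizer
        self.prev_metric = float('inf')

    def step(self, metric):
        if metric > self.prev_metric:
            for group in self.optimizer.param_groups:
                group['lr'] *= 0.5
        self.prev_metric = metric

# Plain-PyTorch usage:
model = nn.Linear(4, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = HalveOnMetricRise(optimizer)
scheduler.step(0.9)  # first metric recorded, lr unchanged
scheduler.step(1.2)  # metric worsened -> lr halved to 0.05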

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!