Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Support for Scheduler taking a value to the step function #2849

Closed Alicegaz closed 3 years ago

Alicegaz commented 4 years ago

Looking at the code in training_loop.py, it seems that the only scheduler that can receive a value in its step function is ReduceLROnPlateau. However, there is the CosineAnnealingWarmRestarts scheduler, as well as custom schedulers, which can take an epoch/step id or any other value in the step function.

https://github.com/PyTorchLightning/pytorch-lightning/blob/9ab071588bbe0e24441ae0f07a271bded0e4d9c3/pytorch_lightning/trainer/training_loop.py#L1148
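For context, a minimal illustration of the two step() conventions being contrasted (plain PyTorch, outside Lightning):

import torch
from torch import optim

model = torch.nn.Linear(4, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)

# ReduceLROnPlateau's step() expects a monitored metric:
plateau = optim.lr_scheduler.ReduceLROnPlateau(optimizer)
plateau.step(0.42)  # e.g. a validation loss

# CosineAnnealingWarmRestarts' step() accepts a (fractional) epoch:
cosine = optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10)
cosine.step(1.5)  # epoch + batch_idx / num_batches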

github-actions[bot] commented 4 years ago

Hi! thanks for your contribution!, great first issue!

awaelchli commented 4 years ago

@ananyahjha93 and I discussed this and we think that this change is not necessary, as it does not align with the PyTorch interface of the _LRScheduler base class. Passing the epoch to step is already deprecated, and _LRScheduler does not take metrics in step. The only exceptions are ReduceLROnPlateau and custom schedulers implemented by users. PL supports the former directly, and for custom ones I recommend that the user simply override the optimizer_step hook in the LightningModule and pass the metric into the scheduler there.
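A minimal sketch of that workaround, assuming a recent LightningModule API (the exact optimizer_step signature has changed across Lightning versions, so treat the hook arguments as illustrative); the scheduler is kept on the module rather than registered with the Trainer so we fully control what is passed to step():

import torch
import pytorch_lightning as pl
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(4, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.net(x), y)

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(), lr=0.1)
        # Not returned to Lightning, so Lightning never calls step() on it.
        self.custom_scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10)
        return optimizer

    def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_closure=None):
        # Run the usual optimizer update first.
        optimizer.step(closure=optimizer_closure)
        # Then pass whatever value the custom scheduler expects, here the
        # fractional epoch used by CosineAnnealingWarmRestarts.
        self.custom_scheduler.step(epoch + batch_idx / self.trainer.num_training_batches)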

awaelchli commented 4 years ago

@PyTorchLightning/core-contributors

YannDubs commented 4 years ago

I think it would be a mistake / a loss of generality not to let the user decide when a metric should be given to the scheduler. You only need to change one line of code to let the user decide for themselves:

Currently: https://github.com/PyTorchLightning/pytorch-lightning/blob/071e09fe38d6170d8b019101a3c1871aeb1b2e99/pytorch_lightning/trainer/optimizers.py#L97

def configure_schedulers(self, schedulers: list):
        # Convert each scheduler into dict structure with relevant information
        lr_schedulers = []
        default_config = {'interval': 'epoch',  # default every epoch
                          'frequency': 1,  # default every epoch/batch
                          'reduce_on_plateau': False,  # most often not ReduceLROnPlateau scheduler
                          'monitor': 'val_loss'}  # default value to monitor for ReduceLROnPlateau
        for scheduler in schedulers:
            if isinstance(scheduler, dict):
                if 'scheduler' not in scheduler:
                    raise ValueError('Lr scheduler should have key `scheduler`',
                                     ' with item being a lr scheduler')
                scheduler['reduce_on_plateau'] = isinstance(
                    scheduler['scheduler'], optim.lr_scheduler.ReduceLROnPlateau)

A one-line change lets the user decide (don't overwrite "reduce_on_plateau", so that the user can set it themselves to ensure a metric is given to step()):

[...]
            if isinstance(scheduler, dict):
                if 'scheduler' not in scheduler:
                    raise ValueError('...')
                if "reduce_on_plateau" not in scheduler:
                    scheduler['reduce_on_plateau'] = isinstance(
                        scheduler['scheduler'], optim.lr_scheduler.ReduceLROnPlateau)
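With that change, a user could opt in from configure_optimizers, e.g. (hypothetical usage under the proposal; MetricAwareScheduler stands for any custom scheduler whose step() takes the monitored metric):

import torch
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    ...
    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(), lr=0.1)
        # MetricAwareScheduler is a hypothetical custom scheduler whose
        # step() takes the monitored metric.
        scheduler = MetricAwareScheduler(optimizer)
        return [optimizer], [{
            'scheduler': scheduler,
            'reduce_on_plateau': True,  # set by the user, no longer overwritten
            'monitor': 'my_metric',
            'interval': 'epoch',
            'frequency': 1,
        }]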

Of course it would then make sense to rename 'reduce_on_plateau' to something like "access_metric".

Advantages:

ananyahjha93 commented 4 years ago

@YannDubs this is actually a good solution. Could you submit a PR with this (rename reduce_on_plateau to access_metric), make the change in the place you mentioned above, and also update https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pytorch_lightning/trainer/training_loop.py#L1252 where scheduler.step() is called?

Also, add proper documentation for the changes in the PR as they are going to be significant.

awaelchli commented 4 years ago

+1, good suggestion

Alicegaz commented 4 years ago

@YannDubs, @ananyahjha93, @awaelchli regarding your concerns: the presence of a 'monitor' value in a scheduler dictionary already carries the meaning of passing a value to the step function, so there is no mistake. Why should we use another key, 'access_metric'? It would add redundancy, since 'monitor' would only be referenced inside the check for 'access_metric' and is not used anywhere else outside that clause.

Also, YannDubs' solution does not resolve the problem discussed in this issue. Unlike the suggestion in my PR #2851, with this change we would still not be able to do something like this:

iters = len(dataloader)
for epoch in range(20):
    for i, sample in enumerate(dataloader):
        inputs, labels = sample['inputs'], sample['labels']
        optimizer.zero_grad()
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        scheduler.step(epoch + i / iters)

Check out the official PyTorch code for CosineAnnealingWarmRestarts. I suggest reopening PR #2851! The only thing we could do to make that PR clearer is to add a check that registered schedulers only receive supported value types in step() ("ReduceLROnPlateau": "val_*", 'CosineAnnealingWarmRestarts': ['epoch', 'iter_frac'], ...) and to raise an exception for custom schedulers when they are given unsupported values.
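A hypothetical sketch of what such a check could look like (the mapping and names here are illustrative only, not part of the PR):

from torch.optim import lr_scheduler

# Which kinds of values each known scheduler's step() may receive.
SUPPORTED_STEP_ARGS = {
    lr_scheduler.ReduceLROnPlateau: {'monitor'},
    lr_scheduler.CosineAnnealingWarmRestarts: {'epoch', 'iter_frac'},
}

def validate_step_arg(scheduler, arg_kind):
    allowed = SUPPORTED_STEP_ARGS.get(type(scheduler))
    if allowed is None:
        raise ValueError(
            f'{type(scheduler).__name__} is not registered; custom schedulers '
            'must declare which values their step() accepts.')
    if arg_kind not in allowed:
        raise ValueError(
            f'{type(scheduler).__name__}.step() does not accept {arg_kind!r}; '
            f'supported: {sorted(allowed)}')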

williamFalcon commented 4 years ago

@ananyahjha93 let's look at both solutions and pick a general one.

I’d prefer to let the user have the flexibility on this

YannDubs commented 4 years ago

@Alicegaz using 'access_metric' allows you to keep a meaningful default for 'monitor' (i.e. val_loss), but if the number of keys in the dict is a concern, I agree it can be dropped.

For your second point, I agree my method doesn't help (I'm not sure how important that is given that PyTorch is removing those arguments, but it might be!). I was trying to say that even though PyTorch is deprecating arguments to the step function, PL should at least enable other schedulers to depend on the (val) metric, which is easily done with essentially no change to the code.

ananyahjha93 commented 4 years ago

@Alicegaz yes, the idea was to use only access_metric rather than have it duplicate monitor, since access_metric is more general than the specific case of ReduceLROnPlateau. But for the second case, PyTorch is deprecating passing epochs to the step() function, so I am not sure we should keep supporting that; it just requires users to add the extra logic themselves if they inherit from the _LRScheduler class.

But I would also agree that custom schedulers, not specifically epoch-based stepping, should have support for passing arguments to their step function.
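For illustration, a toy custom scheduler of the kind being discussed, whose step() consumes a metric (a sketch only; it deliberately skips inheriting _LRScheduler to stay short):

from torch import nn, optim

class HalveOnMetricRise:
    """Toy scheduler: halve the learning rate whenever the monitored metric worsens."""

    def __init__(self, optimizer):
        self.optimizer = optimizer
        self.prev_metric = float('inf')

    def step(self, metric):
        if metric > self.prev_metric:
            for group in self.optimizer.param_groups:
                group['lr'] *= 0.5
        self.prev_metric = metric

# Plain-PyTorch usage:
model = nn.Linear(4, 2)
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = HalveOnMetricRise(optimizer)
scheduler.step(0.9)  # first metric recorded, lr unchanged
scheduler.step(1.2)  # metric worsened -> lr halved to 0.05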

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!