Lightning-AI / pytorch-lightning


KFold cross-validation example: optimizer and scheduler are not reset across folds #12409

Closed: carlobretti closed this issue 1 year ago

carlobretti commented 2 years ago

🐛 Bug

When self.trainer.strategy.setup_optimizers(self.trainer) is called at the end of each fold in on_advance_end, the value of self.trainer.state.fn is still TrainerFn.TESTING.

https://github.com/PyTorchLightning/pytorch-lightning/blob/5d156f4ff62e02a94b0ffe497472e7035149e9a4/pl_examples/loop_examples/kfold.py#L201-L207

Because self.trainer.state.fn is still TrainerFn.TESTING, the early-return check in setup_optimizers() means _init_optimizers_and_lr_schedulers is never called, so the optimizer and scheduler are never reset across folds.

https://github.com/PyTorchLightning/pytorch-lightning/blob/5d156f4ff62e02a94b0ffe497472e7035149e9a4/pytorch_lightning/strategies/strategy.py#L121-L131
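
For reference, the guard inside Strategy.setup_optimizers looks roughly like this (paraphrased from the linked lines, not an exact copy):

def setup_optimizers(self, trainer: "pl.Trainer") -> None:
    # optimizers and schedulers are only (re)built while fitting or tuning ...
    if trainer.state.fn not in (TrainerFn.FITTING, TrainerFn.TUNING):
        return
    # ... otherwise _init_optimizers_and_lr_schedulers(self.lightning_module) runs
    # here and rebuilds self.optimizers and the LR scheduler configuration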

To Reproduce

Run the kfold.py example, and check how learning rate changes using the LearningRateMonitor callback.
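
For example, a minimal sketch of the check (it assumes the Trainer construction in the example can be extended with an extra callback):

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import LearningRateMonitor

# attach a LearningRateMonitor to the example's Trainer; with the bug, the logged
# learning rate keeps following the previous fold's schedule instead of restarting
trainer = Trainer(
    max_epochs=10,
    callbacks=[LearningRateMonitor(logging_interval="epoch")],
)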

Expected behavior

Expected the optimizer and scheduler to be reset at the end of every fold.

Environment

Additional context

carmocca commented 2 years ago

Would you be interested in working on a PR to fix it?

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

ninivert commented 2 years ago

For anyone encountering this, a quick hack to get it working is to replace https://github.com/Lightning-AI/lightning/blob/28e18881a9ad2298169c78ad9ae109191e201c2f/examples/pl_loops/kfold.py#L208 with:

old_state = self.trainer.state.fn  # HACK: remember the current trainer state
self.trainer.state.fn = TrainerFn.FITTING  # pretend we are fitting so setup_optimizers() does not return early
self.trainer.strategy.setup_optimizers(self.trainer)  # https://github.com/Lightning-AI/lightning/issues/12409
self.trainer.state.fn = old_state  # restore the original state

It might be a good idea to reset the model in on_advance_start instead of in on_advance_end (so we don't rely on the previous iteration finishing correctly). Calling _reset_fitting in on_advance_start doesn't work because the dataloaders are reset (and I don't fully understand the intricacies of the internal callbacks).

Using pytorch-lightning==1.7.5

ninivert commented 2 years ago

Another bug I encountered was that the callbacks were not being reinitialized, so I added a way of restoring the callbacks' state:

from copy import deepcopy
from typing import Any

class KFoldLoop(Loop):
    # ...
    def on_run_start(self, *args: Any, **kwargs: Any) -> None:
        # ...
        # snapshot the initial state of every callback before the first fold runs
        self.callback_state_dicts = [deepcopy(callback.state_dict()) for callback in self.trainer.callbacks]

    def on_advance_end(self) -> None:
        # ...
        # restore the callbacks to their initial state at the end of each fold
        for callback, state_dict in zip(self.trainer.callbacks, self.callback_state_dicts):
            callback.load_state_dict(deepcopy(state_dict))  # deepcopy because ``ModelCheckpoint`` assigns without copying

awaelchli commented 1 year ago

We now have a new k-fold example where a new optimizer is created for each split: https://github.com/Lightning-AI/lightning/tree/master/examples/fabric/kfold_cv
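
The core idea, in a minimal sketch (assuming Fabric and scikit-learn's KFold; this is not the actual example file):

import torch
from torch import nn
from lightning.fabric import Fabric
from sklearn.model_selection import KFold

fabric = Fabric(accelerator="cpu", devices=1)
fabric.launch()

X, y = torch.randn(100, 16), torch.randn(100, 1)

for fold, (train_idx, val_idx) in enumerate(KFold(n_splits=5).split(X)):
    # a fresh model and optimizer are built inside the loop, so nothing (optimizer
    # state, LR schedule, ...) carries over from one fold to the next
    model = nn.Linear(16, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    model, optimizer = fabric.setup(model, optimizer)

    for _ in range(5):  # a few training steps per fold
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(X[train_idx]), y[train_idx])
        fabric.backward(loss)
        optimizer.step()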

I'm closing the issue due to its age, but of course any further feedback is appreciated!