LearningRateFinder interferes with EarlyStopping #19575

Open zhf231298 opened 6 months ago

zhf231298 commented 6 months ago

Bug description

When EarlyStopping is used together with LearningRateFinder, the early stopping check is triggered $n$ steps before validation runs, where $n$ is the number of steps executed by the learning rate finder. This is a problem when early stopping monitors a validation metric, because that metric has not been computed yet when the check runs.

What version are you seeing the problem on?

v2.2, master

How to reproduce the bug

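callbacks = [
    EarlyStopping(monitor="val/loss", patience=10),
    LearningRateFinder(),  # assumed from the issue title; the original snippet is cut off here
]
trainer = Trainer(callbacks=callbacks)  # any other Trainer arguments were not shown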

Error messages and logs

RuntimeError: Early stopping conditioned on metric `val/loss` which is not available. Pass in or modify your `EarlyStopping` callback to use any of the following: `train/loss`, `train/raw_loss`, `train/mae`, `train/raw_mae`, `train/mae_improvement`
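To illustrate why the message lists only `train/...` metrics, here is a minimal pure-Python sketch of a metric-availability check. It is a simplified stand-in and not Lightning's actual implementation: the function name `check_monitor`, its `strict` flag, and the metric values are made up for illustration.

```python
def check_monitor(monitor, logged_metrics, strict=True):
    """Return True if `monitor` has been logged; otherwise raise or return False."""
    if monitor in logged_metrics:
        return True
    available = "`, `".join(logged_metrics)
    message = (
        f"Early stopping conditioned on metric `{monitor}` which is not available. "
        "Pass in or modify your `EarlyStopping` callback to use any of the "
        f"following: `{available}`"
    )
    if strict:
        raise RuntimeError(message)
    return False


# Illustrative values only: validation has not run yet, so only
# training metrics have been logged when the check fires.
train_only = {"train/loss": 0.7, "train/mae": 0.4}

try:
    check_monitor("val/loss", train_only)
except RuntimeError as err:
    error_text = str(err)  # same shape as the message reported above
```

If the check ran after validation, `val/loss` would be in the logged metrics and the function would return `True`.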


More info

I have not tried to reproduce this error with other callbacks, but I think the same issue could occur with any callback that runs the network before fitting actually starts.

famura commented 2 months ago

I just encountered the same issue using pytorch-lightning version 2.2.4.

It seems like the learning rate finder iterations count towards some counter that triggers the `on_advance_end` callback. That callback then fails because the early stopping callback can't find its metric: we only log it during the validation step, not the training step.
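A possible stopgap, which I have not verified against this bug, is to relax the early stopping check with two existing `EarlyStopping` arguments: `strict=False`, so a missing metric skips the check instead of raising, and `check_on_train_epoch_end=False`, so the check runs at the end of validation. The sketch below assumes the `lightning.pytorch` import path and guards the import so that it still runs where Lightning is not installed. The name `early_stopping_kwargs` is only for illustration.

```python
# Hypothetical workaround sketch; not confirmed to resolve this issue.
early_stopping_kwargs = dict(
    monitor="val/loss",
    patience=10,
    strict=False,  # do not raise if the metric is missing; the check is skipped
    check_on_train_epoch_end=False,  # run the check at the end of validation
)

try:
    from lightning.pytorch.callbacks import EarlyStopping, LearningRateFinder
except ImportError:
    # Lightning is not installed; only the settings above are defined.
    callbacks = None
else:
    callbacks = [EarlyStopping(**early_stopping_kwargs), LearningRateFinder()]
```

Note that `strict=False` would also hide a genuinely misspelled metric name, so this only makes sense as a temporary measure.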