Lightning-AI / pytorch-lightning


Resuming training throws the mid-epoch warning every time #11029

Closed rohitgr7 closed 2 years ago

rohitgr7 commented 2 years ago

Proposed refactor

Getting this:

UserWarning: You're resuming from a checkpoint that ended mid-epoch. Training will start from the beginning of the next epoch. This can cause unreliable results if further training is done, consider using an end of epoch checkpoint.

even when checkpoints saved at epoch end are being used to resume the training.

The reason is we set total train batches to inf here: https://github.com/PyTorchLightning/pytorch-lightning/blob/5576fbc5f9a7d0bc71ad26b8b54775110e675808/pytorch_lightning/trainer/trainer.py#L647

and reload dataloaders within fit_loop here: https://github.com/PyTorchLightning/pytorch-lightning/blob/5576fbc5f9a7d0bc71ad26b8b54775110e675808/pytorch_lightning/loops/fit_loop.py#L190-L193

so num_training_batches is always inf at this point: https://github.com/PyTorchLightning/pytorch-lightning/blob/5576fbc5f9a7d0bc71ad26b8b54775110e675808/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L246-L253
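
For illustration, here is a minimal standalone sketch (not the actual connector code; the variable names and values are made up) of why any comparison against an inf batch count can never signal a completed epoch:

```python
# num_training_batches is still inf when the comparison runs, so even a
# checkpoint saved at a true epoch boundary looks "mid-epoch".
num_training_batches = float("inf")  # set before the dataloaders are reloaded
restored_batches_seen = 1624         # hypothetical value restored from the checkpoint

finished_epoch = restored_batches_seen % num_training_batches == 0
print(finished_epoch)  # False -> the warning fires even for an end-of-epoch checkpoint
```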

Pitch

Either remove the warning, since it can't be resolved reliably with the current logic, or start adding a flag to every saved checkpoint indicating whether it was saved mid-epoch (a rough sketch follows below). Or is there a better solution?

Else it will lead to false-positive warnings for users.
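
A rough sketch of the flag option, assuming a plain Callback and the on_save_checkpoint hook variant that receives the checkpoint dict (the hook signature differs across Lightning versions):

```python
import pytorch_lightning as pl


class MidEpochFlagCallback(pl.Callback):
    """Hypothetical sketch: tag every checkpoint with whether it was
    written before the current train epoch finished."""

    def __init__(self):
        self._epoch_ended = True

    def on_train_epoch_start(self, trainer, pl_module):
        self._epoch_ended = False

    def on_train_epoch_end(self, trainer, pl_module):
        self._epoch_ended = True

    def on_save_checkpoint(self, trainer, pl_module, checkpoint):
        # Mutates the checkpoint dict in place; on restore, the trainer could
        # read this flag instead of comparing batch counts against inf.
        checkpoint["saved_mid_epoch"] = not self._epoch_ended
```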

Additional context



cc @justusschock @awaelchli @akihironitta @ananthsub @ninginthecloud

RahulBhalley commented 2 years ago

Thanks @rohitgr7 for reporting this issue for me. Actually, I am having this problem.

The documentation says that checkpoints are automatically saved at the end of each epoch, but it looks like the Trainer is saving at every iteration step. I even tried creating a ModelCheckpoint callback for end-of-epoch checkpoint saving, but no luck.
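
For reference, this is roughly how end-of-epoch saving can be requested explicitly (parameter names as in recent Lightning releases, so they may differ in older versions; the directory is a placeholder):

```python
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint_callback = ModelCheckpoint(
    dirpath="checkpoints/",        # placeholder output directory
    every_n_epochs=1,              # save once per epoch
    save_on_train_epoch_end=True,  # write the file at the train-epoch boundary
)
```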

Sorry @rohitgr7, I didn't know you're a member of PyTorch Lightning ⚡️. Thanks for the conversation!! Looking forward to the resolution.

rohitgr7 commented 2 years ago

@RahulBhalley so are you seeing N checkpoints being created considering you have N training steps?

RahulBhalley commented 2 years ago

@rohitgr7 No, only one checkpoint is created when the training stops. For example, epoch=357-step=580997.ckpt is the name of the checkpoint after some training session. But I don't know whether it's really a mid-epoch checkpoint or whether it was saved at the end of an epoch.

rohitgr7 commented 2 years ago

well, it's saved at the end only. It's just that Lightning adds both epoch and step to the filename by default. The only issue here is the warning that is generated every time, which should not be the case :)
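
To illustrate the naming point (the templates below only change the filename, not when saving happens; the directory is a placeholder):

```python
from pytorch_lightning.callbacks import ModelCheckpoint

# The default template interpolates both epoch and step, which is why a
# checkpoint written at an epoch boundary still carries a step count,
# e.g. epoch=357-step=580997.ckpt.
ModelCheckpoint(dirpath="checkpoints/")

# A purely cosmetic alternative keyed on the epoch only, e.g. epoch=357.ckpt:
ModelCheckpoint(dirpath="checkpoints/", filename="{epoch}")
```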

RahulBhalley commented 2 years ago

well, it's saved at the end only. It's just that Lightning adds both epoch and step to the filename by default.

Cool, doesn't seem like a functionality issue on the user-end. 😃

xsys-technology commented 2 years ago

I am experiencing this same warning using a Trainer that maintains the 2 models with the lowest validation loss during training via the following ModelCheckpoint callback:

```python
from pytorch_lightning.callbacks import ModelCheckpoint

# Keep the 2 checkpoints with the lowest validation loss
checkpoint_callback = ModelCheckpoint(monitor="val_loss",
                                      dirpath=ckpt_dir,        # user-defined output directory
                                      filename=ckpt_filename,  # user-defined filename template
                                      save_top_k=2,
                                      mode="min")
```

The trainer is correctly creating and updating 2 models ('.ckpt' and '-v1.ckpt'). The big problem is that if I run consecutive training runs (resuming training), passing the correct 'path_to_best_checkpoint' to fit(),

trainer.fit(model, ckpt_path='path_to_best_checkpoint')

the training and validation losses look like I'm beginning to train a fresh untrained model (not a model that was trained in the previous run). In other words, I'm experiencing a huge training/validation loss discontinuity across consecutive training runs. (Note: I did increase --accumulate_grad_batches by about 10 batches between runs, but I don't see how this could throw off the losses enough to make a well-trained model seem like a fresh untrained one.)

Is this expected behavior?
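
For comparison, a minimal resume sketch using the path tracked by the callback above (assuming checkpoint_callback, trainer and model from the earlier snippets and a single process; across separate runs the path would be passed as a literal string):

```python
# best_model_path is filled in by ModelCheckpoint during training; passing it
# to ckpt_path restores model weights, optimizer state and epoch/step counters.
best_ckpt = checkpoint_callback.best_model_path
trainer.fit(model, ckpt_path=best_ckpt)
```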

rohitgr7 commented 2 years ago

hey @MRGLabs

I realized later that the warning will still appear in another case, but it's already fixed on master, so it won't come up again in the next release :)

xsys-technology commented 2 years ago

hey @rohitgr7

Thank you again for the good info :)

shravankumar-concat commented 1 year ago

well, it's saved at the end only. It's just that Lightning adds both epoch and step to the filename by default. The only issue here is the warning that is generated every time, which should not be the case :)

Could you specify from which version this warning will no longer appear?

famura commented 2 months ago

I am still having the same issue and boiled it down to an interaction with a LearningRateFinder customized according to this Lightning doc page. Without the finder, everything works as expected. However, after finding the learning rate for the first "milestone" during training, this message is thrown:

... python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py:161:
You're resuming from a checkpoint that ended before the epoch ended and your dataloader is not resumable. This can cause unreliable results if further training is done. Consider using an end-of-epoch checkpoint or make your dataloader resumable by implementing the `state_dict` / `load_state_dict` interface.

Setting save_on_train_epoch_end=True for the ModelCheckpoint did not help.
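
For context, a sketch of the kind of milestone-based finder described on the linked doc page (class and hook names as in recent Lightning docs; the milestones argument is part of the customization, not of the base callback):

```python
from pytorch_lightning.callbacks import LearningRateFinder


class FineTuneLearningRateFinder(LearningRateFinder):
    """Sketch: re-run the LR search at the start of selected epochs."""

    def __init__(self, milestones, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.milestones = milestones

    def on_fit_start(self, *args, **kwargs):
        # Skip the default search at fit start; it runs at the milestones below.
        return

    def on_train_epoch_start(self, trainer, pl_module):
        if trainer.current_epoch in self.milestones or trainer.current_epoch == 0:
            self.lr_find(trainer, pl_module)
```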

famura commented 2 months ago

@carmocca (I am tagging you since you closed this issue and I cannot reopen it)