danielzeng-gt opened this issue 9 months ago
@danielzeng-gt Thanks for submitting the issue.
I read your description multiple times but I don't understand the problem. Can you try to formulate it with an example? Is it related to #17912?
Hey Adrian, thanks for the prompt response! I looked at #17912 and it doesn't seem to be related.
I generated an example with GPT-4 and read it over; it describes the problem quite accurately. Please let me know if it's still confusing:
Suppose Alice is training a neural network to classify images of cats and dogs on a cloud-based preemptible instance. She's interested in keeping two kinds of checkpoints: the latest checkpoint (so she can resume after a preemption) and the best checkpoint according to validation loss. To achieve this, Alice uses two `ModelCheckpoint` callbacks as described.
Training Run 1:
Alice's first run saves two checkpoints:
- `last.ckpt` (the latest checkpoint)
- `best_val_confidence-epoch1-val_loss0.5e` (the best checkpoint based on validation loss)

Training Resumption:
After a preemption, training resumes from `last.ckpt` and continues training. The run then saves:
- `last.ckpt` (replacing the older one)
- `best_val_confidence-epoch2-val_loss0.4e` (a new best checkpoint)

Expected Behavior:
Since Alice specified `save_top_k=1` for the best validation loss checkpoint, she expects to find only one such checkpoint in her directory, i.e., `best_val_confidence-epoch2-val_loss0.4e`.
Actual Behavior: Alice finds two best validation loss checkpoints:
- `best_val_confidence-epoch1-val_loss0.5e`
- `best_val_confidence-epoch2-val_loss0.4e`
This indicates that the `ModelCheckpoint` callback did not delete the older "best" checkpoint upon resumption, leading to multiple "best" checkpoints being saved.
Implication: This behavior is especially problematic if Alice runs many epochs and faces multiple preemptions. Over time she would accumulate multiple "best" checkpoints, making it confusing to identify the genuine best one.
The bug seems to arise from a state restoration issue in the `ModelCheckpoint` callback when resuming training from a checkpoint: it fails to remember its previous "best" state and does not delete older checkpoints as it should.
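One way to observe this (a diagnostic sketch, not from the original report; `best_checkpoint` is a hypothetical name for the best-loss callback):

```python
# After trainer.fit(..., ckpt_path="last.ckpt") restores training state,
# inspect the ModelCheckpoint bookkeeping that drives checkpoint deletion.
print(best_checkpoint.best_model_path)  # path the callback believes is the current best
print(best_checkpoint.best_k_models)    # {path: score} map used to decide deletions
# If best_k_models no longer lists the pre-preemption best checkpoint, the
# next improvement saves a new file without deleting the old one.
```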
I'm hitting the same issue. I understand fixing it might be a breaking change; could we add an option to handle that?
Bug description
Description: When using `ModelCheckpoint` with the parameters `save_top_k=1` and `monitor='val_loss'` during a single training run, the behavior is as expected and only one 'best_val_confidence-epoch...' checkpoint is retained.

However, in the context of cloud-based training where instances may be preempted or restarted from a checkpoint:
- Training is resumed from a checkpoint saved by `ModelCheckpoint`.
- The `ModelCheckpoint` state was restored incorrectly.
- `ModelCheckpoint` creates a new checkpoint but fails to delete the old one.

Thus, if there's a single preemption/restart during the training run, we end up with two 'best_val_loss' checkpoints.

It should be noted that we load/write checkpoints with `fsspec`, which allows checkpoints to be written to and loaded directly from Google Cloud Storage (GCS).

Code Details:
There are currently two `ModelCheckpoint` callbacks in use.

The first is for saving the latest checkpoint:
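A minimal sketch of what this callback could look like (assumed, not the original code; the variable name and bucket path are hypothetical):

```python
from pytorch_lightning.callbacks import ModelCheckpoint

# Keep only last.ckpt so the run can resume after a preemption; fsspec
# lets dirpath point directly at a GCS bucket.
last_checkpoint = ModelCheckpoint(
    dirpath="gs://my-bucket/checkpoints",  # hypothetical bucket path
    save_last=True,   # writes/overwrites last.ckpt after each validation
    save_top_k=0,     # this callback tracks no "best" checkpoints
)
```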
The second is for saving the best validation loss checkpoint:
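And a corresponding sketch for the best-checkpoint callback (also assumed; the filename template is inferred from the checkpoint names quoted above):

```python
# Keep only the single best checkpoint by validation loss.
best_checkpoint = ModelCheckpoint(
    dirpath="gs://my-bucket/checkpoints",  # hypothetical bucket path
    filename="best_val_confidence-epoch{epoch}-val_loss{val_loss:.1f}",
    monitor="val_loss",
    mode="min",
    save_top_k=1,                   # should keep exactly one "best" file
    auto_insert_metric_name=False,  # avoid "epoch=1-val_loss=0.5"-style names
)
```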
What version are you seeing the problem on?
v1.9
How to reproduce the bug
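A reproduction sketch (assumed, not from the original report), using Lightning's `BoringModel` and a local directory in place of GCS; all names are illustrative:

```python
import os

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.demos.boring_classes import BoringModel


class DemoModel(BoringModel):
    def validation_step(self, batch, batch_idx):
        # Log a toy val_loss so the "best" callback has something to monitor.
        self.log("val_loss", self(batch).sum())


def make_callbacks():
    # Fresh callback instances, as after a preemption/restart of the process.
    return [
        ModelCheckpoint(dirpath="ckpts", save_last=True, save_top_k=0),
        ModelCheckpoint(
            dirpath="ckpts",
            filename="best-epoch{epoch}-val_loss{val_loss:.2f}",
            monitor="val_loss",
            mode="min",
            save_top_k=1,
            auto_insert_metric_name=False,
        ),
    ]


# First run: one epoch, producing last.ckpt and one best-* checkpoint.
trainer = pl.Trainer(max_epochs=1, limit_train_batches=2, limit_val_batches=2,
                     callbacks=make_callbacks())
trainer.fit(DemoModel())

# Simulated restart: resume from last.ckpt with new callback instances.
trainer = pl.Trainer(max_epochs=2, limit_train_batches=2, limit_val_batches=2,
                     callbacks=make_callbacks())
trainer.fit(DemoModel(), ckpt_path="ckpts/last.ckpt")

# Expected: a single best-* file (save_top_k=1). Observed: two.
print(sorted(os.listdir("ckpts")))
```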
Error messages and logs
Environment
Current environment
```
- Lightning Component: ModelCheckpoint object
- PyTorch Lightning Version: 1.9.2
- PyTorch Version: 1.13.0
- Python Version: 3.10.12
- OS: Linux
- CUDA/cuDNN version: Build cuda_11.6.r11.6/compiler.31057947_0
- GPU models: Nvidia A100
- How you installed Lightning: Conda
- Running environment of LightningApp: Cloud, Running on GCP A100 instance
```
More info
No response
cc @carmocca @awaelchli