Open Anner-deJong opened 2 years ago
I'd be happy to help fix this if we agree it is a bug, but I'd like the opinion and take on a solution from folks more involved with PyTorch Lightning. Some simple ideas, but none of them great:

- move the checkpoint triggering into `training_epoch_loop.on_advance_end`
- add an extra hook in `training_epoch_loop.on_advance_end` that checkpoints
- call `on_train_batch_end` after the `increment_completed`
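For context, here is a minimal sketch of the ordering problem these ideas try to address. It is an illustrative stand-in, not Lightning's actual loop code: it only shows how a checkpoint written from inside `on_train_batch_end` captures a `completed` counter that is one behind the batch that just finished.

```python
# Illustrative sketch only -- a simplified stand-in, not Lightning's real loop code.
class BatchProgress:
    def __init__(self):
        self.ready = 0
        self.completed = 0


progress = BatchProgress()
saved_checkpoints = []


def save_checkpoint():
    # A checkpoint stores whatever the counters hold at the moment it is written.
    saved_checkpoints.append({"ready": progress.ready, "completed": progress.completed})


for batch_idx in range(5):
    progress.ready += 1          # increment_ready()
    # ... forward / backward / optimizer step ...
    save_checkpoint()            # on_train_batch_end -> checkpoint callback fires here
    progress.completed += 1      # increment_completed() only runs after the hook

# Every checkpoint has completed == ready - 1: the batch that just finished is not
# yet counted as completed, and that is the value reset_on_restart() rolls back to.
print(saved_checkpoints[2])      # {'ready': 3, 'completed': 2}
```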
Hi, @Anner-deJong - Thank you for creating this issue, and helping with all the context around this. Probably @awaelchli or @carmocca can help you on this one.
Getting back to this issue, as it came up recently in a different context. This is definitely a behavior that needs fixing.
My take would be to not add an extra hook, but to invoke `increment_completed` prior to `on_train_batch_end`. Which one comes first is very debatable from a definition standpoint, but having the increments already set up correctly when I'm in my `on_train_batch_end` hook is a fair expectation IMO.
The sequence would go from:

- `self.batch_progress.increment_ready()`
- `on_train_batch_start`
- `self.batch_progress.increment_started()`
- ...
- `self.batch_progress.increment_processed()`
- ...
- `on_train_batch_end`
- `self.batch_progress.increment_completed()`

to:

- `self.batch_progress.increment_ready()`
- `on_train_batch_start`
- `self.batch_progress.increment_started()`
- ...
- `self.batch_progress.increment_processed()`
- ...
- `self.batch_progress.increment_completed()`
- `on_train_batch_end`
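A rough sketch of what the reordered loop body could look like. The hook-dispatch call and the surrounding structure are simplified assumptions for illustration, not a patch against the actual `TrainingEpochLoop`:

```python
# Hypothetical, simplified advance() to illustrate the proposed ordering only.
def advance(self, batch):
    self.batch_progress.increment_ready()
    self.trainer._call_callback_hooks("on_train_batch_start", batch)
    self.batch_progress.increment_started()

    output = self.batch_loop.run(batch)          # optimizer step(s) happen in here
    self.batch_progress.increment_processed()

    # Proposed change: mark the batch completed *before* the end-of-batch hooks,
    # so a checkpoint saved from on_train_batch_end already includes this batch.
    self.batch_progress.increment_completed()
    self.trainer._call_callback_hooks("on_train_batch_end", output, batch)
```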
Not sure what to think about the `started` part, but that's for another time.
On further thought, I'll take a more conservative approach since the implications are pretty wide.
🐛 Bug
Found a discrepancy between a run that continues after checkpointing and a run restored from that checkpoint.
Observation:
The training-batch / val-loop ordering upon checkpoint restoration is not the same as in the original run after checkpoint saving. There are still the same number of train steps, but the validation loops are interleaved one step later, which can cause the restored run to end up with one fewer validation loop (see Colab).
Assumption / expectation:
Zero difference between a training run that continues after a checkpoint and a run continued from said checkpoint.
Investigation so far:
I'm new to some of this Lightning code, but IIUC:
Key:

1. `TrainingEpochLoop`'s `self.batch_progress.increment_completed()` is called after the `"on_train_batch_end"` hooks, the latter kicking off checkpoint saving.
2. `TrainingEpochLoop.batch_progress.current.reset_on_restart()` will reset the `ready` count back to `completed`.
3. The `global_step`, which refers to `TrainingEpochLoop.batch_loop.optimizer_loop.optim_progress.optimizer.step.total.completed`, has already been `increment_completed()` (called within `TrainingEpochLoop.batch_loop.run`), and thus upon restoring, `..optimizer.step.total.ready` is set to an up-to-date `optimizer.step.total.completed`, out of sync with the above.
4. In "`val_check_interval` mode", validation is triggered when `TrainingEpochLoop.batch_progress.current.ready % val_check_interval == 0` (through `TrainingEpochLoop.on_advance_end` -> `TrainingEpochLoop._should_check_val_fx`).

As a result, the `batch_progress.current` `ready`/`completed` counters for the continued and restored runs end up aligned with different `global_step`s, and hence validation triggers at different `global_step`s.
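To make point 4 and this conclusion concrete, here is a small worked example of the resulting off-by-one. It assumes `val_check_interval = 3` (as in the Colab) and only mimics the two counters described above; it is not Lightning code:

```python
# Toy bookkeeping only -- mimics the counters described above, not Lightning code.
def validation_global_steps(ready, global_step, n_batches, val_check_interval=3):
    """Return the global_steps at which validation would fire."""
    triggers = []
    for _ in range(n_batches):
        ready += 1           # batch_progress.current.ready
        global_step += 1     # optimizer.step.total.completed
        if ready % val_check_interval == 0:   # the _should_check_val_fx condition
            triggers.append(global_step)
    return triggers


# Original run, starting just before the batch whose on_train_batch_end saves the
# checkpoint: ready == completed == global_step == 2 at that point.
print(validation_global_steps(ready=2, global_step=2, n_batches=7))   # [3, 6, 9]

# Restored run: reset_on_restart() rolls ready back to the checkpointed completed
# value (2), while the saved optimizer step counter is already 3.
print(validation_global_steps(ready=2, global_step=3, n_batches=7))   # [4, 7, 10]
```

The restored run's validations land one `global_step` later than the original run's, matching the interleaving shift described in the observation above.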
Another observation:
The following if statement seems to allow for a zero-difference restart, except that, just like 4. above, `_should_check_val_fx` wouldn't trigger where it did in the original run on the checkpointing step (although there it is called in `on_advance_end`). Not sure if the original intention of this snippet included the current scope.

PRs relevant to this line:
#9681
#9563
Potential impact:
Assuming this is not too worrisome for the more default Lightning use cases: `val_check_interval` >> 3 (Colab example = 3), or with it turned off, relying instead on `check_val_every_n_epoch`.
However, in theory it can influence all of the following:
To Reproduce
Customized Google Colab based on `bug_report_model.ipynb`, with the same observation on `BoringModel`.
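In lieu of the Colab link, here is a hedged, self-contained sketch of a reproduction along the same lines. The `val_check_interval=3` value and the checkpoint at step 3 are taken from the description above; the module is a minimal BoringModel-style stand-in rather than the notebook's exact code, and the Trainer arguments may differ from the Colab:

```python
# Reproduction sketch (assumptions noted above); prints the global_step of every
# validation run so the two runs can be compared.
import torch
from torch.utils.data import DataLoader, Dataset
import pytorch_lightning as pl


class RandomDataset(Dataset):
    def __len__(self):
        return 64

    def __getitem__(self, idx):
        return torch.randn(32)


class BoringLikeModel(pl.LightningModule):
    """Minimal BoringModel-style module so the snippet is self-contained."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        return self.layer(batch).sum()

    def validation_step(self, batch, batch_idx):
        print(f"validation at global_step={self.global_step}")

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)


class SaveAtStep3(pl.Callback):
    """Save a checkpoint from on_train_batch_end, mirroring the ordering in the bug."""

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        if trainer.global_step == 3:
            trainer.save_checkpoint("step3.ckpt")


def run(ckpt_path=None):
    trainer = pl.Trainer(
        max_steps=12,
        val_check_interval=3,      # validate every 3 training batches
        limit_val_batches=1,
        num_sanity_val_steps=0,
        enable_progress_bar=False,
        callbacks=[SaveAtStep3()],
    )
    trainer.fit(
        BoringLikeModel(),
        DataLoader(RandomDataset(), batch_size=4),
        DataLoader(RandomDataset(), batch_size=4),
        ckpt_path=ckpt_path,
    )


run()                          # uninterrupted run past the step-3 checkpoint
run(ckpt_path="step3.ckpt")    # run restored from that checkpoint
```

Comparing the printed global_steps after step 3 between the two runs should show the restored run's validations shifted by one step, per the observation above.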
Expected behavior
Zero difference between a training run that continues after a checkpoint and a run restored from said checkpoint.
Environment
Note: the details below are from `v1.6.4`. I have also checked `v1.7.4` and the `master` branch over the last weeks, and the relevant code seems unchanged.

Details
CUDA:
Lightning:
Packages:
System:
Additional context
cc @awaelchli @ananthsub @ninginthecloud @rohitgr7 @otaj @carmocca @justusschock