Open PiotrDabkowski opened 2 months ago
This was reported and fixed once before, but a regression was introduced in 1.6.0: https://github.com/Lightning-AI/pytorch-lightning/issues/11504
It was previously fixed in https://github.com/Lightning-AI/pytorch-lightning/pull/11552 with:
# while restarting with no fault-tolerant, batch_progress.current.ready is -1
if batch_idx == -1:
    return False
batch_idx was removed some time ago, so should the logic now be the following?
if self.restarting:
    return False
@lantiga this is a really problematic issue that makes for a completely bugged experience. It would be nice to get it fixed ASAP.
Bug description
When val_check_interval is used and the model is resumed, validation runs immediately after the resume, wasting resources (it triggers a checkpoint save and a full validation run). Related discussion: https://github.com/Lightning-AI/pytorch-lightning/discussions/18110
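To make the behaviour concrete, here is a minimal toy model in plain Python. It is not Lightning's actual loop code: `ToyLoop`, `run_after_resume`, and the `use_guard` switch are hypothetical names made up for illustration, and the sketch assumes the interval check is re-evaluated against the restored batch counter when the loop is re-entered. Under that assumption it shows how a checkpoint saved right after a validation run leads to a second validation immediately on resume, and how a `restarting` guard like the one proposed above would skip it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyLoop:
    """Toy stand-in for a training loop (hypothetical, not Lightning's code)."""

    val_check_batch: int       # validate every N training batches
    batches_done: int = 0      # counter restored from the checkpoint on resume
    restarting: bool = False   # True until the loop has been re-entered
    use_guard: bool = False    # enable the guard proposed in this issue

    def should_check_val(self) -> bool:
        # Proposed guard: never validate while still restarting.
        if self.use_guard and self.restarting:
            return False
        return self.batches_done > 0 and self.batches_done % self.val_check_batch == 0


def run_after_resume(loop: ToyLoop, extra_batches: int) -> List[int]:
    """Return the batch counts at which validation ran after a resume."""
    validated_at = []
    # Assumption: on re-entry, the check runs against the restored counter.
    if loop.should_check_val():
        validated_at.append(loop.batches_done)
    loop.restarting = False
    for _ in range(extra_batches):
        loop.batches_done += 1
        if loop.should_check_val():
            validated_at.append(loop.batches_done)
    return validated_at


# Checkpoint was saved at batch 100, right after a validation run.
buggy = run_after_resume(
    ToyLoop(val_check_batch=100, batches_done=100, restarting=True), 100
)
guarded = run_after_resume(
    ToyLoop(val_check_batch=100, batches_done=100, restarting=True, use_guard=True), 100
)
print(buggy)    # [100, 200]: validation repeats immediately at batch 100
print(guarded)  # [200]: the redundant run on resume is skipped
```

Whether `self.restarting` is the right flag to test in the real loop, and when exactly it gets reset, would need to be confirmed against the current Lightning source.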
What version are you seeing the problem on?
v2.4
How to reproduce the bug
No response
Error messages and logs
No response
Environment
No response
More info
No response