Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Validation is incorrectly run on resume #20277

Open PiotrDabkowski opened 2 months ago

PiotrDabkowski commented 2 months ago

Bug description

When val_check_interval is used and training is resumed from a checkpoint, validation runs immediately after the resume. This wastes resources, since it triggers a full validation run and a checkpoint save.

https://github.com/Lightning-AI/pytorch-lightning/discussions/18110

What version are you seeing the problem on?

v2.4

How to reproduce the bug

No response

Error messages and logs

No response

Environment

No response

More info

No response

PiotrDabkowski commented 2 months ago

This was reported and fixed once before, but a regression was introduced in 1.6.0: https://github.com/Lightning-AI/pytorch-lightning/issues/11504

ryxli commented 1 month ago

This was previously fixed in https://github.com/Lightning-AI/pytorch-lightning/pull/11552 with:

        # while restarting with no fault-tolerant, batch_progress.current.ready is -1
        if batch_idx == -1:
            return False

batch_idx was removed some time ago. Should the logic now be the following?

        if self.restarting:
            return False
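
To make the proposed guard concrete, here is a minimal toy model of the behaviour. It is not Lightning's actual loop: `ToyLoop`, `guard_on_restart`, `batches_done`, and `val_runs` are hypothetical names invented for illustration, and it assumes that on resume the end-of-step validation check is re-evaluated once before any new batch is trained.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLoop:
    """Toy stand-in for a training loop. NOT Lightning's real implementation."""

    val_check_interval: int
    batches_done: int = 0          # restored from the checkpoint on resume
    restarting: bool = False       # hypothetical flag, set when resuming
    guard_on_restart: bool = True  # toggles the proposed `if self.restarting` check
    val_runs: list = field(default_factory=list)

    def should_check_val(self) -> bool:
        # Proposed guard: never validate while still in the restarting state.
        if self.guard_on_restart and self.restarting:
            return False
        return (
            self.batches_done > 0
            and self.batches_done % self.val_check_interval == 0
        )

    def run(self, total_batches: int) -> None:
        # Assumption: on resume, the end-of-step check runs once more
        # before any new batch has been trained.
        if self.restarting:
            if self.should_check_val():
                self.val_runs.append(self.batches_done)
            self.restarting = False
        while self.batches_done < total_batches:
            self.batches_done += 1  # "train" one batch
            if self.should_check_val():
                self.val_runs.append(self.batches_done)


# Resume from a checkpoint taken at batch 4 with val_check_interval=4.
unguarded = ToyLoop(4, batches_done=4, restarting=True, guard_on_restart=False)
unguarded.run(8)

guarded = ToyLoop(4, batches_done=4, restarting=True)
guarded.run(8)
```

In this sketch the unguarded loop records a validation run at batch 4 right after resuming and again at batch 8, while the guarded loop only validates at batch 8. Whether `self.restarting` is the right signal inside Lightning itself would still need to be checked against the current loop code.
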

PiotrDabkowski commented 1 day ago

@lantiga this is a really problematic issue, just a completely bugged experience, would be nice to get it fixed asap.