F-Barto opened 3 weeks ago
Hi @F-Barto,

You do not need to specify `ckpt_path="hpc"`. In the current setting of Lightning (url), it always searches for the "hpc" path internally first; if it doesn't find one, it takes the `ckpt_path` that you specify in your `trainer.fit(..., ckpt_path=ckpt_path)`. So you shouldn't specify `ckpt_path="hpc"`, and instead just do something like this:
```python
## Option 1
## If you want to have an option for manual resuming,
## have a flag for resuming (args.resume).
# This will auto-requeue, resume from last.ckpt, or run from scratch.
trainer.fit(..., ckpt_path="/path/to/saved/checkpoints/last.ckpt" if args.resume else None)

## Option 2
## No manual resuming: just auto-requeue or run from scratch.
trainer.fit(..., ckpt_path=None)
```
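For Option 1, the `args.resume` flag could come from a simple command-line parser. This is a hedged sketch: the flag name is taken from the comment above, but the parser itself is an assumption, not part of the original snippet:

```python
import argparse

# Hypothetical parser for the `args.resume` flag mentioned above.
parser = argparse.ArgumentParser()
parser.add_argument("--resume", action="store_true",
                    help="resume from last.ckpt instead of starting from scratch")

args = parser.parse_args(["--resume"])  # e.g. `python train.py --resume`
ckpt_path = "/path/to/saved/checkpoints/last.ckpt" if args.resume else None
print(ckpt_path)
```

Without `--resume`, `ckpt_path` stays `None` and Lightning either auto-requeues from an HPC checkpoint (if one exists) or trains from scratch.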
Typically what happens is: when you run a `.sh` file and your job is about to hit the wall-time, Lightning automatically creates a temporary checkpoint (`hpc_ckpt_*.ckpt`) in the `default_root_dir` set by the user in the `Trainer`. Then when the job restarts, Lightning automatically searches the `default_root_dir` folder, and if an `hpc_ckpt_*.ckpt` file is there, it loads it and resumes training.
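The lookup described above can be sketched in plain Python. This is a hypothetical re-implementation for illustration (not Lightning's actual code): prefer the newest `hpc_ckpt_*.ckpt` found in `default_root_dir`, otherwise fall back to whatever `ckpt_path` the user passed to `trainer.fit()`:

```python
import os
import re

def resolve_resume_path(default_root_dir, user_ckpt_path=None):
    """Illustrative sketch: pick an HPC checkpoint if present, else fall back."""
    pattern = re.compile(r"hpc_ckpt_(\d+)\.ckpt")
    found = []
    if os.path.isdir(default_root_dir):
        for name in os.listdir(default_root_dir):
            match = pattern.fullmatch(name)
            if match:
                found.append((int(match.group(1)), os.path.join(default_root_dir, name)))
    if found:
        return max(found)[1]  # highest-numbered HPC checkpoint wins
    return user_ckpt_path     # no HPC checkpoint: use the user's ckpt_path (or None)
```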
I hope this helps and was clear enough. Let me know if something is still confusing.
Summary
When attempting to resume a job from where it left off before reaching wall-time on a SLURM cluster using PyTorch Lightning, the `ckpt_path="hpc"` option causes an error if no HPC checkpoint exists yet. This prevents the initial training run from starting.
Expected Behavior
The job should be able to resume from an HPC checkpoint if one exists, when using the following in combination:

- `SLURMEnvironment(auto_requeue=True, requeue_signal=signal.SIGUSR1)`
- `trainer.fit(model, datamodule=dm, ckpt_path="hpc")`
- `#SBATCH --signal=SIGUSR1@30`
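The flow set up by `#SBATCH --signal=SIGUSR1@30` can be simulated without a cluster. This is a standalone sketch (not Lightning's internals): SLURM delivers SIGUSR1 roughly 30 seconds before wall-time, the handler saves an HPC checkpoint, and the job requeues itself:

```python
import signal

actions = []

def handle_sigusr1(signum, frame):
    actions.append("save hpc_ckpt")  # Lightning writes hpc_ckpt_*.ckpt here
    actions.append("requeue job")    # then requeues the SLURM job

signal.signal(signal.SIGUSR1, handle_sigusr1)
signal.raise_signal(signal.SIGUSR1)  # simulate SLURM delivering the signal
print(actions)
```

With `auto_requeue=True`, Lightning installs a handler like this for the configured `requeue_signal` so the job can checkpoint and restart instead of being killed at wall-time.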
If no HPC checkpoint exists (e.g., on the first run), the job should start training from scratch without throwing an error. Currently, it throws one.
Current Behavior
Using `ckpt_path=None` allows the job to start but doesn't resume from the HPC checkpoint when one is created. If I use `trainer.fit(model, datamodule=dm, ckpt_path=None)`, the SIGUSR1 is correctly caught and the checkpoint `hpc_ckpt_1.ckpt` is correctly created. However, the checkpoint is not used, which is expected because we left `ckpt_path=None`.

Using `ckpt_path="hpc"` throws an error if no HPC checkpoint is found, preventing the initial training run. From what I understood, the logic of looking for and loading the HPC checkpoint should be handled by setting `ckpt_path="hpc"`. However, as can be seen in https://github.com/Lightning-AI/pytorch-lightning/blob/master/src/lightning/pytorch/trainer/connectors/checkpoint_connector.py#L193C1-L199C46, if an HPC checkpoint is not found, it throws an error and stops. The issue is that on the very first training run there would of course be no HPC checkpoint, because we haven't started any training yet.
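The requested behavior can be sketched as a small guard. This is purely illustrative (the function name and the `strict` parameter are hypothetical, not a real Lightning option): when `ckpt_path="hpc"` is set but no HPC checkpoint exists yet, fall back to training from scratch instead of raising:

```python
def parse_hpc_ckpt_path(hpc_ckpt_path, strict=True):
    """Illustrative guard: `hpc_ckpt_path` is the HPC checkpoint found on disk, or None."""
    if hpc_ckpt_path is not None:
        return hpc_ckpt_path  # resume from the found HPC checkpoint
    if strict:
        # Current behavior: error out when no HPC checkpoint exists.
        raise ValueError('`ckpt_path="hpc"` is set but no HPC checkpoint was found')
    return None  # behavior this issue asks for: start from scratch on the first run
```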
Relevant issues
#16639
What version are you seeing the problem on?
v2.4
How to reproduce the bug
dummy_model.py
dummy_slurm.sh
Environment
Current environment
```
- PyTorch Lightning Version: 2.4.0
- PyTorch Version: 2.4.0
- Python version: 3.11.9
- How you installed Lightning (`conda`, `pip`, source): pip
```