N8-CIR-Bede / documentation

Documentation for the N8CIR Bede Tier 2 HPC facility
https://bede-documentation.readthedocs.io/en/latest/

Automated Job Restarts (req checkpointing) #155

Open ptheywood opened 1 year ago

ptheywood commented 1 year ago

Jobs that need more than the 2-day maximum job runtime must be split across multiple jobs, with some form of checkpointing so that each job can resume from the saved state and continue.
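As a rough illustration of the checkpoint-and-resume pattern, here is a minimal Python sketch. The function names, the JSON checkpoint format, and the toy workload (a running sum) are all hypothetical; a real application would save its own state instead. Each call to `run` stands in for one job: it loads the last checkpoint if one exists, does a bounded amount of work, and saves after every step.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the saved state, or a fresh one if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_state(path, state):
    """Write to a temporary file, then rename it over the checkpoint.

    os.replace is atomic on the same filesystem, so a job killed
    mid-write leaves the previous checkpoint intact.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(path, n_steps, max_steps_this_job):
    """Resume from the checkpoint and do at most max_steps_this_job steps."""
    state = load_state(path)
    done = 0
    while state["step"] < n_steps and done < max_steps_this_job:
        state["total"] += state["step"]  # placeholder for the real work
        state["step"] += 1
        done += 1
        save_state(path, state)
    return state
```

Calling `run` repeatedly with the same checkpoint path picks up where the previous call stopped, which is the behaviour each successive job would need.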

It would be useful to document this, with at least one example of a Slurm job submission script on the usage page (it may be worth splitting this page at some point as it grows).

It may also be useful to provide an example on the TensorFlow / PyTorch pages, as both have checkpointing built in and this may be a common use case.

A number of solutions have been proposed in the ongoing email thread, including sequential array jobs, and DMTCP with sbatch --requeue --open-mode=append and --signal=B:USR1@60.
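For the `--signal=B:USR1@60` approach, the application side might look something like the sketch below. It assumes the `SIGUSR1` that Slurm sends roughly 60 seconds before the time limit actually reaches the Python process: with `B:` only the batch shell is signalled, so the batch script would need to forward the signal or `exec` the Python process. The `StopRequested` class and `work` function are hypothetical names, and the loop body is a placeholder.

```python
import os
import signal


class StopRequested:
    """Records that a warning signal arrived, so the main loop can stop cleanly."""

    def __init__(self, signum=signal.SIGUSR1):
        self.flag = False
        signal.signal(signum, self._handler)

    def _handler(self, signum, frame):
        # Only set a flag here; the checkpoint is written by the main
        # loop at a safe point, not inside the signal handler.
        self.flag = True


def work(stop, n_steps, start=0):
    """Run steps until finished or until a stop has been requested.

    Returns the step reached, which the caller would write to its checkpoint.
    """
    step = start
    while step < n_steps and not stop.flag:
        step += 1  # placeholder for one unit of real work
    return step
```

Setting a flag and checking it between steps keeps the handler trivial and means the checkpoint is always written from a consistent state.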

It will be worthwhile to encourage use of functionality that prevents jobs from requeuing indefinitely (probably with a warning-level block for good measure); Slurm tracks the restart count for each job.
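One possible guard against infinite requeuing, sketched in Python: Slurm exposes the restart count to the job as the `SLURM_RESTART_COUNT` environment variable, which is not set on the first run. The limit of 5 and the helper names are hypothetical, chosen only for illustration.

```python
import os

MAX_RESTARTS = 5  # hypothetical limit, pick one that suits the workload


def restart_count(env=None):
    """Number of times Slurm has restarted this job (0 on the first run)."""
    env = os.environ if env is None else env
    return int(env.get("SLURM_RESTART_COUNT", "0"))


def may_requeue(env=None, max_restarts=MAX_RESTARTS):
    """True while the job is still below the restart limit."""
    return restart_count(env) < max_restarts
```

A job that stops early with work remaining could check `may_requeue()` before asking to be requeued (for example via `scontrol requeue $SLURM_JOB_ID`), and exit with an error otherwise.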


This was raised by @loveshack on the mailing list.