facebookincubator / submitit

Python 3.8+ toolbox for submitting jobs to Slurm
MIT License

Enabling sbatch file re-use. #1739

Open alexnwang opened 1 year ago

alexnwang commented 1 year ago

I'm interested in re-using sbatch files to re-submit jobs that have crashed. However, the .sh SBATCH file and the .pkl file are both tied to a single SLURM_JOBID, which makes it infeasible to re-use the .sh file to re-launch a job.

It'd be appreciated if there were some way to relax this requirement so that these files are not tied to the JOBID.

gwenzek commented 1 year ago

It's also a long-standing pain point for me, but I need to think a bit more about this. Tying files to the job id is useful because it means that sacct and squeue information maps directly to the on-disk files. But it also means that restarting is a pain.

The nicest way would be to modify the sbatch file itself so that you can run sbatch on it several times. One workaround would be to have a CLI command, e.g. submitit restart 102984, that would restart a previous submitit job from its files.

alexnwang commented 1 year ago

Yeah, I just set up my directories so that if I submit a job again with the exact same parameters, it runs in the same dir. Running out of the same dir means the job picks up where it left off, and a new set of submitit files is written for the re-run.
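
A minimal sketch of that kind of setup, assuming the run directory is derived from a hash of the job parameters. The helper name run_dir_for, the runs/ root, and the example parameters are hypothetical and not part of submitit; only the folder argument of submitit.AutoExecutor is an existing API, and it appears in comments so the snippet runs without Slurm.

```python
import hashlib
import json
from pathlib import Path


def run_dir_for(params: dict, root: Path = Path("runs")) -> Path:
    """Map a parameter dict to a stable directory (hypothetical helper)."""
    # Sort keys so the same parameters always serialize the same way,
    # regardless of the order in which they were written.
    canonical = json.dumps(params, sort_keys=True, default=str)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return root / digest


# Example parameters (made up for illustration).
params = {"lr": 0.1, "model": "resnet18"}
folder = run_dir_for(params)

# With submitit installed, every attempt could then share this directory:
#   folder.mkdir(parents=True, exist_ok=True)
#   executor = submitit.AutoExecutor(folder=folder / "submitit_logs")
#   job = executor.submit(train, params)  # `train` is a placeholder function
# Each submission still writes its own job-id-named files into the log
# folder; resuming depends on `train` loading checkpoints from `folder`.
```

Re-submitting with identical parameters resolves to the same directory, so the job's own checkpoint-loading logic decides where to resume; nothing here re-uses the original .sh or .pkl files.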