facebookincubator / submitit

Python 3.8+ toolbox for submitting jobs to Slurm
MIT License
1.3k stars 125 forks

Determine if a job failed due to exceeding the time limit #1776

Open lee-jin-gyu96 opened 1 month ago

lee-jin-gyu96 commented 1 month ago

Hello,

I'm wondering if there's a clean way to determine whether a job failed because Slurm timed it out or because of an "actual" error. As far as I can tell, I have to parse the error message and check whether it contains "Job not requeued because: timed-out and not checkpointable".

That works, but I'd be grateful for any advice if there's a better way to do this.

(The context is that I have a job that should finish within X minutes. If the job takes longer than X minutes, it means there's a problem with the input, but I can't diagnose that problem before running the job. So the goal is to let my program continue running when a Slurm job fails because it timed out.)

Thank you in advance!
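
For reference, here is a minimal sketch of the workaround described above, wrapped in a helper so the string check lives in one place. The helper name `failed_from_timeout` and the `FakeJob` stand-in are hypothetical and only for illustration. The sketch assumes the job object exposes a `state` attribute holding the Slurm state string and an `exception()` method returning the failure (or None), as submitit job objects appear to. Whether `state` actually reads "TIMEOUT" in this situation is an assumption and may depend on the cluster setup, so the message check is kept as a fallback.

```python
from typing import Any, Optional

# Substring taken from the error message quoted in this issue.
TIMEOUT_MARKER = "timed-out and not checkpointable"


def failed_from_timeout(job: Any) -> bool:
    """Return True if `job` looks like it was stopped for exceeding its time limit.

    Hypothetical helper: it only relies on `job.state` and `job.exception()`.
    """
    # Assumption: `state` holds the Slurm state string (e.g. "TIMEOUT").
    state = str(getattr(job, "state", "")).upper()
    if state.startswith("TIMEOUT"):
        return True
    # Fallback: look for the marker text in the job's exception, if any.
    exc = job.exception()
    return exc is not None and TIMEOUT_MARKER in str(exc)


class FakeJob:
    """Stand-in for a real job object, used only to exercise the helper."""

    def __init__(self, state: str, exc: Optional[BaseException]) -> None:
        self.state = state
        self._exc = exc

    def exception(self) -> Optional[BaseException]:
        return self._exc


timed_out = FakeJob(
    "FAILED",
    RuntimeError("Job not requeued because: timed-out and not checkpointable"),
)
real_error = FakeJob("FAILED", ValueError("bad input"))

for job in (timed_out, real_error):
    if failed_from_timeout(job):
        print("timed out: treat the input as problematic and continue")
    else:
        print("actual error: surface it")
```

With a real job, the same helper would be called after the job has finished, and the calling program could skip that input and carry on when it returns True.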