Hello,
I'm wondering if there's a clean way to determine whether a job failed because Slurm timed it out or because of an "actual" error. As far as I can tell, I have to parse the error message and check whether it includes:
Job not requeued because: timed-out and not checkpointable
That works, but I'd be grateful for any advice if there's a better way to do this.
(The context is that I have a job that should end in X minutes. If the job takes longer than X minutes, it means there's a problem with the input, but I can't diagnose that problem before running the job. So the goal is to let my program continue running when a Slurm job fails only because it timed out.)
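For concreteness, here is a minimal sketch of the string-matching workaround described above. It assumes the failure reaches the calling program as a Python exception whose text contains the message quoted earlier; the names `TIMEOUT_MARKER`, `is_timeout_failure`, and `run_or_skip` are hypothetical and only for illustration.

```python
# Marker text taken from the error message quoted in this post.
TIMEOUT_MARKER = "Job not requeued because: timed-out and not checkpointable"


def is_timeout_failure(error: BaseException) -> bool:
    """Return True if the error text contains the timeout marker."""
    return TIMEOUT_MARKER in str(error)


def run_or_skip(run_job):
    """Call run_job() and return its result, or None if the job timed out.

    run_job is a hypothetical zero-argument callable that submits the job,
    waits for it, and raises an exception if the job failed.
    """
    try:
        return run_job()
    except Exception as error:
        if is_timeout_failure(error):
            # Timed out: treat as a problem with the input and keep going.
            return None
        # Anything else is an "actual" error, so let it propagate.
        raise
```

The obvious weakness is that this depends on the exact wording of the message, which is why I'm hoping there is something more robust.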
Thank you in advance!