desihub / desispec

DESI spectral pipeline
BSD 3-Clause "New" or "Revised" License

auto-resubmit FAILED jobs too? #2329

Open sbailey opened 3 weeks ago

sbailey commented 3 weeks ago

desispec.workflow.queue.get_resubmission_states does not include FAILED in the list of states to resubmit. desi_resubmit_queue_failures adds FAILED unless --dont-resub-failed is specified, but the auto-resubmission in desi_proc_night will not resubmit FAILED jobs.
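
For context, a minimal sketch of that asymmetry, assuming `get_resubmission_states()` takes no arguments and returns a plain list of Slurm state names (check the actual signature in `desispec.workflow.queue`); the manual-path logic below is paraphrased, not copied from `desi_resubmit_queue_failures`:

```python
# Sketch of the asymmetry between the two entry points; the exact contents of
# the returned list are an assumption, not a copy of the real implementation.
from desispec.workflow.queue import get_resubmission_states

auto_states = get_resubmission_states()   # used by desi_proc_night; no 'FAILED'

# desi_resubmit_queue_failures effectively does the equivalent of:
dont_resub_failed = False                 # the --dont-resub-failed flag
manual_states = list(auto_states)
if not dont_resub_failed and 'FAILED' not in manual_states:
    manual_states.append('FAILED')

print(auto_states)     # e.g. ['TIMEOUT', 'CANCELLED', ...] but not 'FAILED'
print(manual_states)   # the same states plus 'FAILED'
```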

The original reasoning might have been that we want to auto-resubmit jobs that fail for NERSC reasons, which usually end up in the CANCELLED or TIMEOUT state, but not jobs that fail for algorithmic reasons, which usually end up in the FAILED state, because those would likely just fail again. But 20240818 got into a bad state where arc job 29535157 failed with "Fatal error in PMPI_Init_thread: Other MPI error" and ended up in the FAILED state, which the desi_proc_night scronjob wouldn't resubmit.

Consider adding FAILED to the list of auto-retry states, since desi_proc_night only retries N times (N=2?) before giving up, so it won't get into an infinite resubmit-and-fail loop even if something is genuinely wrong.

akremin commented 3 weeks ago

As hypothesized, the distinction between the two codes was intentional: the resubmission script was meant as a tool to be run by a human after failed jobs had been "cleaned up," whereas the resubmission code built into the pipeline runs without human intervention. At the time, all known FAILED states were caused by issues that could not be resolved by simply retrying the same code on the same data (e.g. data-quality problems, bad input data, code bugs, etc.).

To directly address one of the questions: the code submits a job initially and then will automatically resubmit it up to two additional times (for a total of 3 attempts).
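
A tiny illustration of that attempt budget, using hypothetical names that are not taken from desispec:

```python
# One initial submission plus up to two automatic resubmissions = 3 attempts.
# MAX_ATTEMPTS and should_resubmit() are illustrative names only.
MAX_ATTEMPTS = 3

def should_resubmit(attempts_so_far: int) -> bool:
    """True if the pipeline still has automatic attempts left for this job."""
    return attempts_so_far < MAX_ATTEMPTS

for attempt in (1, 2, 3):
    print(attempt, should_resubmit(attempt))
# 1 True, 2 True, 3 False -> after the third attempt the pipeline gives up
```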

This new situation does make it seem as though there is a use case for resubmitting even FAILED jobs. In most cases the resubmission will be wasted computing because the job will simply fail again, but there aren't many jobs that genuinely fail anymore, so it would be a negligible part of our computing budget.

Here is my take on the situation:

Pros: If it automatically resolves issues even a small fraction of the time, I would view that as a win, since human time is much more valuable.

Cons: The code will give up if any job fails to process successfully after 3 attempts. So the downside is that if a FAILED job is a genuine pipeline failure, it will fail two more times, and that will then prevent any later jobs on that night from being resubmitted for e.g. timeouts. Currently we don't resubmit failures and therefore rarely reach 3 attempts on any given job, since timeouts are often resolved by resubmission (except in very bad NERSC conditions).

I have no strong opinion on this either way. We can always experiment with the resubmission of FAILED jobs and then turn it off if it becomes more of an issue than a solution.

sbailey commented 2 weeks ago

> Cons: The code will give up if any job fails to process successfully after 3 attempts. So the downside is that if a FAILED job is a genuine pipeline failure, it will fail two more times, and that will then prevent any later jobs on that night from being resubmitted for e.g. timeouts.

I don't understand this point. I thought the "up to 3 times" tracking was per job, not per night. It seems that even if one job fails 3 times in a row, that should just count against that job and give up only for that job, but any other jobs that don't depend on that job should still be (re)submitted regardless. Please clarify.

I think we want a situation where

  1. FAILED is in the category of states that are auto-resubmitted, even if that sometimes means we algorithmically fail 3x.
  2. Resubmission counting is tracked per job, such that one job failing 3x doesn't prevent other jobs from being (re)submitted unless they depend upon the failing job (see the sketch after this list).
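
A hedged sketch of the per-job bookkeeping requested in point 2, assuming a simple in-memory table of job records; none of these names come from desispec, they only illustrate the desired behavior:

```python
# Per-job attempt tracking: a job is skipped only if it, or one of its
# dependencies, has exhausted its own attempt budget. Illustrative only.
MAX_ATTEMPTS = 3

def resubmittable(name, jobs):
    """True if job `name` may still be (re)submitted."""
    job = jobs[name]
    if job['attempts'] >= MAX_ATTEMPTS:
        return False                      # this job has used up its own budget
    return all(jobs[dep]['state'] == 'COMPLETED' or resubmittable(dep, jobs)
               for dep in job['deps'])

jobs = {
    'arc':  {'attempts': 3, 'deps': [],      'state': 'FAILED'},   # gave up
    'flat': {'attempts': 1, 'deps': ['arc'], 'state': 'FAILED'},   # blocked by arc
    'sci':  {'attempts': 1, 'deps': [],      'state': 'TIMEOUT'},  # still retried
}
print({name: resubmittable(name, jobs) for name in jobs})
# {'arc': False, 'flat': False, 'sci': True} -- the independent job is still retried
```
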
akremin commented 2 weeks ago

The code logic is that if any job has failed 3 times, the script does not attempt to resubmit any job on that night.
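
For contrast, a hypothetical condensed form of that current night-level check (illustrative names only, not the real desispec variables):

```python
# Night-level check: give up on the whole night if ANY job has already used up
# its attempt budget. MAX_ATTEMPTS and the job records are stand-ins.
MAX_ATTEMPTS = 3

def night_can_resubmit(jobs):
    """True only if no job on the night has exhausted its attempts."""
    return all(job['attempts'] < MAX_ATTEMPTS for job in jobs)

jobs = [{'name': 'arc', 'attempts': 3}, {'name': 'sci', 'attempts': 1}]
print(night_can_resubmit(jobs))   # False: one exhausted job blocks the whole night
```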

tl;dr We can add FAILED to the list without much problem. Task (2) that you're asking for could be done, but it is not a trivial change and would require careful code changes and thorough testing. I don't think it is worth it for the small corner case that spurred this ticket.

The reason for this is that these jobs don't exist in a vacuum. Often a job is dropped because of a dependency failure. If that is the case, then we first resubmit the dependency and only then submit the later job that depended on it, using the new Slurm jobid of the resubmitted dependency so that Slurm can track the dependency. This is done recursively, since a dependency might itself have failed because its own dependency failed.
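
A schematic, self-contained sketch of that recursive logic, assuming a simple job table; `fake_sbatch`, the field names, and the state list are stand-ins, not the real desispec code:

```python
import itertools

# Recursively resubmit failed dependencies first, then resubmit the job itself
# against the *new* jobids of those dependencies. Control flow only.
_fake_jobids = itertools.count(1001)

def fake_sbatch(name, afterok):
    """Stand-in for a Slurm submission; returns a new fake jobid."""
    dep_str = ':'.join(str(j) for j in afterok) or 'none'
    print(f"submit {name} --dependency=afterok:{dep_str}")
    return next(_fake_jobids)

def resubmit_with_deps(name, table):
    """Resubmit `name`, fixing any failed dependency first (recursively)."""
    new_dep_ids = []
    for dep in table[name]['deps']:
        if table[dep]['state'] in ('FAILED', 'CANCELLED', 'TIMEOUT'):
            resubmit_with_deps(dep, table)       # fix the dependency first
        new_dep_ids.append(table[dep]['jobid'])
    table[name]['jobid'] = fake_sbatch(name, new_dep_ids)
    table[name]['state'] = 'PENDING'

table = {
    'arc':      {'deps': [],           'state': 'FAILED', 'jobid': 100},
    'psfnight': {'deps': ['arc'],      'state': 'FAILED', 'jobid': 101},
    'flat':     {'deps': ['psfnight'], 'state': 'FAILED', 'jobid': 102},
}
resubmit_with_deps('flat', table)
# -> resubmits arc first, then psfnight against the new arc jobid, then flat.
```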

If we were to check per job, then if e.g. an arc failed, we would resubmit that arc 3 times, then up to 3 more times each time one of its downstream jobs is resubmitted and pulls it back in as a dependency, and so on. So for calibrations, we could end up submitting a single job ~3×N times, where N is the number of jobs.
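
A rough back-of-the-envelope version of that worst case, with made-up numbers:

```python
# Worst case with a naive per-job check plus recursive dependency resubmission:
# a stubbornly failing arc is retried up to MAX_ATTEMPTS times on its own, and
# then up to MAX_ATTEMPTS more times for each of its N downstream jobs that
# recursively pulls it back in. Numbers are purely illustrative.
MAX_ATTEMPTS = 3
N_downstream = 10
worst_case_submissions = MAX_ATTEMPTS * (1 + N_downstream)
print(worst_case_submissions)   # 33 submissions of a job that will never succeed
```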

We could change how the code recursively submits jobs, but that would be a much larger update to the code and would require extensive testing. We could also consider moving the "failed 3 times" check to a much lower level and having the code give up at that point, but that would have implications for desi_resubmit_queue_failures, which uses the same underlying code. We need desi_resubmit_queue_failures to retain the ability to resubmit jobs after the initial 3 failures; otherwise we won't be able to clean up certain failure modes.