desihub / desispec

DESI spectral pipeline
BSD 3-Clause "New" or "Revised" License

make desi_resubmit_zpix aware of the queue #2270

Open sbailey opened 1 month ago

sbailey commented 1 month ago

Currently desi_resubmit_zpix compares the healpix slurm files on disk to the output files they should produce, but it doesn't distinguish between files that are missing because the job crashed or timed out and jobs that are still in the queue. Update desi_resubmit_zpix so that it can distinguish these cases and only resubmit jobs that have already run and failed.
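
For context, a minimal sketch of the disk-only check described above (the filename patterns and directory arguments are placeholders for illustration, not the actual desispec healpix layout or the real desi_resubmit_zpix code):

```python
import glob
import os

def find_candidates_for_resubmission(scriptdir, outdir):
    """Return slurm scripts whose expected output file is missing."""
    candidates = []
    for script in glob.glob(os.path.join(scriptdir, "zpix-*.slurm")):
        # hypothetical mapping from script name to its expected output file
        expected = os.path.join(outdir, os.path.basename(script).replace(".slurm", ".fits"))
        if not os.path.exists(expected):
            # could be a crash, a timeout, OR a job still sitting in the queue
            candidates.append(script)
    return candidates
```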

Two options:

  1. update desi_healpix_redshifts to generate DB-like files analogous to the tile-based processing_tables, tracking which healpix have been submitted, with which JOBIDs, and their state and history; plus update desi_resubmit_zpix to read and update that table as needed.
  2. update desi_healpix_redshifts to query the queue and use jobnames to determine which healpix are still running/pending (see the sketch after this list).
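
A rough sketch of option (2), assuming the healpix redshift jobnames follow a recognizable pattern (e.g. `zpix-<survey>-<program>-<healpix>`); the pattern and the squeue parsing here are assumptions for illustration, not existing desispec code:

```python
import subprocess

def pending_or_running_jobnames(user):
    """Return the set of jobnames this user currently has in the Slurm queue."""
    out = subprocess.run(
        ["squeue", "-u", user, "-h", "-o", "%j %T"],
        capture_output=True, text=True, check=True,
    ).stdout
    active = set()
    for line in out.splitlines():
        jobname, state = line.rsplit(maxsplit=1)
        if state in ("PENDING", "RUNNING", "REQUEUED", "CONFIGURING"):
            active.add(jobname)
    return active

# desi_resubmit_zpix could then skip any missing-output healpix whose
# jobname is still in `active`, and resubmit only the rest.
```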

(1) is more complete, but it requires close coordination between two different scripts and is more restrictive, because it wouldn't track any runs that happen outside the context of desi_healpix_redshifts + desi_resubmit_zpix. e.g. in Jura I have rerun several failed zpix in the interactive queue by just sourcing their slurm files rather than waiting behind thousands of other jobs in the queue. That would break the model for (1), though perhaps that could be accommodated with logic like "if the files exist, we don't care why; but if they don't exist, we use the queue table to check whether we know why they aren't there (job hasn't run vs. job crashed)".
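
A sketch of that hybrid rule, using hypothetical names (`tracked_jobs`, `queue_states`) that don't exist in desispec and are only here to illustrate the decision logic:

```python
def should_resubmit(healpix, output_exists, tracked_jobs, queue_states):
    """Decide whether a healpix job needs resubmission.

    tracked_jobs : dict mapping healpix -> last known Slurm JOBID (hypothetical)
    queue_states : dict mapping JOBID -> Slurm state string (hypothetical)
    """
    if output_exists:
        # Files exist: we don't care how they were produced
        # (batch queue, interactive node, manual rerun, ...).
        return False
    qid = tracked_jobs.get(healpix)
    if qid is None:
        # Never submitted through the tracked path: safe to (re)submit.
        return True
    state = queue_states.get(qid, "UNKNOWN")
    if state in ("PENDING", "RUNNING"):
        # Still working its way through the queue: leave it alone.
        return False
    # Job finished but the output is missing -> it crashed or timed out.
    return True
```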

(2) is more flexible, but would break if we were ever running two prods at the same time, since it can't tell from the jobname alone which prod it goes with.

Or maybe there is some 3rd option. Pinging @akremin for ideas since he put together the tile-based job tracking infrastructure.

akremin commented 1 month ago

As the person who implemented the tile-based version of (1), I am biased toward that solution. Though I acknowledge the drawback cited above -- it is nice to have the flexibility to run things on e.g. an interactive node without having to hack the database entries afterwards. I think those instances are rare enough that (1) is viable and in some sense the most robust. We can also potentially reuse some of the resubmission logic used for the tile-based processing.
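
As a loose illustration of what a healpix analogue of the processing table might record (the column names below are illustrative guesses, not the actual tile-based processing_table schema):

```python
from astropy.table import Table

# One row per submitted healpix redshift job, updated when jobs are
# (re)submitted or when their Slurm state changes.
hpix_proctable = Table(
    names=("HEALPIX", "SURVEY", "PROGRAM", "LATEST_QID", "STATUS", "ALL_QIDS"),
    dtype=(int, "U16", "U16", int, "U16", "U64"),
)
hpix_proctable.add_row((10016, "main", "dark", 12345678, "PENDING", "12345678"))
```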

(2) could also work. We could get around the drawback of multiple productions by adding SPECPROD to the jobnames. The jobnames are user-defined in the scripts, and I don't think they're used anywhere else.
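
Something like the following, where the naming convention is just a suggestion rather than anything desi_healpix_redshifts currently does:

```python
import os

def zpix_jobname(healpix, survey, program, specprod=None):
    """Build a jobname that encodes the production it belongs to."""
    specprod = specprod or os.environ.get("SPECPROD", "unknown")
    return f"zpix-{specprod}-{survey}-{program}-{healpix}"

# e.g. "zpix-jura-main-dark-10016"; desi_resubmit_zpix would then only
# consider queue entries whose jobname starts with "zpix-<current SPECPROD>-".
```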

I can't think of another way to determine job status beyond querying Slurm, and I don't think we should invent our own job tracking, so the two options above are the most straightforward.