desihub / desispec

DESI spectral pipeline
BSD 3-Clause "New" or "Revised" License

desi_proc_night production mode checking sacct #2220

Closed sbailey closed 2 months ago

sbailey commented 2 months ago

When desi_proc_night -n NIGHT is submitting an entire night in non-daily mode, it checks SLURM job dependencies with sacct prior to submitting each script, e.g.

INFO:processing.py:582:create_batch_script: Outfile is: /global/cfs/cdirs/desi/users/sjbailey/spectro/redux/badampcals/run/scripts/tiles/cumulative/10792/20240408/ztile-10792-thru20240408.slurm
INFO:queue.py:243:queue_info_from_qids: Querying Slurm with the following: sacct -X --parsable2 --delimiter=, --format=jobid,state -j 24595817
INFO:processing.py:703:submit_batch_script: ['sbatch', '--parsable', '--dependency=afterok:24595817', '/global/cfs/cdirs/desi/users/sjbailey/spectro/redux/badampcals/run/scripts/tiles/cumulative/10792/20240408/ztile-10792-thru20240408.slurm']
INFO:processing.py:704:submit_batch_script: Submitted ztile-10792-thru20240408.slurm with dependencies --dependency=afterok:24595817 and reservation=None. Returned qid: 24595819

Since this check adds a non-negligible delay to each submission, is this really necessary in production mode?

It is necessary in daily mode and for desi_resubmit_queue_failures, since Slurm has a somewhat short memory for tracking --dependency=afterok:JOBID and it might accept and run the new job if JOBID failed more than N hours ago (I forget what N is, but it is small enough that we had to put in explicit checks ourselves).
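A minimal sketch of the kind of pre-submission check being discussed, assuming sacct output in the format shown in the log above (`--parsable2 --delimiter=, --format=jobid,state`). The function and state names are illustrative, not the actual desispec API:

```python
# Hypothetical sketch: decide whether --dependency=afterok:<qid> is still safe
# by parsing output of:
#   sacct -X --parsable2 --delimiter=, --format=jobid,state -j <qid>
# States here are illustrative; the real desispec logic lives in queue.py.

GOOD_STATES = {"COMPLETED", "RUNNING", "PENDING", "REQUEUED"}

def parse_sacct_states(sacct_output: str) -> dict:
    """Map jobid -> state from parsable2 sacct output (header row + data rows)."""
    states = {}
    for row in sacct_output.strip().splitlines()[1:]:  # skip "JobID,State" header
        jobid, state = row.split(",", 1)
        # States like "CANCELLED by 12345" carry a suffix; keep the first word
        states[jobid] = state.split()[0]
    return states

def safe_to_depend_on(qid: str, sacct_output: str) -> bool:
    """True if afterok:<qid> can still succeed, i.e. the job has not failed."""
    state = parse_sacct_states(sacct_output).get(qid)
    # If Slurm has already forgotten the job, afterok cannot be trusted either
    if state is None:
        return False
    return state in GOOD_STATES
```

The point of the check is the `state is None` branch: once Slurm forgets a failed JOBID, an afterok dependency on it would no longer block the new job.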

But in production mode, the dependent job was likely submitted just a few minutes ago, and even if it has already run and failed, Slurm would still remember it. Possible exceptions:

Even if there are corner cases where, strictly speaking, we would need to check these ourselves, we might still consider a desi_proc_night --skip-depcheck option to skip this usually-unnecessary check and submit nights more rapidly.

@kremin, what do you think?

akremin commented 2 months ago

My opinion depends on the magnitude of the delays. If we're talking about a minute per night, it may not be worth it. But if it is several minutes, then I would want to add this fix. Note that this requires passing the new option down several levels to a low-level function.

If memory serves, the job state is actually forgotten within minutes (10-20, perhaps?). But in most cases that is sufficient time for all jobs to be submitted. Once jobs are submitted, any subsequent dependency failure triggers the dependent jobs to be canceled, as desired.

All of your assessments are correct. Given that most linking is to the most recent previous night, hopefully both nights are submitted within 20 minutes of each other. We'd have to be particularly unlucky for a linked night not to be submitted within 20 minutes and for it to also have a failure that the second night depended on.

If we do move forward with the flag: I agree that opt-out (--skip-depcheck) is the logical choice, since we want to check by default. But since --daily can set flags for us, we could instead make it opt-in (--do-depcheck), so that under normal use we never need to set it explicitly on the command line or in scripts (it would be implied by --daily). Or we could use a tri-state boolean flag that defaults to None when not set explicitly, letting the code do the right thing in each circumstance.
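The tri-state idea could be sketched with argparse along these lines. The flag names follow the discussion above but are not (yet) real desi_proc_night options, and the resolution rule (check in daily mode, skip otherwise) is an assumption:

```python
# Hypothetical sketch of a tri-state depcheck flag: explicit flags win,
# otherwise the value stays None and the mode decides the default.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--daily", action="store_true")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--do-depcheck", dest="depcheck",
                       action="store_true", default=None)
    group.add_argument("--skip-depcheck", dest="depcheck",
                       action="store_false", default=None)
    return parser

def resolve_depcheck(args: argparse.Namespace) -> bool:
    """Explicit flag wins; otherwise check in daily mode, skip in production."""
    if args.depcheck is not None:
        return args.depcheck
    return args.daily
```

The mutually exclusive group keeps --do-depcheck and --skip-depcheck from being combined, while the shared `dest` with `default=None` preserves the "not set explicitly" state.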

sbailey commented 2 months ago

Improvements were implemented by @akremin in PR #2228, which keeps a cache of job states so that we don't have to re-query sacct for every sbatch call.
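The caching approach could look roughly like this sketch, which queries only for job ids not already cached. The class and the injectable query function are illustrative, not the actual desispec implementation from PR #2228:

```python
# Hypothetical sketch: cache jobid -> state across sbatch calls so sacct is
# queried once per batch of unseen qids instead of once per submission.
from typing import Callable, Dict, Iterable, List

class QueueStateCache:
    def __init__(self, query_fn: Callable[[List[str]], Dict[str, str]]):
        # query_fn would wrap one bulk sacct call, e.g.
        #   sacct -X --parsable2 --format=jobid,state -j qid1,qid2,...
        self._query = query_fn
        self._states: Dict[str, str] = {}

    def get_states(self, qids: Iterable[str]) -> Dict[str, str]:
        """Return jobid -> state, querying only for qids not yet cached."""
        qids = list(qids)
        missing = [q for q in qids if q not in self._states]
        if missing:
            self._states.update(self._query(missing))
        return {q: self._states.get(q) for q in qids}
```

One design caveat with any such cache: a cached PENDING/RUNNING state can go stale, so a real implementation would need to decide when cached entries are trustworthy enough to skip re-querying.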