Closed by sbailey 2 months ago
My opinion changes depending on the magnitude of the delays. If we're talking about a minute per night, I may not think it's worth it. But if it is several minutes, then I would want to add this fix. This does require passing the new option down several levels to a low-level function.
If memory serves, the job state is actually forgotten within minutes (10-20 perhaps?). But that is sufficient time in most cases for all jobs to be submitted. Once jobs are submitted, any subsequent dependency failures trigger the jobs to be canceled, as desired.
All of your assessments are correct. Given that most linking is to the most recent previous night, hopefully both nights are submitted within 20 minutes of each other. We'd need to be particularly unlucky to have a linked night not submitted within 20 minutes and also have a failure that the second night depended on.
If we do move forward with the flag: I agree that opt-out (--skip-depcheck) is the logical choice since we want to be checking by default, but since --daily can set flags for us we could consider having it opt-in (--do-depcheck) such that under normal use we never need to explicitly set it on the command line or in scripts (because it would be implicit in --daily). Or use the boolean flag type where if not set explicitly it is None and the code does the right thing in each circumstance.
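For illustration, here is a minimal sketch of the tri-state option using argparse. The flag name --depcheck, the parser, and the helper resolve_depcheck are hypothetical and are not the actual desi_proc_night interface. The rule that an unset flag follows --daily is also an assumption about what "the right thing" would be.

```python
import argparse


def build_parser():
    """Hypothetical parser; not the real desi_proc_night argument list."""
    parser = argparse.ArgumentParser(prog="proc_night_sketch")
    parser.add_argument("--daily", action="store_true",
                        help="run in daily mode")
    # BooleanOptionalAction (Python 3.9+) creates both --depcheck and
    # --no-depcheck; default=None lets us tell "not set" from True/False.
    parser.add_argument("--depcheck", action=argparse.BooleanOptionalAction,
                        default=None,
                        help="check job dependencies before each submission")
    return parser


def resolve_depcheck(args):
    """Return the effective setting for the dependency check.

    Assumption: when the flag is not given, check in daily mode and
    skip otherwise. An explicit flag always wins.
    """
    if args.depcheck is None:
        return args.daily
    return args.depcheck
```

With this shape, neither daily scripts nor production scripts need to mention the flag, while either mode can still override the default explicitly.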
Improvements implemented by @akremin in PR #2228 by keeping a cache of job states so that we don't have to re-query sacct for every sbatch call.
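As a rough sketch of the caching idea only (not the code from the PR), one could remember job states that can no longer change and re-query only the rest. The class name, the injected query function, and the choice of which states count as final are all assumptions made for this example.

```python
from typing import Callable, Dict, Iterable, List

# Slurm states treated here as final, i.e. safe to cache indefinitely.
# Which states to include is an assumption of this sketch.
FINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT",
                "OUT_OF_MEMORY", "NODE_FAIL"}


class JobStateCache:
    """Hypothetical cache of job states keyed by job ID."""

    def __init__(self, query: Callable[[List[int]], Dict[int, str]]):
        # `query` stands in for whatever actually calls sacct.
        self._query = query
        self._states: Dict[int, str] = {}

    def get_states(self, jobids: Iterable[int]) -> Dict[int, str]:
        jobids = list(jobids)
        # Only jobs that are unknown or not yet in a final state are re-queried.
        missing = [j for j in jobids
                   if self._states.get(j) not in FINAL_STATES]
        if missing:
            self._states.update(self._query(missing))
        return {j: self._states.get(j, "UNKNOWN") for j in jobids}
```

When most dependencies have already finished, repeated submissions in the same session would then trigger few or no additional sacct calls.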
When desi_proc_night -n NIGHT is submitting an entire night in non-daily mode, it checks SLURM job dependencies with sacct prior to submitting each script. Since this check adds a non-negligible delay to each submission, is this really necessary in production mode?
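To make the kind of check under discussion concrete, here is a hedged sketch of a pre-submission dependency test. It is not the desispec implementation. It assumes the states were obtained with something like `sacct -X -n -P -o jobid,state -j ID1,ID2`, which prints one `JOBID|STATE` line per job; the function names and the set of failure states are hypothetical.

```python
# States treated as "dependency failed" in this sketch (an assumption).
BAD_STATES = {"FAILED", "CANCELLED", "TIMEOUT", "OUT_OF_MEMORY", "NODE_FAIL"}


def parse_sacct(text):
    """Parse pipe-delimited `jobid|state` lines into {jobid: state}."""
    states = {}
    for line in text.strip().splitlines():
        if "|" not in line:
            continue  # skip anything that is not a jobid|state record
        jobid, state = line.split("|", 1)
        words = state.split()
        if not words:
            continue  # no state reported for this job
        # sacct can report e.g. "CANCELLED by 12345"; keep the first word.
        states[jobid.strip()] = words[0]
    return states


def failed_dependencies(dep_ids, states):
    """Return the dependency job IDs whose recorded state is a failure."""
    return [j for j in dep_ids if states.get(str(j)) in BAD_STATES]
```

A caller would skip or flag the new submission whenever failed_dependencies returns a non-empty list; the cost being debated is the sacct call needed to produce the input text before every sbatch.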
It is necessary in daily mode and for desi_resubmit_queue_failures, since slurm has a somewhat short memory for tracking --dependency=afterok:JOBID and it might accept and run the new job if JOBID failed more than N hours ago (I forget what N is, but it is small enough that we had to put in explicit checks ourselves). But for production mode, the job being depended on was likely submitted just a few minutes ago, and even if it already ran and failed slurm would still remember it. Possible exceptions:
sbatch ... --dependency=afterok:JOBID would fail at submission time, causing desi_proc_night to crash instead of submitting all jobs. That requires more cleanup than just running desi_resubmit_queue_failures.
Even if there are corner cases where strictly we would need to check these ourselves, we might still consider a desi_proc_night --skip-depcheck option to skip this usually-unnecessary check to submit nights more rapidly. @kremin, what do you think?