Closed akremin closed 2 months ago
This looks good, though I'm itching to write a few unit tests for it. I'll leave it open until either we have added some tests exercising the caching, or we need to launch a miniprod and/or make a tag and need this to be merged even without unit tests.
This PR reduces the sleep time between job commits from 0.5s to 0.1s. It also introduces caching of Slurm jobid states, which can be checked before job submission instead of querying `sacct`. Since querying one ID or ten IDs took roughly the same amount of time, I chose to re-query all jobids if any jobid in the list is not in the cache. This allows us to update information without loss of time and simplifies the code.

The reason we need to check `sacct` is that Slurm has two databases: one for operations that tracks current jobs, and one long-term database for storing job information. Roughly 10 minutes after a job stops, it is purged from the operational database and can only be identified via the long-term database using `sacct`. At that point, a new job submitted to depend on the old job with `--dependency=afterok:<JOBID>` will be refused, since `<JOBID>` is no longer in the operational database and Slurm doesn't know how to depend on it without that information. Thus, we need to remove these finished jobs and only pass running or pending jobs to Slurm as dependencies.

I tested this running one night in my personal prod using:
> desi_proc_night -n 20240409 &>>/global/cfs/cdirs/desi/spectro/redux/kremin/run/20240409.log
The queue was filled with jobs with proper dependencies handled:
And the code is correctly using the cached states rather than querying `sacct`:
> grep "get_queue_states_from_qid" 20240409.log
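For readers unfamiliar with the caching strategy described above, here is a minimal sketch of the "re-query everything on any cache miss" idea. It is not the code in this PR: `JobStateCache`, `get_states`, and `fake_query` are hypothetical names, and `fake_query` stands in for the real `sacct` lookup.

```python
from typing import Callable, Dict, Iterable, List


class JobStateCache:
    """Hypothetical in-memory cache of Slurm job states keyed by job ID."""

    def __init__(self, query_fn: Callable[[List[int]], Dict[int, str]]):
        # query_fn stands in for the expensive sacct lookup (assumption).
        self._query_fn = query_fn
        self._states: Dict[int, str] = {}

    def get_states(self, jobids: Iterable[int]) -> Dict[int, str]:
        jobids = list(jobids)
        # If any requested ID is missing, refresh all requested IDs in a
        # single query, since one ID costs about as much as ten.
        if any(jobid not in self._states for jobid in jobids):
            self._states.update(self._query_fn(jobids))
        return {j: self._states[j] for j in jobids if j in self._states}


# Stand-in for sacct that records which IDs were queried on each call.
calls: List[List[int]] = []


def fake_query(jobids: List[int]) -> Dict[int, str]:
    calls.append(list(jobids))
    return {jobid: "PENDING" for jobid in jobids}


cache = JobStateCache(fake_query)
first = cache.get_states([101, 102])   # cache miss: one query for both IDs
second = cache.get_states([101, 102])  # cache hit: no query
third = cache.get_states([101, 103])   # 103 is new: both IDs are re-queried
```

In this sketch the third call re-queries job 101 even though it was already cached, which keeps its state fresh at no extra cost and avoids tracking which individual IDs are missing.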
And the `sbatch` calls were correct in that they still contained the dependencies (since they were all just submitted and therefore not complete).
> grep "sbatch" 20240409.log
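The dependency filtering described in the PR text could look roughly like the sketch below. It assumes that only the `PENDING` and `RUNNING` states count as still present in the operational database, which is a simplification. `ACTIVE_STATES` and `build_dependency_flag` are hypothetical names, not functions from this PR.

```python
from typing import Dict

# Assumption: only these Slurm states mean the job can still be depended on.
ACTIVE_STATES = {"PENDING", "RUNNING"}


def build_dependency_flag(dep_states: Dict[int, str]) -> str:
    """Return an sbatch dependency flag listing only still-active jobs.

    dep_states maps each candidate dependency job ID to its Slurm state.
    Finished jobs are dropped so that sbatch is never asked to depend on
    a job that may already have been purged.
    """
    active = [str(jobid) for jobid, state in dep_states.items()
              if state in ACTIVE_STATES]
    if not active:
        # Nothing left to wait on: submit without a dependency flag.
        return ""
    return "--dependency=afterok:" + ":".join(active)


flag = build_dependency_flag({101: "COMPLETED", 102: "RUNNING", 103: "PENDING"})
no_flag = build_dependency_flag({101: "COMPLETED"})
```

Here `flag` contains only jobs 102 and 103, and `no_flag` is an empty string. The sketch does not cover a dependency that has failed, which would need separate handling.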