Reduce submission times with lower sleep and slurm state caching

This PR reduces the sleep time between job submissions from 0.5s to 0.1s. It also introduces caching of Slurm jobid states, which can be checked before job submission instead of querying sacct. Since querying one ID takes roughly as long as querying ten, I chose to re-query all jobids in a list if any jobid in that list is missing from the cache. This refreshes the cached information at no extra cost in time and simplifies the code.

The reason we need to check sacct is that Slurm has two databases: an operational one that tracks current jobs, and a long-term one that stores job information. Roughly 10 minutes after a job stops, it is purged from the operational database and can only be identified via the long-term database using sacct. At that point, a new job submitted with --dependency=afterok:<JOBID> will be refused, since <JOBID> is no longer in the operational database and Slurm doesn't know how to depend on it without that information. Thus, we need to remove finished jobs from the dependency list and only pass running or pending jobs to Slurm as dependencies.
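The filtering step can be sketched as follows. This is an illustration under assumptions, not the PR's implementation: `build_sbatch_command` and `ACTIVE_STATES` are hypothetical names, and the sketch assumes that only jobs reported as PENDING or RUNNING are safe to depend on.

```python
from typing import Dict, Iterable, List

# Assumption: jobs in these states are still in the operational database.
ACTIVE_STATES = {"PENDING", "RUNNING"}


def build_sbatch_command(script: str,
                         dep_qids: Iterable[int],
                         states: Dict[int, str]) -> List[str]:
    """Build an sbatch argument list that depends only on active jobs."""
    # Drop dependencies whose state is missing or not pending/running.
    active = [q for q in dep_qids if states.get(q) in ACTIVE_STATES]

    cmd = ["sbatch", "--parsable"]
    if active:
        # Multiple job IDs are joined with ':' after a single afterok.
        cmd.append("--dependency=afterok:" + ":".join(str(q) for q in active))
    cmd.append(script)
    return cmd
```

Note that this sketch drops every non-active dependency regardless of how the job ended, so a caller would need to handle failed upstream jobs separately before submitting.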

I tested this by running one night in my personal prod using:

`> desi_proc_night -n 20240409 &>>/global/cfs/cdirs/desi/spectro/redux/kremin/run/20240409.log`

The queue was filled with jobs with proper dependencies handled:

JOBID            ST USER      NAME          NODES TIME_LIMIT       TIME  SUBMIT_TIME          QOS             START_TIME           FEATURES       NODELIST(REASON

24969241         PD kremin    tilenight-20  1          21:00       0:00  2024-04-29T16:41:22  gpu_regular     N/A                  gpu&a100       (Dependency)   
24969239         PD kremin    ztile-6200-t  1          25:00       0:00  2024-04-29T16:41:21  gpu_regular     N/A                  gpu&a100       (Dependency)   
24969238         PD kremin    tilenight-20  1          21:00       0:00  2024-04-29T16:41:21  gpu_regular     N/A                  gpu&a100       (Dependency)   
24969237         PD kremin    ztile-20356-  1          25:00       0:00  2024-04-29T16:41:20  gpu_regular     N/A                  gpu&a100       (Dependency)   
24969235         PD kremin    tilenight-20  1          21:00       0:00  2024-04-29T16:41:19  gpu_regular     N/A                  gpu&a100       (Dependency)   

[...]

24969212         PD kremin    flat-2024040  1          20:00       0:00  2024-04-29T16:41:02  gpu_regular     N/A                  gpu&a100       (Dependency)   
24969211         PD kremin    flat-2024040  1          20:00       0:00  2024-04-29T16:40:59  gpu_regular     N/A                  gpu&a100       (Dependency)   
24969208         PD kremin    flat-2024040  1          20:00       0:00  2024-04-29T16:40:58  gpu_regular     N/A                  gpu&a100       (Dependency)   
24969194         PD kremin    flat-2024040  1          20:00       0:00  2024-04-29T16:40:52  gpu_regular     N/A                  gpu&a100       (Dependency)   
24969224         PD kremin    nightlyflat-  1           5:00       0:00  2024-04-29T16:41:12  regular_1       N/A                  cpu            (Dependency)   
24969187         PD kremin    psfnight-202  1           5:00       0:00  2024-04-29T16:40:47  regular_1       N/A                  cpu            (Dependency)   
24969186         PD kremin    arc-20240409  3          45:00       0:00  2024-04-29T16:40:43  regular_1       N/A                  cpu            (Dependency)   
24969182         PD kremin    arc-20240409  3          45:00       0:00  2024-04-29T16:40:36  regular_1       N/A                  cpu            (Dependency)   
24969180         PD kremin    arc-20240409  3          45:00       0:00  2024-04-29T16:40:31  regular_1       N/A                  cpu            (Dependency)   
24969179         PD kremin    arc-20240409  3          45:00       0:00  2024-04-29T16:40:28  regular_1       N/A                  cpu            (Dependency)   
24969177         PD kremin    arc-20240409  3          45:00       0:00  2024-04-29T16:40:26  regular_1       N/A                  cpu            (Dependency)   
24969175         PD kremin    ccdcalib-202  2          15:00       0:00  2024-04-29T16:40:25  regular_1       N/A                  cpu            (Priority)

And the code is correctly using the cached states rather than querying sacct:

`> grep "get_queue_states_from_qid" 20240409.log`

INFO:queue.py:306:get_queue_states_from_qids: All Slurm qids=array([24969175]) are cached. Using cached values.
INFO:queue.py:306:get_queue_states_from_qids: All Slurm qids=array([24969175]) are cached. Using cached values.
INFO:queue.py:306:get_queue_states_from_qids: All Slurm qids=array([24969175]) are cached. Using cached values.
INFO:queue.py:306:get_queue_states_from_qids: All Slurm qids=array([24969175]) are cached. Using cached values.
INFO:queue.py:306:get_queue_states_from_qids: All Slurm qids=array([24969175]) are cached. Using cached values.
INFO:queue.py:306:get_queue_states_from_qids: All Slurm qids=array([24969177, 24969179, 24969180, 24969182, 24969186]) are cached. Using cached values.
INFO:queue.py:306:get_queue_states_from_qids: All Slurm qids=array([24969187]) are cached. Using cached values.
INFO:queue.py:306:get_queue_states_from_qids: All Slurm qids=array([24969187]) are cached. Using cached values.

[...]

INFO:queue.py:306:get_queue_states_from_qids: All Slurm qids=array([24969233]) are cached. Using cached values.
INFO:queue.py:306:get_queue_states_from_qids: All Slurm qids=array([24969224]) are cached. Using cached values.
INFO:queue.py:306:get_queue_states_from_qids: All Slurm qids=array([24969235]) are cached. Using cached values.
INFO:queue.py:306:get_queue_states_from_qids: All Slurm qids=array([24969224]) are cached. Using cached values.
INFO:queue.py:306:get_queue_states_from_qids: All Slurm qids=array([24969238]) are cached. Using cached values.
INFO:queue.py:306:get_queue_states_from_qids: All Slurm qids=array([24969224]) are cached. Using cached values.

And the sbatch calls were correct in that they still contained the dependencies (since they were all just submitted and therefore not complete).

`> grep "sbatch" 20240409.log`

INFO:processing.py:709:submit_batch_script: ['sbatch', '--parsable', '/global/cfs/cdirs/desi/spectro/redux/kremin/run/scripts/night/20240409/ccdcalib-20240409-00235120-a0123456789.slurm']
INFO:processing.py:709:submit_batch_script: ['sbatch', '--parsable', '--dependency=afterok:24969175', '/global/cfs/cdirs/desi/spectro/redux/kremin/run/scripts/night/20240409/arc-20240409-00235125-a0123456789.slurm']
INFO:processing.py:709:submit_batch_script: ['sbatch', '--parsable', '--dependency=afterok:24969175', '/global/cfs/cdirs/desi/spectro/redux/kremin/run/scripts/night/20240409/arc-20240409-00235126-a0123456789.slurm']
INFO:processing.py:709:submit_batch_script: ['sbatch', '--parsable', '--dependency=afterok:24969175', '/global/cfs/cdirs/desi/spectro/redux/kremin/run/scripts/night/20240409/arc-20240409-00235127-a0123456789.slurm']
INFO:processing.py:709:submit_batch_script: ['sbatch', '--parsable', '--dependency=afterok:24969175', '/global/cfs/cdirs/desi/spectro/redux/kremin/run/scripts/night/20240409/arc-20240409-00235128-a0123456789.slurm']
INFO:processing.py:709:submit_batch_script: ['sbatch', '--parsable', '--dependency=afterok:24969175', '/global/cfs/cdirs/desi/spectro/redux/kremin/run/scripts/night/20240409/arc-20240409-00235129-a0123456789.slurm']
INFO:processing.py:709:submit_batch_script: ['sbatch', '--parsable', '--dependency=afterok:24969177:24969179:24969180:24969182:24969186', '/global/cfs/cdirs/desi/spectro/redux/kremin/run/scripts/night/20240409/psfnight-20240409-00235125-a0123456789.slurm']
INFO:processing.py:709:submit_batch_script: ['sbatch', '--parsable', '--dependency=afterok:24969187', '/global/cfs/cdirs/desi/spectro/redux/kremin/run/scripts/night/20240409/flat-20240409-00235141-a0123456789.slurm']
INFO:processing.py:709:submit_batch_script: ['sbatch', '--parsable', '--dependency=afterok:24969187', '/global/cfs/cdirs/desi/spectro/redux/kremin/run/scripts/night/20240409/flat-20240409-00235142-a0123456789.slurm']

[...]

INFO:processing.py:709:submit_batch_script: ['sbatch', '--parsable', '--dependency=afterok:24969224', '/global/cfs/cdirs/desi/spectro/redux/kremin/run/scripts/night/20240409/tilenight-20240409-20329.slurm']
INFO:processing.py:709:submit_batch_script: ['sbatch', '--parsable', '--dependency=afterok:24969233', '/global/cfs/cdirs/desi/spectro/redux/kremin/run/scripts/tiles/cumulative/20329/20240409/ztile-20329-thru20240409.slurm']
INFO:processing.py:709:submit_batch_script: ['sbatch', '--parsable', '--dependency=afterok:24969224', '/global/cfs/cdirs/desi/spectro/redux/kremin/run/scripts/night/20240409/tilenight-20240409-20356.slurm']
INFO:processing.py:709:submit_batch_script: ['sbatch', '--parsable', '--dependency=afterok:24969235', '/global/cfs/cdirs/desi/spectro/redux/kremin/run/scripts/tiles/cumulative/20356/20240409/ztile-20356-thru20240409.slurm']
INFO:processing.py:709:submit_batch_script: ['sbatch', '--parsable', '--dependency=afterok:24969224', '/global/cfs/cdirs/desi/spectro/redux/kremin/run/scripts/night/20240409/tilenight-20240409-6200.slurm']
INFO:processing.py:709:submit_batch_script: ['sbatch', '--parsable', '--dependency=afterok:24969238', '/global/cfs/cdirs/desi/spectro/redux/kremin/run/scripts/tiles/cumulative/6200/20240409/ztile-6200-thru20240409.slurm']

desihub / desispec

Reduce submission times with lower sleep and slurm state caching #2228