desihub / desida

Scripts and configuration for DESI Data Assemblies and Releases.
BSD 3-Clause "New" or "Revised" License

add script desi_eval_prod_jobs to summarize production job info #14

Closed sbailey closed 3 weeks ago

sbailey commented 3 weeks ago

This PR adds a script desi_eval_prod_jobs which will query SLURM for all jobs in a production (including failed, canceled, etc.) so that they can be saved to an output file for later analysis, even after SLURM has forgotten about those jobids (something like 6 months later at NERSC).

Example outputs in /global/cfs/cdirs/desi/users/sjbailey/dev/qinfo generated with:

desi_eval_prod_jobs -s kibo -o kibo-jobs.csv --summary kibo-summary.csv

Currently this keeps the columns:

JOBID,JOBNAME,PARTITION,CONSTRAINTS,NNODES,SUBMIT,ELIGIBLE,START,END,ELAPSED,STATE,EXITCODE,JOBDESC,NODE_HOURS,GPU

Most of those are taken directly from sacct, but the following are derived for convenience:

@akremin do you think this is the right set of columns to keep? CONSTRAINTS and PARTITION might be extraneous. Are there other columns we should keep?
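A query of this kind can be sketched with subprocess around sacct; the helper names and column set below are illustrative, not the PR's actual implementation:

```python
import subprocess

# Illustrative column subset; the real script requests more fields from sacct
COLUMNS = ("JobID", "JobName", "Partition", "Submit", "Start",
           "End", "Elapsed", "State", "ExitCode")

def parse_sacct(text, columns=COLUMNS):
    """Turn sacct --parsable2 (pipe-delimited) output into row dicts."""
    return [dict(zip(columns, line.split("|")))
            for line in text.splitlines() if line.strip()]

def sacct_jobs(jobids, columns=COLUMNS):
    """Run sacct for a batch of job IDs: allocations only (-X),
    pipe-delimited (--parsable2), no header line."""
    cmd = ["sacct", "-X", "--parsable2", "--noheader",
           "--format=" + ",".join(columns),
           "--jobs=" + ",".join(str(j) for j in jobids)]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_sacct(out.stdout, columns)
```

Caching the parsed rows to CSV is what preserves the information past SLURM's accounting retention window.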

desi_eval_prod_jobs also prints a summary of states per jobdesc; that was saved separately into kibo-summary.csv, but it can also be re-generated by re-reading the full jobs table without having to re-query slurm. This was useful for debugging, and also may be useful if we want to update the summary info in a later iteration:

[login29 qinfo] desi_eval_prod_jobs -i kibo-jobs.csv
  JOBDESC   CPUGPU NODE_HOURS PERCENT COMPLETED TIMEOUT FAILED CANCELLED NODE_FAIL
----------- ------ ---------- ------- --------- ------- ------ --------- ---------
    linkcal    cpu        0.1     0.0        17       0      1         5         0
nightlybias    cpu        0.2     0.0         3       0      0         0         0
   ccdcalib    cpu      120.1     1.1       768       5     10         6         0
        arc    cpu     1758.0    15.4      3834       7     69        90         0
   psfnight    cpu       15.4     0.1       767       4      0        80         0
       flat    gpu     1527.3    13.4      9803      17      8      1064         0
nightlyflat    gpu       30.2     0.3       762      10      1       225         0
  tilenight    gpu     4076.1    35.8     17124      23     11      5711         1
 cumulative    gpu     2658.8    23.3     14179       7    168      4838         1
       zpix    gpu     1210.5    10.6      5698       1     11        10         0
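The per-JOBDESC summary above can be regenerated from the cached jobs table without re-querying SLURM; the grouping step can be sketched like this (the summarize_jobs helper is hypothetical, not the PR's actual code):

```python
from collections import defaultdict

def summarize_jobs(rows):
    """Tally job STATE counts and total NODE_HOURS per JOBDESC from
    cached sacct rows (e.g. read back with csv.DictReader)."""
    states = defaultdict(lambda: defaultdict(int))
    hours = defaultdict(float)
    for row in rows:
        # sacct reports e.g. "CANCELLED by 12345"; keep only the first word
        state = row["STATE"].split()[0]
        states[row["JOBDESC"]][state] += 1
        hours[row["JOBDESC"]] += float(row["NODE_HOURS"])
    return states, hours
```

Feeding it the rows from kibo-jobs.csv would yield the COMPLETED/TIMEOUT/FAILED/CANCELLED counts and NODE_HOURS totals shown in the table.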
sbailey commented 3 weeks ago

Thanks @akremin. I have added updates for all 3 suggestions.

weaverba137 commented 3 weeks ago

desida is a Python package, so my preference would be to create a file in py/desida and then have bin/desi_eval_prod_jobs be a minimalist sys.exit(main()) script. You could also add unit tests more easily that way.
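The suggested layout looks something like this; a stand-in main() is defined inline so the sketch is self-contained, whereas the real script would import it from the package (e.g. desida.prodjobs):

```python
#!/usr/bin/env python
"""bin/desi_eval_prod_jobs: thin wrapper so the logic stays importable
and unit-testable from the package."""
import sys

def main(args=None):
    # Stand-in for the package-level main(); in the real script this
    # would be: from desida.prodjobs import main
    return 0

if __name__ == "__main__":
    sys.exit(main())
```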

sbailey commented 3 weeks ago

I agree with moving scripts into py/desida/scripts when they need to be called from other pipelining code or have pieces to be re-used, and to enable unit tests, but I'm not sure it is worth it for scripts like this that will be run approximately once a year for major prods.

In particular for unit tests I don't think it is worth it to mock up fake slurm return calls etc. since the most likely failure points that we could/should test would be things like a slurm update changing its formatting and breaking parsing, or an update to production formats breaking parsing. That could only be tested at NERSC using a mini-prod built upon recent nights of daily... I'd rather spend time expanding unit test coverage of desispec itself.

I admit that I'm partially motivated with wanting to move on with a contribution of a "minimal viable product" script that works and provides useful functionality, and only add additional features if/when they are needed. @akremin can you be the tie breaker here for long term maintenance considerations of whether it is worth restructuring and adding unit tests now?

akremin commented 3 weeks ago

Given that the package is currently uniform in its proper usage of minimal scripts and code in py/desida, I will agree with Ben that we should make that adjustment. I understand it is a rarely used script and we want to get this merged and tagged, but ideally it would be 5 minutes of work to move the code and add a calling script.

I do not think we need to add unit tests. It is good enough that we put it in a form where we could add them later on.

My requests have been resolved. I will wait until you respond to Ben's request before testing so that I'm running on the final version.

sbailey commented 3 weeks ago

Thanks. I moved the functions to py/desida/prodjobs.py and made the script a lightweight wrapper.

akremin commented 3 weeks ago

@sbailey To test this I ran it on Jura and ran into a crash that I traced back to the healpix qid derivation:

Traceback (most recent call last):
  File "/global/homes/k/kremin/desi/python/desida/bin/desi_eval_prod_jobs", line 10, in <module>
    sys.exit(main())
  File "/global/homes/k/kremin/desi/python/desida/py/desida/prodjobs.py", line 185, in main
    qinfo = load_qinfo(args.specprod)
  File "/global/homes/k/kremin/desi/python/desida/py/desida/prodjobs.py", line 97, in load_qinfo
    jobdesc_qinfo = queue_info_from_qids(qids, columns=columns)
  File "/global/homes/k/kremin/desi/python/desispec/py/desispec/workflow/queue.py", line 287, in queue_info_from_qids
    qids = np.atleast_1d(qids).astype(int)
ValueError: invalid literal for int() with base 10: 'groupspectra'

This happens for jobdesc = 'zpix', which has multiple entries titled 'groupspectra'. Those were likely custom jobs due to memory issues in Jura that we handled better in Kibo.

I don't expect this to be an issue in Kibo or future prods, so you can leave it if you wish. But it may help to add a check at line 55 to verify that qid = os.path.splitext(os.path.basename(filename))[0].split('-')[-1] is an int. Otherwise don't include it in the qid list.

sbailey commented 3 weeks ago

Good catch. I had only tested on Kibo. I added that non-int "qid" check and print a warning message but don't crash.
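The guard can be sketched as follows; the parsing expression is the one quoted above, while the helper name and example filenames are illustrative:

```python
import os

def qids_from_filenames(filenames):
    """Extract integer SLURM job IDs from per-job filenames, skipping
    (with a warning) any whose trailing token is not an integer,
    e.g. Jura's custom 'groupspectra' entries."""
    qids = []
    for filename in filenames:
        qid = os.path.splitext(os.path.basename(filename))[0].split('-')[-1]
        try:
            qids.append(int(qid))
        except ValueError:
            print(f"WARNING: skipping non-integer qid {qid!r} from {filename}")
    return qids
```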

akremin commented 3 weeks ago

Thanks for the update. A try-except seemed like overkill compared to if qid.isnumeric(). However, instead of changing the logic, I've moved the qid parsing into the try-except as well, so the code is also robust to issues there. The file-parsing logic seems like it would always be correct, but this gives extra protection given that we're doing a try anyway.

sbailey commented 3 weeks ago

I didn't know about str.isnumeric but indeed I probably would have used it if I had known about it. I can't think of any string that would cause the filename parsing to crash, but putting it into the try/except is fine.

For the record: I was surprised that "1.1".isnumeric() and "-11".isnumeric() are both False, i.e. it is testing something like the regex "[0-9]+" (per character, plus other Unicode numeric characters), not "can this string be converted into a number". That happens to be what we want for this case, but beware of other uses.
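A quick demonstration of that caveat:

```python
# str.isnumeric() checks characters, not int()-convertibility
assert "123".isnumeric() is True
assert "1.1".isnumeric() is False   # a decimal point is not a numeric character
assert "-11".isnumeric() is False   # neither is a leading minus sign
assert "½".isnumeric() is True      # Unicode numeric characters pass...

# ...yet int("½") raises ValueError, so isnumeric() != "int() will succeed"
try:
    int("½")
    converted = True
except ValueError:
    converted = False
assert converted is False
```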

I will make one post-facto update to main, to use try / except ValueError. Bare excepts can catch way more than intended, masking other bugs (e.g. if we had forgotten to import os). Note that os.path.splitext and str.split('-') will always have [0] and [-1] elements, even if there isn't an extension or '-' in the filename, so we shouldn't have IndexErrors or other parsing problems there. If we do, I want to know about it.

sbailey commented 3 weeks ago

Last (?) note: I ran this for Kibo with

module load desida/main
cd $CFS/desi/users/desi/productions/kibo
desi_eval_prod_jobs -s kibo -o kibo-jobs.csv --summary kibo-job-summary.csv

i.e. this is cached in the productions/kibo directory (formerly labeled_proc_run_files/kibo), not spectro/redux/kibo itself.