Thanks @akremin. I have added updates for all 3 suggestions.
desida is a Python package, so my preference would be to create a file in py/desida and then have bin/desi_eval_prod_jobs be a minimalist sys.exit(main()) script. You could also add unit tests more easily that way.
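As a rough sketch of that layout (the module name desida.prodjobs and its main() entry point come from later in this thread; treat the exact names as assumptions here):

#!/usr/bin/env python
# bin/desi_eval_prod_jobs -- thin wrapper; all of the real logic lives in the package
import sys
from desida.prodjobs import main   # module/function names as used later in this PR

if __name__ == '__main__':
    sys.exit(main())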
I agree with moving scripts into py/desida/scripts when they need to be called from other pipelining code or have pieces to be re-used, and to enable unit tests, but I'm not sure it is worth it for scripts like this that will be run approximately once a year for major prods.
In particular for unit tests I don't think it is worth it to mock up fake slurm return calls etc. since the most likely failure points that we could/should test would be things like a slurm update changing its formatting and breaking parsing, or an update to production formats breaking parsing. That could only be tested at NERSC using a mini-prod built upon recent nights of daily... I'd rather spend time expanding unit test coverage of desispec itself.
I admit that I'm partially motivated by wanting to move on with a contribution of a "minimal viable product" script that works and provides useful functionality, and only add additional features if/when they are needed. @akremin can you be the tie breaker here on the long-term maintenance question of whether it is worth restructuring and adding unit tests now?
Given that the package is currently uniform in its proper usage of minimal scripts with the code in py/desida, I agree with Ben that we should make that adjustment. I understand it is a rarely used script and we want to get this merged and tagged, but ideally it is only about 5 minutes of work to move the code and add a calling script.
I do not think we need to add unit tests. It is good enough that we put it in a form where we could add them later on.
My requests have been resolved. I will wait until you respond to Ben's request before testing so that I'm running on the final version.
Thanks. I moved the functions to py/desida/prodjobs.py and made the script a lightweight wrapper.
@sbailey To test this I ran it on Jura and ran into a crash that I traced back to the healpix qid derivation:
Traceback (most recent call last):
File "/global/homes/k/kremin/desi/python/desida/bin/desi_eval_prod_jobs", line 10, in <module>
sys.exit(main())
File "/global/homes/k/kremin/desi/python/desida/py/desida/prodjobs.py", line 185, in main
qinfo = load_qinfo(args.specprod)
File "/global/homes/k/kremin/desi/python/desida/py/desida/prodjobs.py", line 97, in load_qinfo
jobdesc_qinfo = queue_info_from_qids(qids, columns=columns)
File "/global/homes/k/kremin/desi/python/desispec/py/desispec/workflow/queue.py", line 287, in queue_info_from_qids
qids = np.atleast_1d(qids).astype(int)
ValueError: invalid literal for int() with base 10: 'groupspectra'
This happens for jobdesc = 'zpix', which has multiple entries titled 'groupspectra'. Those were likely custom jobs run due to memory issues in Jura that we handled better in Kibo.
I don't expect this to be an issue in Kibo or future prods, so you can leave it if you wish. But it may help to add a check at line 55 to verify that qid = os.path.splitext(os.path.basename(filename))[0].split('-')[-1] is an int, and otherwise not include it in the qid list.
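One possible form of that check (a sketch only; the filenames below are purely illustrative, not actual production file names):

import os

# illustrative inputs only; the real filenames come from the production's job files
filenames = ['example-12345678.log', 'example-groupspectra.log']

qids = []
for filename in filenames:
    qid = os.path.splitext(os.path.basename(filename))[0].split('-')[-1]
    if qid.isnumeric():   # keep only entries whose trailing token is an integer qid
        qids.append(int(qid))
    else:
        print(f'WARNING: skipping non-integer qid {qid!r} from {filename}')

print(qids)   # -> [12345678]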
Good catch. I had only tested on Kibo. I added that non-int "qid" check; it now prints a warning message instead of crashing.
Thanks for the update. try-except seemed like overkill vs if qid.isnumeric(). However, instead of changing the logic, I've moved the qid parsing into the try-except as well so that the code is also robust to issues there. The file parsing logic seems like it would always be correct, but this gives more protection given that we're doing a try anyway.
I didn't know about str.isnumeric but indeed I probably would have used it if I had known about it. I can't think of any string that would cause the filename parsing to crash, but putting it into the try/except is fine.
For the record: I was surprised that "1.1".isnumeric() and "-11".isnumeric() are both False, i.e. it is testing the regex "[0-9]+", not "can this string be converted into a number". That happens to be what we want for this case, but beware of other uses.
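A quick illustration of that behavior:

>>> "1.1".isnumeric(), "-11".isnumeric(), "11".isnumeric()
(False, False, True)
>>> int("-11")   # converts fine even though isnumeric() is False
-11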
I will make one post-facto update to main, to use try/except ValueError. Bare excepts can catch way more than intended, masking other bugs (e.g. if we had forgotten to import os). Note that os.path.splitext and str.split('-') will always have [0] and [-1] elements, even if there isn't an extension or '-' in the filename, so we shouldn't have IndexErrors or other parsing problems there. If we do, I want to know about it.
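A sketch of the resulting pattern (the function and variable names here are illustrative; the real code lives in py/desida/prodjobs.py):

import os

def parse_qid(filename):
    """Return the integer qid parsed from the end of filename, or None if it isn't an integer.

    Illustrative sketch only, e.g. 'example-12345678.log' -> 12345678.
    """
    try:
        return int(os.path.splitext(os.path.basename(filename))[0].split('-')[-1])
    except ValueError:
        # Catch only ValueError so that unrelated bugs (e.g. a missing import)
        # are not silently masked the way a bare except would mask them.
        print(f'WARNING: could not parse an integer qid from {filename}; skipping')
        return None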
Last (?) note: I ran this for Kibo with
module load desida/main
cd $CFS/desi/users/desi/productions/kibo
desi_eval_prod_jobs -s kibo -o kibo-jobs.csv --summary kibo-job-summary.csv
i.e. this is cached in the productions/kibo directory (formerly labeled_proc_run_files/kibo), not spectro/redux/kibo itself.
This PR adds a script desi_eval_prod_jobs which will query SLURM for all jobs in a production (including failed, canceled, etc.) so that they can be saved to an output file for later analysis, even after SLURM has forgotten about those jobids (something like 6 months later at NERSC). Example outputs in /global/cfs/cdirs/desi/users/sjbailey/dev/qinfo generated with:
Currently this keeps columns
Most of those are taken directly from sacct, but the following are derived for convenience:
JOBDESC: production job description, e.g. "ccdcalib", "tilenight", "cumulative", "zpix"
NODE_HOURS: ELAPSED hh:mm:ss string parsed into hours and multiplied by NNODES (see the sketch after this list)
GPU: 1/0 for whether this was run on GPU nodes or not, based on parsing CONSTRAINTS

@akremin do you think this is the right set of columns to keep? CONSTRAINTS and PARTITION might be extraneous. Are there other columns we should keep?
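A sketch of how the NODE_HOURS derivation could look (the optional day prefix on ELAPSED and the GPU line are assumptions about sacct/NERSC conventions, not taken from this PR):

def node_hours(elapsed, nnodes):
    """Convert a slurm ELAPSED string like '02:30:00' (or '1-02:30:00') into hours times the node count.

    Sketch only; not the exact implementation in desida.prodjobs.
    """
    days = 0
    if '-' in elapsed:                      # sacct may prefix whole days as 'D-HH:MM:SS'
        daystr, elapsed = elapsed.split('-')
        days = int(daystr)
    hh, mm, ss = (int(x) for x in elapsed.split(':'))
    return nnodes * (24*days + hh + mm/60 + ss/3600)

# e.g. node_hours('02:30:00', nnodes=4) -> 10.0
# GPU flag (assumption about how CONSTRAINTS is encoded at NERSC):
# gpu = int('gpu' in constraints.lower())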
desi_eval_prod_jobs also prints a summary of states per jobdesc; that was saved separately into kibo-summary.csv, but it can also be re-generated by re-reading the full jobs table without having to re-query slurm. This was useful for debugging, and may also be useful if we want to update the summary info in a later iteration: