desihub / desitarget

DESI Targeting
BSD 3-Clause "New" or "Revised" License
18 stars 23 forks source link

No override sv3 #767

Closed jalasker closed 2 years ago

jalasker commented 2 years ago

Stops adding the OVERRIDE keyword to the meta dictionary for the SV3(and 2 and 1) and CMX MTLs since the presence of that keyword causes problems downstream in the alternate MTL code.

I would prefer to wait on merging this pull request until I have confirmed that it doesn't create undesired effects in the final alternate MTL bitweight products, but I can confirm that it does have the desired effect in the initial and intermediate stages of the alternate MTL pipeline.

coveralls commented 2 years ago

Coverage Status

Coverage increased (+0.03%) to 57.814% when pulling 8fe9e5e48d30808628c0af5ddd2cebe845133262 on no-override-sv3 into bb5d5c99f187631c3ac9d4a1eb60716a6076d164 on master.

michaelJwilson commented 2 years ago

I was trying to run the SV3 alt ledgers for a student and ran into this. I think a more direct approach is to correct the bool cast used here: https://github.com/desihub/desitarget/blob/cba082b3b73bc0dfc0e8a790d50a8d5151ba1753/py/desitarget/io.py#L2906

which converts OVERRIDE='false' to True for new (alt) ledgers written by write_mtl and read by read_mtl_in_hp. This triggers the override insertions in the predicted filepath when it shouldn't and causes the alt mtl code to then fail (on create_altmtl -> read_mtl_in_hp). Otherwise, I think we run into this again for the main survey alt mtls?

jalasker commented 2 years ago

That's probably cleaner/more precise than what I did, but I figured that since there were no overrides before the main survey (SV3, etc) that removing the keyword would also work fine.

Is what you are suggesting to do something compare OVERRIDE.lower() to 'true' rather than casting? Because to me it seems like we would either need to do that, replace the bool header keyword with a 0/1, or replace the header keywords with "True" and "False" rather than "true" and "false."

The latter two would require changing all of the image headers, so I'm pretty sure that would be a no-go.

geordie666 commented 2 years ago

I also noticed @michaelJwilson's solution, and already partially applied it in my other branch. See the cleanup kwarg here:

https://github.com/desihub/desitarget/blob/6cd285cce778fd6c2f9b4646654f6abd58e52ae7/py/desitarget/io.py#L2831-L2842

and the associated code here:

https://github.com/desihub/desitarget/blob/6cd285cce778fd6c2f9b4646654f6abd58e52ae7/py/desitarget/io.py#L2870-L2880

My suggestion is that we merge this PR (once the whitespace is cleaned up). Then, after #768 is merged I can propagate the cleanup kwarg to read_keyword_from_mtl_header() and that will just fix the casting bug.

If there's some urgency, we could try the git-fu to merge things now. But, I vote for just waiting until everything is merged. I'll add an issue, though.

jalasker commented 2 years ago

Looks mostly good to me, but I would just make it "val.lower()" in both conditionals just so that we don't have to worry about anything from all caps to no caps.

Also I'd make cleanup the default since it seems like that is the intended behavior, while leaving an option to set it as False would let us reproduce things that were originally done incorrectly.

jalasker commented 2 years ago

I don't know if we are going to actually need to merge this PR. My fix is for an issue that should be fixed by your other branch, so I'd recommend merging that first and then, if that doesn't fix the issue for SV3, merging this branch.

I also removed all of the whitespace that you identified in those comments, if there is more, I can't see it.

geordie666 commented 2 years ago

OK, sounds good. Let me extend my fixes in the other branch to try to account for this issue, and then if that just works out-of-the-box we can delete this PR without merging.

I'll implement your .lower() idea in my other branch. I'll also make cleanup the default. I wrestled with that a little bit, as that's not technically backwards compatible. But, I agree that it's probably the correct choice.

jalasker commented 2 years ago

I just realized that adding val.lower() is going to require a precheck for type being string or a try/except block as well since if it runs on anything but a string it will fail.

geordie666 commented 2 years ago

@jalasker: I made the needed changes (to "clean up" reading header values) in my other branch (https://github.com/desihub/desitarget/tree/ADM-purge-MTL). Could you check out that branch and test whether it will work for the purposes of running the alt MTL ledgers for SV3? Thanks!

jalasker commented 2 years ago

The initial step of generating the MTLs is working fine with the new branch as is most of the updating process.

However, I am now timing out at the last step of the update on this comment:

INFO:mtl.py:1165:process_overrides: Processing override ledgers

There shouldn't be any overrides to process since this is SV3. I am not sure if the process is actually hanging or if it is a process that would complete given a longer time allotment. I can dig in more, but that will cost some core hours and time, so I was hoping someone in this thread might be able to tell what is going on.

michaelJwilson commented 2 years ago

Compared to Adam, I'm going to be of no help, but am invested enough to ask: where is the script you call this? Where is the output? When I tried with this new branch I hit a new problem of not finding redshifts, but I assume that was a calling error.

geordie666 commented 2 years ago

@jalasker: When I run the MTL loop on the regular data using the branch that is now in master, I receive your message about overrides, but there's no hanging and the task quickly runs to completion. Here's an example of the output for a test case:

[adamyers@cori07 mtl]$ run_mtl_loop DARK --mtldir /global/cscratch1/sd/adamyers/fakesurveyops2/mtl
INFO:mtl.py:2025:loop_ledger: running on SECONDARY ledger with obscon=DARK and survey=main
INFO:mtl.py:2070:loop_ledger: Update state for 33652 targets (the zcats also contain 669 skies with +ve TARGETIDs)
INFO:mtl.py:416:make_mtl: Ignoring 33145 z entries that aren't in the input target list (e.g. likely skies, secondaries-when-running-primary, primaries-when-running-secondary, etc.)
INFO:mtl.py:424:make_mtl: Ignoring a further 11 zcat entries with NODATA set
INFO:mtl.py:431:make_mtl: Ignoring a further 2 zcat entries with BAD_SPECQA or BAD_PETALQA set
WARNING:mtl.py:437:make_mtl: The size of the zcat has changed, so it won't be modified!
INFO:targets.py:866:calc_priority: 1163 scnd targets to be updated as secondary-only
INFO:targets.py:897:calc_priority: 0 more scnd targets allowed to update MWS primaries
INFO:targets.py:902:calc_priority: 1163 scnd targets to be updated in total
INFO:mtl.py:512:make_mtl: 1041 of 69187 targets have priority <=2, setting N_obs=0.
INFO:mtl.py:587:make_mtl: Done...t=0.2s
INFO:mtl.py:1165:process_overrides: Processing override ledgers
INFO:mtl.py:1165:process_overrides: Processing override ledgers
INFO:mtl.py:1165:process_overrides: Processing override ledgers
INFO:mtl.py:1165:process_overrides: Processing override ledgers
INFO:mtl.py:1165:process_overrides: Processing override ledgers
INFO:mtl.py:1165:process_overrides: Processing override ledgers
INFO:mtl.py:1165:process_overrides: Processing override ledgers
INFO:run_mtl_loop:52:<module>: MTL ledger directory: /global/cscratch1/sd/adamyers/fakesurveyops2/mtl/main/secondary/dark
INFO:run_mtl_loop:53:<module>: MTL tile file: /global/cscratch1/sd/adamyers/fakesurveyops2/mtl/scnd-mtl-done-tiles.ecsv
INFO:run_mtl_loop:54:<module>: Redshift tile file: /global/cscratch1/sd/adamyers/fakesurveyops2/ops/tiles-specstatus.ecsv
INFO:run_mtl_loop:55:<module>: Processed 8 new tiles, which are: [( 3140, '2021-11-05T01:16:08+00:00', '1.3.0.dev5218', 'DARK', '20211009')
 ( 4817, '2021-11-05T01:16:08+00:00', '1.3.0.dev5218', 'DARK', '20211003')
 ( 6367, '2021-11-05T01:16:08+00:00', '1.3.0.dev5218', 'DARK', '20211007')
 ( 6538, '2021-11-05T01:16:08+00:00', '1.3.0.dev5218', 'DARK', '20211009')
 ( 8077, '2021-11-05T01:16:08+00:00', '1.3.0.dev5218', 'DARK', '20211009')
 ( 9355, '2021-11-05T01:16:08+00:00', '1.3.0.dev5218', 'DARK', '20211008')
 ( 9630, '2021-11-05T01:16:08+00:00', '1.3.0.dev5218', 'DARK', '20211008')
 (10542, '2021-11-05T01:16:08+00:00', '1.3.0.dev5218', 'DARK', '20211008')]
INFO:mtl.py:2028:loop_ledger: running on PRIMARY ledger with obscon=DARK and survey=main
INFO:mtl.py:2070:loop_ledger: Update state for 33652 targets (the zcats also contain 669 skies with +ve TARGETIDs)
INFO:mtl.py:416:make_mtl: Ignoring 1845 z entries that aren't in the input target list (e.g. likely skies, secondaries-when-running-primary, primaries-when-running-secondary, etc.)
INFO:mtl.py:424:make_mtl: Ignoring a further 266 zcat entries with NODATA set
INFO:mtl.py:431:make_mtl: Ignoring a further 69 zcat entries with BAD_SPECQA or BAD_PETALQA set
WARNING:mtl.py:437:make_mtl: The size of the zcat has changed, so it won't be modified!
INFO:targets.py:866:calc_priority: 0 scnd targets to be updated as secondary-only
INFO:targets.py:897:calc_priority: 0 more scnd targets allowed to update MWS primaries
INFO:targets.py:902:calc_priority: 0 scnd targets to be updated in total
INFO:mtl.py:512:make_mtl: 28993 of 600146 targets have priority <=2, setting N_obs=0.
INFO:mtl.py:587:make_mtl: Done...t=3.2s
INFO:run_mtl_loop:52:<module>: MTL ledger directory: /global/cscratch1/sd/adamyers/fakesurveyops2/mtl/main/dark
INFO:run_mtl_loop:53:<module>: MTL tile file: /global/cscratch1/sd/adamyers/fakesurveyops2/mtl/mtl-done-tiles.ecsv
INFO:run_mtl_loop:54:<module>: Redshift tile file: /global/cscratch1/sd/adamyers/fakesurveyops2/ops/tiles-specstatus.ecsv
INFO:run_mtl_loop:55:<module>: Processed 8 new tiles, which are: [( 3140, '2021-11-05T01:16:22+00:00', '1.3.0.dev5218', 'DARK', '20211009')
 ( 4817, '2021-11-05T01:16:22+00:00', '1.3.0.dev5218', 'DARK', '20211003')
 ( 6367, '2021-11-05T01:16:22+00:00', '1.3.0.dev5218', 'DARK', '20211007')
 ( 6538, '2021-11-05T01:16:22+00:00', '1.3.0.dev5218', 'DARK', '20211009')
 ( 8077, '2021-11-05T01:16:22+00:00', '1.3.0.dev5218', 'DARK', '20211009')
 ( 9355, '2021-11-05T01:16:22+00:00', '1.3.0.dev5218', 'DARK', '20211008')
 ( 9630, '2021-11-05T01:16:22+00:00', '1.3.0.dev5218', 'DARK', '20211008')
 (10542, '2021-11-05T01:16:22+00:00', '1.3.0.dev5218', 'DARK', '20211008')]

The INFO:mtl.py:1165:process_overrides: Processing override ledgers steps are nearly instantaneous. So, basically, I can't recreate your bug. Can you come up with a simple example I can try to run to debug?

If you want to try my example for (a fake copy of) the "real" data:

jalasker commented 2 years ago

I ran my own job again and it wasn't hanging anymore. I think the original hangup was a result of some parallel job taking too long or an I/O collision causing an unexpected delay. (@michaelJwilson I will eventually have documentation up on the wiki with the code that I'm running, but for now I'm not even certain the current version of the altmtl code is in the LSS github. The script is /global/cfs/cdirs/desi/survey/catalogs/SV3/LSS/altmtl/debug_jl/dateLoopAltMTL.sh which calls runAltMTLParallel.py in the same directory.).

However, once the code was running again, it showed some additional errors regarding being unable to find the keyword "NUMOVERRIDE" in one of the SV3 MTLs:

""" Traceback (most recent call last): File "/global/common/software/desi/cori/desiconda/20200801-1.4.0-spec/conda/lib/python3.8/multiprocessing/pool.py", line 125, in worker result = (True, func(*args, *kwds)) File "/global/common/software/desi/cori/desiconda/20200801-1.4.0-spec/conda/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar return list(map(args)) File "/global/cfs/cdirs/desi/survey/catalogs/SV3/LSS/altmtl/debug_jl/runAltMTLParallel.py", line 65, in procFunc amt.loop_alt_ledger(obscon, survey = survey, mtldir = mtldir, zcatdir = zcatdir, altmtlbasedir = altmtlbasedir, ndirs = ndirs, numobs_from_ledger = numobs_from_ledger,secondary = secondary, getosubp = getosubp, quickRestart = quickRestart, multiproc = multiproc, nproc = nproc, singleDate = singleDate, redoFA = redoFA) File "/global/homes/j/jlasker/.local/desicode/LSS/py/LSS/SV3/altmtltools.py", line 607, in loop_alt_ledger update_ledger(althpdirname, altZCat, obscon=obscon.upper(), File "/global/homes/j/jlasker/.local/lib/python3.8/site-packages/desitarget/mtl.py", line 1472, in update_ledger overmtl = process_overrides(overfn) File "/global/homes/j/jlasker/.local/lib/python3.8/site-packages/desitarget/mtl.py", line 1175, in process_overrides mtl["NUMOVERRIDE"] -= 1 File "/global/common/software/desi/cori/desiconda/20200801-1.4.0-spec/conda/lib/python3.8/site-packages/astropy/table/table.py", line 1602, in getitem return self.columns[item] File "/global/common/software/desi/cori/desiconda/20200801-1.4.0-spec/conda/lib/python3.8/site-packages/astropy/table/table.py", line 239, in getitem return OrderedDict.getitem(self, item) KeyError: 'NUMOVERRIDE' """

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/global/cfs/cdirs/desi/survey/catalogs/SV3/LSS/altmtl/debug_jl/runAltMTLParallel.py", line 97, in result = p.map(procFunc,inds) File "/global/common/software/desi/cori/desiconda/20200801-1.4.0-spec/conda/lib/python3.8/multiprocessing/pool.py", line 364, in map return self._map_async(func, iterable, mapstar, chunksize).get() File "/global/common/software/desi/cori/desiconda/20200801-1.4.0-spec/conda/lib/python3.8/multiprocessing/pool.py", line 771, in get raise self._value KeyError: 'NUMOVERRIDE'

Adding additional print/log statements into the process_override function and adjacent functions, I was able to find that it was looking for a non-override MTL (e.g. /global/cfs/cdirs/desi/survey/catalogs/SV3/LSS/altmtl/debug_jl/alt_mtls_testAdamOverrideFix_64dirs//Univ056/sv3/dark/sv3mtl-dark-hp-2594.ecsv) in the process_override function.

I added a check to only try to run process_override within update_ledger if the MTL file contained "override" as part of the file name and that seemed to get it to work.

I think the optimal way to fix this would probably be to modify find_mtl_file_format_from_header with the override option so that it only finds override MTLs, but I am not sure how to do that.

jalasker commented 2 years ago

@geordie666 I re-pulled the desitarget code after the recent updates and I'm still getting errors. I also cannot get your example to work. It fails asking for the field "ARCHIVE_DATE"

jlasker@cori05:/global/cscratch1/sd/jlasker/fakesurveyops2/mtl> run_mtl_loop DARK --mtldir /global/cscratch1/sd/jlasker/fakesurveyops2/mtl/ WARNING: leap-second auto-update failed due to the following exception: RuntimeError('Cache is locked after 5.03 s. This may indicate an astropy bug or that kill -9 was used. If you want to unlock the cache remove the directory /global/homes/j/jlasker/.astropy/cache/download/py3/lock.') [astropy.time.core] INFO:mtl.py:2303:loop_ledger: running on SECONDARY ledger with obscon=DARK and survey=main Traceback (most recent call last): File "/global/homes/j/jlasker/.local/bin/run_mtl_loop", line 52, in hpdirname, mtltilefn, tilefn, tiles = loop_ledger( File "/global/homes/j/jlasker/.local/lib/python3.8/site-packages/desitarget/mtl.py", line 2313, in loop_ledger tiles = tiles_to_be_processed(zcatdir, mtltilefn, obscon, survey, File "/global/homes/j/jlasker/.local/lib/python3.8/site-packages/desitarget/mtl.py", line 2035, in tiles_to_be_processed newtiles["ARCHIVEDATE"] = tiles["ARCHIVEDATE"] File "/global/common/software/desi/cori/desiconda/20200801-1.4.0-spec/conda/lib/python3.8/site-packages/astropy/table/table.py", line 1602, in getitem return self.columns[item] File "/global/common/software/desi/cori/desiconda/20200801-1.4.0-spec/conda/lib/python3.8/site-packages/astropy/table/table.py", line 239, in getitem return OrderedDict.getitem(self, item) KeyError: 'ARCHIVEDATE'

geordie666 commented 2 years ago

We updated the mtl-done-tiles.ecsv and scnd-mtl-done-tiles.ecsv files to have a new column ARCHIVEDATE. This is to facilitate reprocessing of tiles, as the spectroscopic pipeline is now storing the redshift files in the directory $ZCAT_DIR/tiles/archive/ARCHIVEDATE rather than in the cumulative directory.

The first step in fixing the loop for the alternate MTLs would be to copy across the "official" MTL done files from:

/global/cfs/cdirs/desi/survey/ops/surveyops/trunk/mtl/mtl-done-tiles.ecsv
/global/cfs/cdirs/desi/survey/ops/surveyops/trunk/mtl/scnd-mtl-done-tiles.ecsv

(which contain the ARCHIVEDATE column).

Note that I'm currently working on a PR to complicate this yet further by allowing reprocessed tiles to be passed through MTL. So, you might not want to delve into this too deeply until all of the new MTL infrastructure is in place (until then you could use an earlier tag of desitarget, perhaps?).

geordie666 commented 2 years ago

@jalasker: I'm just getting back to this after the various extra MTL updates.

Do you want to test the current branch and see if we need further updates to facilitate working with the alternate MTLs?

Also, should we close this PR and pick up any needed work in a new branch?

jalasker commented 2 years ago

Sorry, I was traveling home today. I will test the current branch tomorrow and let you know if we still need a fix for the override code with SV3.

jalasker commented 2 years ago

Running the tests you described on the MTLs you described/linked above works for me after grabbing the updated files.

It is also not failing on the alternate MTLs but is not updating any targets.

I believe that I have some updates to make to the alt MTL loop, possibly passing some additional options that weren't present in the original loop_ledger function.

geordie666 commented 2 years ago

@jalasker: Great, glad this works in the short-term. Is it OK to close this PR and work on any other updates in a new branch spawned from the current master branch?

jalasker commented 2 years ago

So I think it is fine to work on the updates in a new branch with the latest code version.

However, I am still getting the errors about looking for "NUMOVERRIDE" even with the latest code. (The reason that it didn't fail last night is that I forgot to re-remove the tiles that I originally removed from the mtl-tiles-done file that I copied from your fakesurveyops2 directory. Once I removed those tiles and it actually tried to update the MTLs it failed with the same error.)

geordie666 commented 2 years ago

OK, let me close this and open an issue instead. I can then work on fixing the NUMOVERRIDE problem in a separate branch.