Closed jalasker closed 2 years ago
I was trying to run the SV3 alt ledgers for a student and ran into this. I think a more direct approach is to correct the bool cast used here: https://github.com/desihub/desitarget/blob/cba082b3b73bc0dfc0e8a790d50a8d5151ba1753/py/desitarget/io.py#L2906
which converts OVERRIDE='false' to True for new (alt) ledgers written by write_mtl and read by read_mtl_in_hp. This triggers the override insertions in the predicted filepath when it shouldn't and causes the alt mtl code to then fail (on create_altmtl -> read_mtl_in_hp). Otherwise, I think we run into this again for the main survey alt mtls?
That's probably cleaner/more precise than what I did, but I figured that since there were no overrides before the main survey (SV3, etc) that removing the keyword would also work fine.
Is what you are suggesting to do something compare OVERRIDE.lower() to 'true' rather than casting? Because to me it seems like we would either need to do that, replace the bool header keyword with a 0/1, or replace the header keywords with "True" and "False" rather than "true" and "false."
The latter two would require changing all of the image headers, so I'm pretty sure that would be a no-go.
I also noticed @michaelJwilson's solution, and already partially applied it in my other branch. See the cleanup
kwarg here:
and the associated code here:
My suggestion is that we merge this PR (once the whitespace is cleaned up). Then, after #768 is merged I can propagate the cleanup
kwarg to read_keyword_from_mtl_header()
and that will just fix the casting bug.
If there's some urgency, we could try the git-fu to merge things now. But, I vote for just waiting until everything is merged. I'll add an issue, though.
Looks mostly good to me, but I would just make it "val.lower()" in both conditionals just so that we don't have to worry about anything from all caps to no caps.
Also I'd make cleanup the default since it seems like that is the intended behavior, while leaving an option to set it as False would let us reproduce things that were originally done incorrectly.
I don't know if we are going to actually need to merge this PR. My fix is for an issue that should be fixed by your other branch, so I'd recommend merging that first and then, if that doesn't fix the issue for SV3, merging this branch.
I also removed all of the whitespace that you identified in those comments, if there is more, I can't see it.
OK, sounds good. Let me extend my fixes in the other branch to try to account for this issue, and then if that just works out-of-the-box we can delete this PR without merging.
I'll implement your .lower()
idea in my other branch. I'll also make cleanup
the default. I wrestled with that a little bit, as that's not technically backwards compatible. But, I agree that it's probably the correct choice.
I just realized that adding val.lower() is going to require a precheck for type being string or a try/except block as well since if it runs on anything but a string it will fail.
@jalasker: I made the needed changes (to "clean up" reading header values) in my other branch (https://github.com/desihub/desitarget/tree/ADM-purge-MTL). Could you check out that branch and test whether it will work for the purposes of running the alt MTL ledgers for SV3? Thanks!
The initial step of generating the MTLs is working fine with the new branch as is most of the updating process.
However, I am now timing out at the last step of the update on this comment:
INFO:mtl.py:1165:process_overrides: Processing override ledgers
There shouldn't be any overrides to process since this is SV3. I am not sure if the process is actually hanging or if it is a process that would complete given a longer time allotment. I can dig in more, but that will cost some core hours and time, so I was hoping someone in this thread might be able to tell what is going on.
Compared to Adam, I'm going to be of no help, but am invested enough to ask: where is the script you call this? Where is the output? When I tried with this new branch I hit a new problem of not finding redshifts, but I assume that was a calling error.
@jalasker: When I run the MTL loop on the regular data using the branch that is now in master
, I receive your message about overrides, but there's no hanging and the task quickly runs to completion. Here's an example of the output for a test case:
[adamyers@cori07 mtl]$ run_mtl_loop DARK --mtldir /global/cscratch1/sd/adamyers/fakesurveyops2/mtl
INFO:mtl.py:2025:loop_ledger: running on SECONDARY ledger with obscon=DARK and survey=main
INFO:mtl.py:2070:loop_ledger: Update state for 33652 targets (the zcats also contain 669 skies with +ve TARGETIDs)
INFO:mtl.py:416:make_mtl: Ignoring 33145 z entries that aren't in the input target list (e.g. likely skies, secondaries-when-running-primary, primaries-when-running-secondary, etc.)
INFO:mtl.py:424:make_mtl: Ignoring a further 11 zcat entries with NODATA set
INFO:mtl.py:431:make_mtl: Ignoring a further 2 zcat entries with BAD_SPECQA or BAD_PETALQA set
WARNING:mtl.py:437:make_mtl: The size of the zcat has changed, so it won't be modified!
INFO:targets.py:866:calc_priority: 1163 scnd targets to be updated as secondary-only
INFO:targets.py:897:calc_priority: 0 more scnd targets allowed to update MWS primaries
INFO:targets.py:902:calc_priority: 1163 scnd targets to be updated in total
INFO:mtl.py:512:make_mtl: 1041 of 69187 targets have priority <=2, setting N_obs=0.
INFO:mtl.py:587:make_mtl: Done...t=0.2s
INFO:mtl.py:1165:process_overrides: Processing override ledgers
INFO:mtl.py:1165:process_overrides: Processing override ledgers
INFO:mtl.py:1165:process_overrides: Processing override ledgers
INFO:mtl.py:1165:process_overrides: Processing override ledgers
INFO:mtl.py:1165:process_overrides: Processing override ledgers
INFO:mtl.py:1165:process_overrides: Processing override ledgers
INFO:mtl.py:1165:process_overrides: Processing override ledgers
INFO:run_mtl_loop:52:<module>: MTL ledger directory: /global/cscratch1/sd/adamyers/fakesurveyops2/mtl/main/secondary/dark
INFO:run_mtl_loop:53:<module>: MTL tile file: /global/cscratch1/sd/adamyers/fakesurveyops2/mtl/scnd-mtl-done-tiles.ecsv
INFO:run_mtl_loop:54:<module>: Redshift tile file: /global/cscratch1/sd/adamyers/fakesurveyops2/ops/tiles-specstatus.ecsv
INFO:run_mtl_loop:55:<module>: Processed 8 new tiles, which are: [( 3140, '2021-11-05T01:16:08+00:00', '1.3.0.dev5218', 'DARK', '20211009')
( 4817, '2021-11-05T01:16:08+00:00', '1.3.0.dev5218', 'DARK', '20211003')
( 6367, '2021-11-05T01:16:08+00:00', '1.3.0.dev5218', 'DARK', '20211007')
( 6538, '2021-11-05T01:16:08+00:00', '1.3.0.dev5218', 'DARK', '20211009')
( 8077, '2021-11-05T01:16:08+00:00', '1.3.0.dev5218', 'DARK', '20211009')
( 9355, '2021-11-05T01:16:08+00:00', '1.3.0.dev5218', 'DARK', '20211008')
( 9630, '2021-11-05T01:16:08+00:00', '1.3.0.dev5218', 'DARK', '20211008')
(10542, '2021-11-05T01:16:08+00:00', '1.3.0.dev5218', 'DARK', '20211008')]
INFO:mtl.py:2028:loop_ledger: running on PRIMARY ledger with obscon=DARK and survey=main
INFO:mtl.py:2070:loop_ledger: Update state for 33652 targets (the zcats also contain 669 skies with +ve TARGETIDs)
INFO:mtl.py:416:make_mtl: Ignoring 1845 z entries that aren't in the input target list (e.g. likely skies, secondaries-when-running-primary, primaries-when-running-secondary, etc.)
INFO:mtl.py:424:make_mtl: Ignoring a further 266 zcat entries with NODATA set
INFO:mtl.py:431:make_mtl: Ignoring a further 69 zcat entries with BAD_SPECQA or BAD_PETALQA set
WARNING:mtl.py:437:make_mtl: The size of the zcat has changed, so it won't be modified!
INFO:targets.py:866:calc_priority: 0 scnd targets to be updated as secondary-only
INFO:targets.py:897:calc_priority: 0 more scnd targets allowed to update MWS primaries
INFO:targets.py:902:calc_priority: 0 scnd targets to be updated in total
INFO:mtl.py:512:make_mtl: 28993 of 600146 targets have priority <=2, setting N_obs=0.
INFO:mtl.py:587:make_mtl: Done...t=3.2s
INFO:run_mtl_loop:52:<module>: MTL ledger directory: /global/cscratch1/sd/adamyers/fakesurveyops2/mtl/main/dark
INFO:run_mtl_loop:53:<module>: MTL tile file: /global/cscratch1/sd/adamyers/fakesurveyops2/mtl/mtl-done-tiles.ecsv
INFO:run_mtl_loop:54:<module>: Redshift tile file: /global/cscratch1/sd/adamyers/fakesurveyops2/ops/tiles-specstatus.ecsv
INFO:run_mtl_loop:55:<module>: Processed 8 new tiles, which are: [( 3140, '2021-11-05T01:16:22+00:00', '1.3.0.dev5218', 'DARK', '20211009')
( 4817, '2021-11-05T01:16:22+00:00', '1.3.0.dev5218', 'DARK', '20211003')
( 6367, '2021-11-05T01:16:22+00:00', '1.3.0.dev5218', 'DARK', '20211007')
( 6538, '2021-11-05T01:16:22+00:00', '1.3.0.dev5218', 'DARK', '20211009')
( 8077, '2021-11-05T01:16:22+00:00', '1.3.0.dev5218', 'DARK', '20211009')
( 9355, '2021-11-05T01:16:22+00:00', '1.3.0.dev5218', 'DARK', '20211008')
( 9630, '2021-11-05T01:16:22+00:00', '1.3.0.dev5218', 'DARK', '20211008')
(10542, '2021-11-05T01:16:22+00:00', '1.3.0.dev5218', 'DARK', '20211008')]
The INFO:mtl.py:1165:process_overrides: Processing override ledgers
steps are nearly instantaneous. So, basically, I can't recreate your bug. Can you come up with a simple example I can try to run to debug?
If you want to try my example for (a fake copy of) the "real" data:
/global/cscratch1/sd/adamyers/fakesurveyops2
directory to your scratch space. Let's call that directory /global/cscratch1/sd/jlasker/fakesurveyops2
/global/cscratch1/sd/jlasker/fakesurveyops2/mtl/mtl-done-tiles.ecsv
and remove the last few dark tiles (so they will be re-run by MTL).$ZCAT_DIR
environment variable set to /global/cfs/cdirs/desi/spectro/redux/daily
.run_mtl_loop DARK --mtldir /global/cscratch1/sd/jlasker/fakesurveyops2/mtl
. Make absolutely sure to pass --mtldir
so as not to corrupt your existing $MTL_DIR
directory!I ran my own job again and it wasn't hanging anymore. I think the original hangup was a result of some parallel job taking too long or an I/O collision causing an unexpected delay. (@michaelJwilson I will eventually have documentation up on the wiki with the code that I'm running, but for now I'm not even certain the current version of the altmtl code is in the LSS github. The script is /global/cfs/cdirs/desi/survey/catalogs/SV3/LSS/altmtl/debug_jl/dateLoopAltMTL.sh which calls runAltMTLParallel.py in the same directory.).
However, once the code was running again, it showed some additional errors regarding being unable to find the keyword "NUMOVERRIDE" in one of the SV3 MTLs:
""" Traceback (most recent call last): File "/global/common/software/desi/cori/desiconda/20200801-1.4.0-spec/conda/lib/python3.8/multiprocessing/pool.py", line 125, in worker result = (True, func(*args, *kwds)) File "/global/common/software/desi/cori/desiconda/20200801-1.4.0-spec/conda/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar return list(map(args)) File "/global/cfs/cdirs/desi/survey/catalogs/SV3/LSS/altmtl/debug_jl/runAltMTLParallel.py", line 65, in procFunc amt.loop_alt_ledger(obscon, survey = survey, mtldir = mtldir, zcatdir = zcatdir, altmtlbasedir = altmtlbasedir, ndirs = ndirs, numobs_from_ledger = numobs_from_ledger,secondary = secondary, getosubp = getosubp, quickRestart = quickRestart, multiproc = multiproc, nproc = nproc, singleDate = singleDate, redoFA = redoFA) File "/global/homes/j/jlasker/.local/desicode/LSS/py/LSS/SV3/altmtltools.py", line 607, in loop_alt_ledger update_ledger(althpdirname, altZCat, obscon=obscon.upper(), File "/global/homes/j/jlasker/.local/lib/python3.8/site-packages/desitarget/mtl.py", line 1472, in update_ledger overmtl = process_overrides(overfn) File "/global/homes/j/jlasker/.local/lib/python3.8/site-packages/desitarget/mtl.py", line 1175, in process_overrides mtl["NUMOVERRIDE"] -= 1 File "/global/common/software/desi/cori/desiconda/20200801-1.4.0-spec/conda/lib/python3.8/site-packages/astropy/table/table.py", line 1602, in getitem return self.columns[item] File "/global/common/software/desi/cori/desiconda/20200801-1.4.0-spec/conda/lib/python3.8/site-packages/astropy/table/table.py", line 239, in getitem return OrderedDict.getitem(self, item) KeyError: 'NUMOVERRIDE' """
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/global/cfs/cdirs/desi/survey/catalogs/SV3/LSS/altmtl/debug_jl/runAltMTLParallel.py", line 97, in
Adding additional print/log statements into the process_override function and adjacent functions, I was able to find that it was looking for a non-override MTL (e.g. /global/cfs/cdirs/desi/survey/catalogs/SV3/LSS/altmtl/debug_jl/alt_mtls_testAdamOverrideFix_64dirs//Univ056/sv3/dark/sv3mtl-dark-hp-2594.ecsv) in the process_override function.
I added a check to only try to run process_override within update_ledger if the MTL file contained "override" as part of the file name and that seemed to get it to work.
I think the optimal way to fix this would probably be to modify find_mtl_file_format_from_header with the override option so that it only finds override MTLs, but I am not sure how to do that.
@geordie666 I re-pulled the desitarget code after the recent updates and I'm still getting errors. I also cannot get your example to work. It fails asking for the field "ARCHIVE_DATE"
jlasker@cori05:/global/cscratch1/sd/jlasker/fakesurveyops2/mtl> run_mtl_loop DARK --mtldir /global/cscratch1/sd/jlasker/fakesurveyops2/mtl/
WARNING: leap-second auto-update failed due to the following exception: RuntimeError('Cache is locked after 5.03 s. This may indicate an astropy bug or that kill -9 was used. If you want to unlock the cache remove the directory /global/homes/j/jlasker/.astropy/cache/download/py3/lock.') [astropy.time.core]
INFO:mtl.py:2303:loop_ledger: running on SECONDARY ledger with obscon=DARK and survey=main
Traceback (most recent call last):
File "/global/homes/j/jlasker/.local/bin/run_mtl_loop", line 52, in
We updated the mtl-done-tiles.ecsv
and scnd-mtl-done-tiles.ecsv
files to have a new column ARCHIVEDATE
. This is to facilitate reprocessing of tiles, as the spectroscopic pipeline is now storing the redshift files in the directory $ZCAT_DIR/tiles/archive/ARCHIVEDATE
rather than in the cumulative
directory.
The first step in fixing the loop for the alternate MTLs would be to copy across the "official" MTL done files from:
/global/cfs/cdirs/desi/survey/ops/surveyops/trunk/mtl/mtl-done-tiles.ecsv
/global/cfs/cdirs/desi/survey/ops/surveyops/trunk/mtl/scnd-mtl-done-tiles.ecsv
(which contain the ARCHIVEDATE
column).
Note that I'm currently working on a PR to complicate this yet further by allowing reprocessed tiles to be passed through MTL. So, you might not want to delve into this too deeply until all of the new MTL infrastructure is in place (until then you could use an earlier tag of desitarget
, perhaps?).
@jalasker: I'm just getting back to this after the various extra MTL updates.
Do you want to test the current branch and see if we need further updates to facilitate working with the alternate MTLs?
Also, should we close this PR and pick up any needed work in a new branch?
Sorry, I was traveling home today. I will test the current branch tomorrow and let you know if we still need a fix for the override code with SV3.
Running the tests you described on the MTLs you described/linked above works for me after grabbing the updated files.
It is also not failing on the alternate MTLs but is not updating any targets.
I believe that I have some updates to make to the alt MTL loop, possibly passing some additional options that weren't present in the original loop_ledger function.
@jalasker: Great, glad this works in the short-term. Is it OK to close this PR and work on any other updates in a new branch spawned from the current master branch?
So I think it is fine to work on the updates in a new branch with the latest code version.
However, I am still getting the errors about looking for "NUMOVERRIDE" even with the latest code. (The reason that it didn't fail last night is that I forgot to re-remove the tiles that I originally removed from the mtl-tiles-done file that I copied from your fakesurveyops2 directory. Once I removed those tiles and it actually tried to update the MTLs it failed with the same error.)
OK, let me close this and open an issue instead. I can then work on fixing the NUMOVERRIDE
problem in a separate branch.
Stops adding the OVERRIDE keyword to the meta dictionary for the SV3(and 2 and 1) and CMX MTLs since the presence of that keyword causes problems downstream in the alternate MTL code.
I would prefer to wait on merging this pull request until I have confirmed that it doesn't create undesired effects in the final alternate MTL bitweight products, but I can confirm that it does have the desired effect in the initial and intermediate stages of the alternate MTL pipeline.