desihub / desispec

DESI spectral pipeline
BSD 3-Clause "New" or "Revised" License

Missing folders / files in the daily/tiles/cumulative #1933

Open araichoor opened 1 year ago

araichoor commented 1 year ago

if I'm not mistaken, it looks like some folders or tile-qa files are missing in the daily/tiles/cumulative folder. this issue is an "enhanced" follow-up of https://desisurvey.slack.com/archives/C01HNN87Y7J/p1670109389351019.

from spot-checking https://data.desi.lbl.gov/desi/spectro/redux/daily/run/dashboard/zdashboard.html for a few of the cases, it is correctly reported there that files are missing. we apparently have not been careful in monitoring that zdashboard: maybe we should verify that zdashboard page before announcing a night processed?

I checked nights since 20220901, and accounted for the bad guiding blacklist from https://github.com/desihub/desisurveyops/issues/74.

I find that the following folders are missing:

# FOLDER LASTSTEPS
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/24475/20221009 all=2
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/22431/20221106 all=1
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/25416/20221112 all=1
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/25558/20221112 all=1
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/43006/20221112 all=1
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/25414/20221113 all=2
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/25422/20221113 all=1
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/11414/20221114 all=1
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/7952/20221115 all=1
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/22973/20221124 all=1
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/42014/20221124 all=1
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/3540/20221125 all=1
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/43061/20221125 all=1

and the following tileqa files are missing:

# FN LASTSTEPS
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/43073/20221121/tile-qa-43073-thru20221121.fits all=1
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/43072/20221123/tile-qa-43072-thru20221123.fits all=1
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/43073/20221123/tile-qa-43073-thru20221123.fits all=1
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/83205/20221130/tile-qa-83205-thru20221130.fits all=2
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/83206/20221130/tile-qa-83206-thru20221130.fits all=2
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/83207/20221130/tile-qa-83207-thru20221130.fits all=2
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/83208/20221130/tile-qa-83208-thru20221130.fits all=2
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/83209/20221130/tile-qa-83209-thru20221130.fits all=2
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/43072/20221202/tile-qa-43072-thru20221202.fits all=1
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/43017/20221206/tile-qa-43017-thru20221206.fits all=1
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/83200/20221209/tile-qa-83200-thru20221209.fits all=3
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/83201/20221209/tile-qa-83201-thru20221209.fits all=3
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/83202/20221209/tile-qa-83202-thru20221209.fits all=2
araichoor commented 1 year ago

in case it is useful in the future, here are the lines of code I used:

first grab all exposures with TILEID > 0 since Sept. 2022:

import os
from glob import glob

import numpy as np
from astropy.table import Table, vstack

proddir = "/global/cfs/cdirs/desi/spectro/redux/daily"
ds = []
for month in [202209, 202210, 202211, 202212]:
    mydir = os.path.join(proddir, "exposure_tables", str(month))
    fns = sorted(glob(os.path.join(mydir, "exposure_table_{}??.csv".format(month))))
    for fn in fns:
        d = Table.read(fn)
        d = d[d["TILEID"] > 0]
        for key in ["BADCAMWORD", "BADAMPS"]:
            d[key] = d[key].astype(str)
        ds.append(d)
d = vstack(ds)

then read the 20221031-20221112 bad guiding exposures (and check that those have LASTSTEP=="ignore"):

badexps = Table.read("/global/cfs/cdirs/desi/users/schlafly/misc/badguidingtileexp.ecsv")
# print any bad-guiding exposure whose LASTSTEP is not "ignore"
for i in range(len(badexps)):
    j = np.where(d["EXPID"] == badexps["EXPID"][i])[0][0]
    if d["LASTSTEP"][j] != "ignore":
        print(d["EXPID"][j], d["NIGHT"][j], d["LASTSTEP"][j], d["COMMENTS"][j])

this returns those two exposures; I don't know if that matters:

152853 20221110 skysub dither seq|
152855 20221110 skysub dither seq|

then read the exposures-daily.csv exposures with EFFTIME_SPEC>0:

e = Table.read("/global/cfs/cdirs/desi/spectro/redux/daily/exposures-daily.csv")
e = e[(e["NIGHT"] > 20220900) & (e["EFFTIME_SPEC"] > 0)]

and now I check for each (night, tileid) if there is a folder and a tile-qa*fits, if we expect one:

for night in np.unique(e["NIGHT"]):
    for tileid in np.unique(e["TILEID"][e["NIGHT"] == night]):
        # exposures for that (night, tileid)
        expids = e["EXPID"][(e["TILEID"] == tileid) & (e["NIGHT"] == night)]
        # are all the exposures bad guiding? (if yes, we don't expect the folder to exist)
        allbad = np.in1d(expids, badexps["EXPID"]).sum() == len(expids)
        # record the laststeps, to verify that they are not all "ignore"
        laststeps, counts = np.unique(d["LASTSTEP"][np.in1d(d["EXPID"], expids)], return_counts=True)
        all_laststeps = ",".join(["{}={}".format(laststep, count) for laststep, count in zip(laststeps, counts)])
        # folder and tileqa file we check
        mydir = os.path.join(proddir, "tiles", "cumulative", str(tileid), str(night))
        fn = os.path.join(mydir, "tile-qa-{}-thru{}.fits".format(tileid, night))
        if not allbad:
            if not os.path.isdir(mydir):
                print(mydir, all_laststeps)
            else:
                if not os.path.isfile(fn):
                    print(fn, all_laststeps)
marcelo-alvarez commented 1 year ago

> from spot-checking https://data.desi.lbl.gov/desi/spectro/redux/daily/run/dashboard/zdashboard.html for a few of the cases, it is correctly reported there that files are missing. we apparently have not been careful in monitoring that zdashboard: maybe we should verify that zdashboard page before announcing a night processed?

@araichoor thanks for this.

As I understand it (@sbailey or @akremin could provide context), the sufficient condition for announcing a night as processed is that non-backup tiles have been processed. Missing tiles that are backup or tertiary do not by themselves indicate we have not been careful in monitoring zdashboard, but rather that they have not been considered necessary to announce a night as processed. It would be helpful if you could provide a list of missing tiles that have FA_PRGRM values of 'dark' or 'bright', but not 'backup', 'tertiary*', etc.

araichoor commented 1 year ago

sure. dark tiles are: 1000 <= TILEID < 20000, and bright tiles are: 20000 <= TILEID < 40000.
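The TILEID ranges quoted in this thread can be turned into a small filter for the missing-folder list. This is only a sketch: the dark and bright ranges are the ones given just above, the backup range (40000 <= TILEID < 60000) is the one quoted later in this thread, and `tile_program` and `main_survey_only` are hypothetical helper names, not desispec functions. The paths are shortened to their `tiles/cumulative/<TILEID>/<NIGHT>` tail.

```python
# TILEID ranges as quoted in this thread (not read from desispec itself).
PROGRAM_RANGES = {
    "dark": (1000, 20000),
    "bright": (20000, 40000),
    "backup": (40000, 60000),
}


def tile_program(tileid):
    """Return 'dark', 'bright', 'backup', or 'other' for a TILEID."""
    for program, (lo, hi) in PROGRAM_RANGES.items():
        if lo <= tileid < hi:
            return program
    return "other"


def main_survey_only(paths):
    """Keep only .../<TILEID>/<NIGHT> paths whose tile is dark or bright."""
    kept = []
    for path in paths:
        # the TILEID is the second-to-last path component
        tileid = int(path.rstrip("/").split("/")[-2])
        if tile_program(tileid) in ("dark", "bright"):
            kept.append(path)
    return kept


# a few of the missing folders reported at the top of this issue
missing = [
    "tiles/cumulative/24475/20221009",
    "tiles/cumulative/43006/20221112",
    "tiles/cumulative/11414/20221114",
    "tiles/cumulative/3540/20221125",
]
print(main_survey_only(missing))
```

Anything outside the three ranges (for example the 832xx tertiary tiles) falls into "other", so those would need separate handling if their tileqa files matter.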

I think I remember that some backup tileqa were identified as missing, and it was said to be ok to proceed. for the tertiary* tiles, I think we'd still want those to have the tileqa files, as those are still informative to assess the validity of the information.

araichoor commented 1 year ago

I'm (still) digging into the /global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/3540/20221125 case. I report in case it's useful -- I'll stop here for tonight.

could it be that the folder has been deleted as part of some re-processing operations?

because it did exist when this file was generated:

-rw-r----- 1 desi desi 199K Dec  1 12:36 /global/cfs/cdirs/desi/spectro/redux/daily/nightqa/20221125/petalnz-20221125.pdf

as this tile is here on p.18.

(and, in the case that the zdashboard always scans all nights, it could be that it was ok at the processing time, but then flagged the missing files afterwards, once those had been deleted; I'm just trying to guess...)

araichoor commented 1 year ago

small investigation report about the missing M31 tileqa files (TILEID=83200-83209, FAPRGRM=tertiary13,tertiary14):

I'd say the issue is that some per-exposure petal=7 data from 20221119-20 are not there. this state is ambiguous, because the 20221119-20 data are "bonus" ones; they were taken with only 4 petals; and the nightlogs report that SP7 was problematic. so, it's a bit of a peculiar case...

if useful, some diagnosis commands below.

from a quick look at the log, it looks like the b7 and r7 cframe files are the culprit, e.g.:

raichoor@cori03:~> cat /global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/83200/20221209/logs/tile-qa-83200-thru20221209.log
================ Start of Process 0 ================
ERROR:util.py:85:runcmd: missing input /global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/83200/20221209/coadd-7-83200-thru20221209.fits
ERROR:util.py:85:runcmd: missing input /global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/83200/20221209/redrock-7-83200-thru20221209.fits
CRITICAL:util.py:96:runcmd: FAILED missing required inputs: desispec.scripts.tileqa.main(['-g', 'cumulative', '-n', '20221209', '-t', '83200'])
================= End of Process 0 =================

the spectra-7-83200-thru20221209.fits.gz file is indeed missing, because there are no {b,r}7 cframes for the 20221119 exposures:

raichoor@cori03:~> cat /global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/83200/20221209/logs/spectra-7-83200-thru20221209.log 
================ Start of Process 0 ================
ERROR:util.py:85:runcmd: missing input /global/cfs/cdirs/desi/spectro/redux/daily/exposures/20221119/00153843/cframe-b7-00153843.fits.gz
ERROR:util.py:85:runcmd: missing input /global/cfs/cdirs/desi/spectro/redux/daily/exposures/20221119/00153843/cframe-r7-00153843.fits.gz
ERROR:util.py:85:runcmd: missing input /global/cfs/cdirs/desi/spectro/redux/daily/exposures/20221119/00153844/cframe-b7-00153844.fits.gz
ERROR:util.py:85:runcmd: missing input /global/cfs/cdirs/desi/spectro/redux/daily/exposures/20221119/00153844/cframe-r7-00153844.fits.gz
ERROR:util.py:85:runcmd: missing input /global/cfs/cdirs/desi/spectro/redux/daily/exposures/20221119/00153855/cframe-b7-00153855.fits.gz
ERROR:util.py:85:runcmd: missing input /global/cfs/cdirs/desi/spectro/redux/daily/exposures/20221119/00153855/cframe-r7-00153855.fits.gz
[...]
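For longer logs, the "missing input" lines can be grouped per exposure with a short script. This is a sketch that assumes the log format is exactly as in the excerpts above (`ERROR:...:runcmd: missing input <path>`) and that cframe files are named `cframe-<camera>-<expid>.fits.gz`; `missing_cframes` is a hypothetical helper, not part of desispec.

```python
import re
from collections import defaultdict

# Matches the "missing input" lines shown in the log excerpts above.
MISSING_RE = re.compile(r"ERROR:\S+:runcmd: missing input (\S+)")
# Extracts the camera (e.g. "b7") and the 8-digit exposure ID from a cframe name.
CFRAME_RE = re.compile(r"cframe-([brz]\d)-(\d{8})\.fits")


def missing_cframes(log_text):
    """Map each exposure ID to the sorted cameras whose cframe is reported missing."""
    by_expid = defaultdict(set)
    for path in MISSING_RE.findall(log_text):
        match = CFRAME_RE.search(path)
        if match:
            camera, expid = match.groups()
            by_expid[int(expid)].add(camera)
    return {expid: sorted(cams) for expid, cams in by_expid.items()}


# three lines copied from the spectra-7 log above
log_text = """\
ERROR:util.py:85:runcmd: missing input /global/cfs/cdirs/desi/spectro/redux/daily/exposures/20221119/00153843/cframe-b7-00153843.fits.gz
ERROR:util.py:85:runcmd: missing input /global/cfs/cdirs/desi/spectro/redux/daily/exposures/20221119/00153843/cframe-r7-00153843.fits.gz
ERROR:util.py:85:runcmd: missing input /global/cfs/cdirs/desi/spectro/redux/daily/exposures/20221119/00153844/cframe-b7-00153844.fits.gz
"""
print(missing_cframes(log_text))
```

Missing inputs that are not cframes (such as the coadd and redrock files in the tile-qa log) are skipped by this helper.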
araichoor commented 1 year ago

the other missing tileqa files are for main/backup tiles (40000 <= TILEID < 60000), so clearly not a priority. so I'd say the higher priority is to understand+fix the missing folders for the main bright/dark tiles.

marcelo-alvarez commented 1 year ago

@araichoor update on the missing dark and bright tile directories:

/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/24475/20221009 all=2

marked with "massive cosmic shower" during exp 147196 in the dashboards, but the exposure table still shows laststep = all.

/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/22431/20221106 all=1
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/25416/20221112 all=1
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/25558/20221112 all=1
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/25414/20221113 all=2
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/25422/20221113 all=1
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/11414/20221114 all=1
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/7952/20221115 all=1

These are tiles with flagged bad guidance exposures on different nights than the ones listed, so should not have been deleted. Still investigating.

/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/22973/20221124 all=1
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/3540/20221125 all=1 

These two are likely to have been accidentally deleted during batch reprocessing of 11/21-11/27 with updated desi_spectro_dark models, some time on or around 12/1. They can be regenerated, but this will have to wait until Perlmutter is back up.

marcelo-alvarez commented 1 year ago

> /global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/22431/20221106 all=1
> /global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/25416/20221112 all=1
> /global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/25558/20221112 all=1
> /global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/25414/20221113 all=2
> /global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/25422/20221113 all=1
> /global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/11414/20221114 all=1
> /global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/7952/20221115 all=1
>
> These are tiles with flagged bad guidance exposures on different nights than the ones listed, so should not have been deleted. Still investigating.

Update on the missing bad-guidance-associated tile directories:

I used desi_purge_tilenight to reprocess tiles with bad exposures, as described in desisurveyops/74, and this (correctly) removed all corresponding data for exposures on the same tile on or after the bad exposures.

However, I did not reprocess those tiles on the nights after, because 1) I (erroneously) did not re-run desi_run_night for 11/13, 11/14, and 11/15, on which there were no bad guidance exposures; and 2) for 11/06 and 11/12, while desi_purge_tilenight removes all data associated with the tile on or after the night provided, it only modifies the processing tables for the night, not the subsequent nights, so even for (1) it would not have helped to have run desi_run_night on those nights.

I will submit the seven tiles above on cori now and update when done.

I will wait on the other three, two lost during dark-associated reprocessing and the cosmic ray shower one, until perlmutter is back up and I get feedback from survey ops, respectively.
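The gap described above (the purge updates only the processing table of the night it is given) suggests a simple bookkeeping check: after purging a tile from a given night, list the later nights on which the same tile was observed, since those need to be re-run by hand. The sketch below is plain Python and does not call desi_purge_tilenight or desi_run_night; `nights_needing_rerun` is a hypothetical helper and the nights in the example are made-up values for illustration.

```python
def nights_needing_rerun(tile_nights, purge_night):
    """Nights strictly after `purge_night` on which the tile was also observed.

    Assumption (from the description above): the purge removes the tile's data
    on or after `purge_night`, but only updates the processing table of
    `purge_night` itself, so later nights have to be revisited separately.
    """
    return sorted(night for night in set(tile_nights) if night > purge_night)


# illustrative nights only, not taken from the exposure tables
tile_nights = [20221030, 20221101, 20221106, 20221106]
print(nights_needing_rerun(tile_nights, purge_night=20221101))
```

In practice the list of nights per tile could be taken from the exposure tables, as in the check script earlier in this thread.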

araichoor commented 1 year ago

sounds great! thanks a lot for the detective work :)

if useful, about the lower-priority backup tiles: this failure $DESI_ROOT/spectro/redux/daily/tiles/cumulative/43072/20221123/ is identified in slack here: https://desisurvey.slack.com/archives/C011JS5GW5U/p1669775558895179. I notice that the folder exists, but is fully empty (except log files); my code above only identified the missing tile-qa*.fits file, but the problem is rooted in missing cframe files, the slack message says. maybe it also explains 43073 and 43017, I don't know...
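a possible extension of the check, to tell "folder exists but only contains logs" apart from "only the tile-qa file is missing". this is a sketch: it assumes the logs live in a `logs/` subfolder (as in the log paths quoted earlier in this thread), `only_logs` is a hypothetical helper name, and the demo runs on a temporary directory rather than the real tree.

```python
import os
import tempfile


def only_logs(tiledir):
    """True if `tiledir` exists but holds nothing besides a logs/ entry.

    A completely empty folder also counts as True.
    """
    if not os.path.isdir(tiledir):
        return False
    return all(name == "logs" for name in os.listdir(tiledir))


# self-contained demo on a temporary directory (not the real DESI tree)
with tempfile.TemporaryDirectory() as tmp:
    tiledir = os.path.join(tmp, "43072", "20221123")
    os.makedirs(os.path.join(tiledir, "logs"))
    print(only_logs(tiledir))
    # once any product file is present, the folder no longer counts as log-only
    open(os.path.join(tiledir, "coadd-0-43072-thru20221123.fits"), "w").close()
    print(only_logs(tiledir))
```

in the loop above, this could be tested right after the `os.path.isdir(mydir)` check, so that such folders are reported separately.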

marcelo-alvarez commented 1 year ago

Yes, it is not that unusual it seems to leave the backup tiles out of daily processing for the sake of expediency, and it's hard to imagine chasing every single one back over many months, so I'm inclined to continue with the tradition of not trying to save them once they've been 'thrown overboard'. They will get caught in the big productions, right?

araichoor commented 1 year ago

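for what it's worth, I've also browsed the https://data.desi.lbl.gov/desi/spectro/redux/daily/run/dashboard/zdashboard.html for Sep/Oct/Nov/Dec 2022, checking what is reported missing w.r.t. my list.

I also find this case that I didn't report with my check:

* TILEID=7946 on 20221126: no folder in `tiles/cumulative`; there is one exposure `00154927`, which has `LASTSTEP="all"`; but the processing stopped at the frame and fiberflatexp files; this exposure does not appear in the `exposures-daily.fits`.

it's a dark tile exposure, so I'd put it to higher priority to understand what's happening there.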

araichoor commented 1 year ago

"... it seems to leave the backup tiles out of daily processing ... " => it could be, I've not checked; but my status page tripped on it, which didn't happen before, so I suspect it's not a "usual" failure mode here. I agree it's not high-priority, and I can with no problem put a hack in my status code to ignore those cases, but I feel it'd be nice to understand what's happening here (and ideally fix it!).

marcelo-alvarez commented 1 year ago

> /global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/22431/20221106 all=1
> /global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/25416/20221112 all=1
> /global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/25558/20221112 all=1
> /global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/25414/20221113 all=2
> /global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/25422/20221113 all=1
> /global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/11414/20221114 all=1
> /global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/7952/20221115 all=1

@araichoor the tiles above have finished processing and appear normal in the zdashboard. Please have a look, thanks.

marcelo-alvarez commented 1 year ago

> for what it's worth, I've also browsed the https://data.desi.lbl.gov/desi/spectro/redux/daily/run/dashboard/zdashboard.html for Sep/Oct/Nov/Dec 2022, checking what is reported missing w.r.t. my list.
>
> I also find this case that I didn't report with my check:
>
> * TILEID=7946 on 20221126: no folder in `tiles/cumulative`; there is one exposure `00154927`, which has `LASTSTEP="all"`; but the processing stopped at the frame and fiberflatexp files; this exposure does not appear in the `exposures-daily.fits`.
>
> it's a dark tile exposure, so I'd put it to higher priority to understand what's happening there.

Thanks @araichoor. I noted this issue when reporting 11/26 processing as done, and it was solved in https://github.com/desihub/desispec/pull/1927, but did not get propagated through to reprocessing 7946 before Perlmutter went down for maintenance. I am trying not to mix Perlmutter and Cori processing on any given night, so this will have to wait until Perlmutter is back up. Please don't hesitate to remind me if you notice it missing again.

araichoor commented 1 year ago

great, 7946 is already on the radar, and ready to be fixed!

and thanks for the reprocessing; as far as I can tell, looks good.

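if I'm correct, these are the remaining main bright/dark tiles (I am ignoring backup tiles here):

* `cumulative/22973/20221124` => accidentally deleted, will be re-processed when perlmutter is back up;

* `cumulative/3540/20221125` => accidentally deleted, will be re-processed when perlmutter is back up;

* `cumulative/7946/20221126` => will be re-processed when perlmutter is back up.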

akremin commented 1 year ago

Tile 24475 is associated with https://github.com/desihub/desisurveyops/issues/69 and is on my radar. I'll clean that up in my next surveyops ticket sweep.

marcelo-alvarez commented 1 year ago

> * `cumulative/22973/20221124` => accidentally deleted, will be re-processed when perlmutter is back up;
>
> * `cumulative/3540/20221125` => accidentally deleted, will be re-processed when perlmutter is back up;
>
> * `cumulative/7946/20221126` => will be re-processed when perlmutter is back up.

Thanks for the summary. Yes, I will reprocess these when Perlmutter is back up.

I intend to do a general cleanup of backup tiles in Oct/Nov/Dec that should have been marked as bad and/or were accidentally deleted, etc. (i.e. most if not all the ones mentioned above, and others), but not to report details of that process here. @araichoor I hope this works for you, and thanks again for helping to find these cases.