desihub / desispec

DESI spectral pipeline
BSD 3-Clause "New" or "Revised" License

Tileqa crashing due to tsnr2 fiberqa nan's #1978

Open akremin opened 1 year ago

akremin commented 1 year ago

Short version

Three tiles on one night, 20230119, are getting NaNs assigned to their fiber TSNR2 values, which hasn't happened before or since. Anand has verified that this has not occurred in Iron with the current code tags, so thankfully it seems isolated to these data.

Long version:

We're seeing a new tileqa failure in the 20230119 processing that hadn't previously been an issue. For a subset of tiles it successfully writes out the tileqa*fits file but crashes when generating the tileqa*png. Oddly, it only happened on 3 of 31 tiles, and it is reproducible both on resubmission and when deleting the tileqa*fits file and attempting to regenerate both tileqa files. The affected tiles are two bright-program tiles and one backup tile.

Traceback:

ERROR:util.py:158:runcmd: File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/code/desispec/main/py/desispec/scripts/tileqa.py", line 113, in _wrap_make_tile_qa_plot
    figfile = make_tile_qa_plot(qafitsfile, specprod_dir)
ERROR:util.py:158:runcmd: File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/code/desispec/main/py/desispec/tile_qa_plot.py", line 1265, in make_tile_qa_plot
    exps = get_expids_efftimes(tileqafits, prod)
ERROR:util.py:158:runcmd: File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/code/desispec/main/py/desispec/tile_qa_plot.py", line 958, in get_expids_efftimes
    d["QA_EFFTIME_SPEC"][i] = fits.getheader(fn, "FIBERQA")["EFFTIME"]
ERROR:util.py:158:runcmd: File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/conda/lib/python3.9/site-packages/astropy/io/fits/header.py", line 157, in __getitem__
    card = self._cards[self._cardindex(key)]
ERROR:util.py:158:runcmd: File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/conda/lib/python3.9/site-packages/astropy/io/fits/header.py", line 1754, in _cardindex
    raise KeyError(f"Keyword {keyword!r} not found.")
ERROR:util.py:158:runcmd: KeyError: "Keyword 'EFFTIME' not found."

This also happened for one tile last night, 20230120, but that was tile 21513, which had already failed on 20230119, so it is unclear whether the failure is due to issues in the 20230119 files or not.

Example log: /global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/26064/20230119/logs/tile-qa-26064-thru20230119.log

FITS files:

/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/21513/20230119/tile-qa-21513-thru20230119.fits
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/26064/20230119/tile-qa-26064-thru20230119.fits
/global/cfs/cdirs/desi/spectro/redux/daily/tiles/cumulative/42395/20230119/tile-qa-42395-thru20230119.fits
akremin commented 1 year ago

The following was tracked down by Anand and posted on Slack (https://desisurvey.slack.com/archives/C01HNN87Y7J/p1674316064654409):

Anand's comments:

I think the culprit is cframe-z0-00163835.fits.gz, which has some NaNs in TSNR2. Details: in the exposure-qa log file, the code warns that it cannot write the EFFTIME header keyword (which is then missing for tile_qa_plot):

raichoor@cori07:~> grep WARNING /global/cfs/cdirs/desi/spectro/redux/daily/run/scripts/night/20230119/poststdstar-20230119-00163835-a0123456789-66151195.log | grep EFFTIME
WARNING: Attribute `EFFTIME` of type <class 'numpy.float32'> cannot be added to FITS Header - skipping [astropy.io.fits.convenience]
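
[Editor's note] That warning arises because a NaN cannot be stored as a FITS header keyword value; a minimal illustration (assuming a recent astropy, which refuses NaN header values) is:

import numpy as np
from astropy.io import fits

hdr = fits.Header()
try:
    # expected to raise ValueError: NaN is not representable in a FITS header keyword
    hdr["EFFTIME"] = float(np.nan)
except ValueError as err:
    print("could not write EFFTIME:", err)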

In cframe-z0-00163835.fits.gz, two rows have NaN for TSNR2_BGS_Z (actually for all TSNR2* columns):

>>> import numpy as np
>>> from astropy.table import Table
>>> fn = "/global/cfs/cdirs/desi/spectro/redux/daily/exposures/20230119/00163835/cframe-z0-00163835.fits.gz"
>>> d = Table.read(fn, "SCORES")
>>> np.where(~np.isfinite(d["TSNR2_BGS_Z"]))[0]
array([0, 2])

If that helps, I confirm that only 3 {camera,petal} combinations from 20230119 have some NaNs in EFFTIME, and they correspond to the three tiles you report.

From an interactive node:

import os
import multiprocessing
from glob import glob

import numpy as np
import fitsio
from astropy.io import fits

def _process_fn(fn):
    # camera letter (b/r/z) from the primary header, e.g. "z0" -> "z"
    camera = fits.getheader(fn, 0)["CAMERA"][0]
    key = "TSNR2_LRG_{}".format(camera.upper())  # only check TSNR2_LRG, assuming the issue is independent of the TSNR2 column
    xs = fitsio.read(fn, "SCORES", columns=[key])[key]
    if not np.isfinite(xs.mean()):
        print(os.path.basename(fn), ",".join(np.where(~np.isfinite(xs))[0].astype(str)))

fns = sorted(glob("/global/cfs/cdirs/desi/spectro/redux/daily/exposures/20230119/*/cframe-??-*.fits.gz"))
with multiprocessing.Pool(processes=64) as pool:
    _ = pool.map(_process_fn, fns)

returns:

cframe-z0-00163835.fits.gz 0,2
cframe-z0-00163839.fits.gz 499
cframe-z7-00163860.fits.gz 1

Though the statistics are obviously small, all three are in the z camera, with fibers near the edge of the spectrograph. PS: checking all Jan. 2023 files only returns those three files.

schlafly commented 1 year ago

Note: because this affects the offline pipeline EFFTIMEs, it affects which tiles we actually observe. AP one night revealed that this was actually causing NaNs to go into the estimated times we send to the ICS; we got lucky that we did not try to observe one of these tiles. AP now zeros out exposures with NaN EFFTIMEs, but it's not clear that is the right choice.
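
[Editor's note] As a rough sketch (not the actual AP code; the function name is made up for illustration), zeroing NaN EFFTIMEs before accumulating them amounts to something like:

import numpy as np

def accumulated_efftime(per_exposure_efftimes):
    """Sum per-exposure EFFTIMEs, treating NaN entries as zero contribution."""
    efftimes = np.asarray(per_exposure_efftimes, dtype=float)
    return np.nansum(efftimes)  # nansum skips NaNs, i.e. effectively zeros them

# Example: the NaN exposure contributes nothing instead of poisoning the total.
print(accumulated_efftime([120.0, float("nan"), 95.0]))  # 215.0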

All to say that this issue is more urgent than I had originally appreciated.

araichoor commented 1 year ago

For completeness, below is the list of 2022/2023 cframe files in the daily folder with some NaNs in TSNR2_LRG (I suspect the same is true for any other TSNR2 quantity). This is taken from the discussion here: https://desisurvey.slack.com/archives/C01HNN87Y7J/p1677608948897289.

This always happens for spectrographs z0 or z7, and for fibers on the edge of the spectrograph, and only on backup tiles, except for two cases (bright tiles).

# NIGHT CFRAME TILEID ROWS
#
20220111    cframe-z7-00118175.fits.gz  41253   0
20220111    cframe-z7-00118177.fits.gz  41260   0
20220112    cframe-z0-00118314.fits.gz  42957   0,1
20220113    cframe-z0-00118501.fits.gz  42994   0,2
20220113    cframe-z7-00118502.fits.gz  43017   1,4
20220113    cframe-z7-00118504.fits.gz  43006   0
20220113    cframe-z7-00118505.fits.gz  43009   0,2
20220206    cframe-z0-00121619.fits.gz  40237   0,2
20220206    cframe-z7-00121619.fits.gz  40237   1
20220227    cframe-z7-00124230.fits.gz  41484   0
#
20230110    cframe-z0-00162842.fits.gz  41484   2
20230119    cframe-z0-00163835.fits.gz  26064   0,2
20230119    cframe-z0-00163839.fits.gz  21513   499
20230119    cframe-z7-00163860.fits.gz  42395   1
schlafly commented 1 year ago

Note that this is leading to OBSSTATUS = unknown for this bright tile that should properly be finished and is preventing MTL updates on this tile.

araichoor commented 1 year ago

We had another event like this last night, 20230525; see below. Seven tiles are affected; as before, it is always only the first or last few fibers of the z0 and z7 spectrographs.

@julienguy: would you have any suggestion for the root cause of this, or where to look more closely to better understand it?

I did check that there are so far no occurrences in 2023 other than this 20230525 set and the ones reported above (20230110 and 20230119).

# NIGHT CFRAME TILEID ROWS
20230525    cframe-z0-00182159.fits.gz  21812   499
20230525    cframe-z0-00182154.fits.gz  24908   0
20230525    cframe-z0-00182149.fits.gz  6858    1
20230525    cframe-z0-00182143.fits.gz  83365   1
20230525    cframe-z7-00182166.fits.gz  4566    1
20230525    cframe-z0-00182144.fits.gz  22666   1
20230525    cframe-z7-00182151.fits.gz  8414    1
sbailey commented 1 year ago

PR #2053 fixed the underlying problem, but I'm reopening this ticket until we've reprocessed the old tiles that were impacted.

sbailey commented 1 year ago

For the record, I reprocessed the 3 remaining impacted tiles with:

#- Purge all nights of the impacted tiles
desi_purge_tilenight -t 26064 -n 20230119 --not-dry-run
desi_purge_tilenight -t 26064 -n 20230130 --not-dry-run

desi_purge_tilenight -t 21513 -n 20230119 --not-dry-run
desi_purge_tilenight -t 21513 -n 20230120 --not-dry-run
desi_purge_tilenight -t 21513 -n 20230201 --not-dry-run

desi_purge_tilenight -t 42395 -n 20230119 --not-dry-run
desi_purge_tilenight -t 42395 -n 20230603 --not-dry-run

#- Reprocess in batches
export TILEID=26064 && export NIGHT=20230119
export TILEID=21513 && export NIGHT=20230119
export TILEID=42395 && export NIGHT=20230119

export TILEID=26064 && export NIGHT=20230130
export TILEID=21513 && export NIGHT=20230120
export TILEID=42395 && export NIGHT=20230603

export TILEID=21513 && export NIGHT=20230201

desi_run_night -n $NIGHT --tiles $TILEID --z-submit-types=cumulative \
    --laststeps="all,skysub,fluxcal" --all-tiles --append-to-proc-table \
    --ignore-proc-table-failures \
    &>> ${HOME}/daily/logfiles/daily-${NIGHT}_4.log

#- Regenerate tiles, exposures files
for NIGHT in 20230130 20230201 20230603; do
    /global/homes/d/desi/pdaily/bin/pdaily_tsnr_nightqa ${NIGHT}
done

#- post-facto also updated 20230119 because that is the night that had the NaN
/global/homes/d/desi/pdaily/bin/pdaily_tsnr_nightqa 20230119

#- confirm by hand there are no more NaN (in Python)
from astropy.table import Table
import numpy as np
t = Table.read('tiles-daily.csv')
np.sum(np.isnan(t['EFFTIME_SPEC']))
araichoor commented 1 year ago

Thanks! One remark, though: if I am correct, the per-exposure files (I am checking the cframe files) have not been updated, except for 20230119. I.e., if I rerun the code I posted earlier in this thread for the nights I list, the exposures from 20230119 are no longer reported (great, those were fixed on Jun. 21), but the other ones are still reported; is that intended?

# NIGHT CFRAME TILEID ROWS
#
20220111    cframe-z7-00118175.fits.gz  41253   0
20220111    cframe-z7-00118177.fits.gz  41260   0
20220112    cframe-z0-00118314.fits.gz  42957   0,1
20220113    cframe-z0-00118501.fits.gz  42994   0,2
20220113    cframe-z7-00118502.fits.gz  43017   1,4
20220113    cframe-z7-00118504.fits.gz  43006   0
20220113    cframe-z7-00118505.fits.gz  43009   0,2
20220206    cframe-z0-00121619.fits.gz  40237   0,2
20220206    cframe-z7-00121619.fits.gz  40237   1
20220227    cframe-z7-00124230.fits.gz  41484   0
#
20230110    cframe-z0-00162842.fits.gz  41484   2
schlafly commented 1 year ago

The tileqa and tiles-specstatus for these now look good and I am tempted to close, but I haven't tried to diagnose what Anand is describing above. @sbailey, have you understood the above, and do you think we can close?

akremin commented 1 year ago

@araichoor The fix Stephen implemented catches these at the exposure-qa and tile-qa level, which are both run after the cframes are generated, so I don't think it will ever fix the NaNs in the cframes. I am now curious why you seem to find that it is resolved for one of the nights. @sbailey is away this week, so this will likely have to wait for next week.

araichoor commented 1 year ago

I don't know the reason, but all the per-exposure files were regenerated on Jun. 21 around 14:40 for the three listed/flagged exposures from 20230119, i.e.:

/global/cfs/cdirs/desi/spectro/redux/daily/exposures/20230119/00163835/
/global/cfs/cdirs/desi/spectro/redux/daily/exposures/20230119/00163839/
/global/cfs/cdirs/desi/spectro/redux/daily/exposures/20230119/00163860/

and those new cframe files don't have NaNs anymore; so maybe something changed in the daily pipeline / calibration files, I don't know...

sbailey commented 1 year ago

When I started working on this ticket, I spot-checked some of the files that Anand had identified and found that they had already been reprocessed after his original post (as he mentioned in the previous comment). Rather than check them all, I switched to "working backwards" and reprocessed only the tiles that had NaNs in the tiles file, which were messing up QA. I don't think it is worth reprocessing daily cframes that aren't causing NaN problems in the final tiles files, but if I'm misunderstanding something please let me know and I can dig into it more.

akremin commented 1 year ago

Thanks, Stephen. I agree that we don't need to worry about the cframes in daily. I think the confusion was that Anand expected them to be resolved in the cframes as part of the solution, but I think you intercept it at the QA level instead. If I understand correctly, this means the cframes in future re-processing runs will also have these NaNs? That is slightly less desirable, but it also isn't the end of the world and isn't worth a huge amount of effort to patch.

sbailey commented 1 year ago

@akremin PR #2053 also fixed the underlying issue so that any new cframe processing should not have NaNs. Details: previously traces off the edge of the CCD were treated as having readnoise=0 for those wavelengths, which led to a NaN; PR #2053 updated that to treat them as effectively very large readnoise instead.
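
[Editor's note] A toy numerical sketch (not the actual desispec extraction or TSNR2 code; the array values are invented) of how readnoise=0 off the CCD edge turns into a NaN, and how a very large effective readnoise avoids it:

import numpy as np

readnoise = np.array([3.0, 3.0, 0.0])        # last entry mimics an off-CCD wavelength
template_flux2 = np.array([1.0, 1.0, 0.0])   # no template signal off the edge either
with np.errstate(divide="ignore", invalid="ignore"):
    ivar = 1.0 / readnoise**2                # readnoise 0 -> ivar inf
    tsnr2 = np.sum(template_flux2 * ivar)    # 0.0 * inf -> NaN, so the sum is NaN
print(tsnr2)                                 # nan

# The spirit of the PR #2053 fix: treat off-CCD pixels as very noisy instead,
# so their ivar is ~0 and they simply do not contribute.
readnoise_fixed = np.where(readnoise == 0, 1e8, readnoise)
tsnr2_fixed = np.sum(template_flux2 / readnoise_fixed**2)
print(tsnr2_fixed)                           # finite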

@araichoor I don't think I understand your table of remaining cframes with NaN:

# NIGHT CFRAME TILEID ROWS
#
20220111    cframe-z7-00118175.fits.gz  41253   0
20220111    cframe-z7-00118177.fits.gz  41260   0
20220112    cframe-z0-00118314.fits.gz  42957   0,1
...

e.g. spot checking that last one:

scores = Table.read('exposures/20220112/00118314/cframe-z0-00118314.fits.gz', 'SCORES')
np.any(np.isnan(scores['TSNR2_LRG_Z']))
--> False

What HDU / column or keyword still has a NaN?

araichoor commented 1 year ago

@sbailey: Table.read() masks those values (what a pain).

>>> fn = '/global/cfs/cdirs/desi/spectro/redux/daily/exposures/20220112/00118314/cframe-z0-00118314.fits.gz'

>>> scores = Table.read(fn, 'SCORES')
>>> np.any(np.isnan(scores['TSNR2_LRG_Z']))
False
>>> scores['TSNR2_LRG_Z'][[0,1]]
<MaskedColumn name='TSNR2_LRG_Z' dtype='float64' length=2>
--
--

>>> scores = Table(fitsio.read(fn, 'SCORES'))
>>> np.any(np.isnan(scores['TSNR2_LRG_Z']))
True
>>> scores['TSNR2_LRG_Z'][[0,1]]
<Column name='TSNR2_LRG_Z' dtype='float64' length=2>
nan
nan
sbailey commented 1 year ago

Argh! That "feature" of astropy tables continues to bring pain and suffering. Thanks for the clarification.

weaverba137 commented 1 year ago

But there are workarounds in recent versions of Astropy, e.g. the masked=False option documented at https://docs.astropy.org/en/stable/api/astropy.table.Table.html#astropy.table.Table.
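
[Editor's note] A related workaround sketch, assuming the column came back as a MaskedColumn as in Anand's session above, is to fill the masked entries back to NaN:

>>> import numpy as np
>>> from astropy.table import Table
>>> scores = Table.read(fn, "SCORES")               # fn as in Anand's example above
>>> col = scores["TSNR2_LRG_Z"].filled(np.nan)      # masked entries -> NaN again
>>> np.any(np.isnan(col))
True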

araichoor commented 11 months ago

For info, we tripped again on this for TILEID=42957, exposure 00118314 from 20220112 (https://github.com/desihub/desispec/issues/2095).

This exposure was listed in my 2022/2023 list from https://github.com/desihub/desispec/issues/1978#issuecomment-1449073859, which I re-copy here:

# NIGHT CFRAME TILEID ROWS
#
20220111    cframe-z7-00118175.fits.gz  41253   0
20220111    cframe-z7-00118177.fits.gz  41260   0
20220112    cframe-z0-00118314.fits.gz  42957   0,1   # <= 
20220113    cframe-z0-00118501.fits.gz  42994   0,2
20220113    cframe-z7-00118502.fits.gz  43017   1,4
20220113    cframe-z7-00118504.fits.gz  43006   0
20220113    cframe-z7-00118505.fits.gz  43009   0,2
20220206    cframe-z0-00121619.fits.gz  40237   0,2
20220206    cframe-z7-00121619.fits.gz  40237   1
20220227    cframe-z7-00124230.fits.gz  41484   0
#
20230110    cframe-z0-00162842.fits.gz  41484   2
20230119    cframe-z0-00163835.fits.gz  26064   0,2
20230119    cframe-z0-00163839.fits.gz  21513   499
20230119    cframe-z7-00163860.fits.gz  42395   1

Lastly, I've also run my code snippet on all cframes in the daily folder for May-Dec 2021, and I find the following:

# NIGHT CFRAME TILEID ROWS
20211206    cframe-z7-00112669.fits.gz  42853   0
20211220    cframe-z0-00114859.fits.gz  41685   0
20211218    cframe-z7-00114623.fits.gz  41372   0,1,3
20211218    cframe-z7-00114625.fits.gz  41244   0
20211220    cframe-z7-00114859.fits.gz  41685   0,3
20211217    cframe-z7-00114493.fits.gz  42760   0
20211218    cframe-z7-00114624.fits.gz  41379   0
20211217    cframe-z7-00114487.fits.gz  41710   1,4

Hopefully that list is exhaustive for the main survey. As remarked before, this always happens for the z0 or z7 spectrographs, and for fibers at the edge of the spectrograph.

sbailey commented 3 months ago

Tile 41685 is the only tile for which this propagates to EFFTIME_SPEC=nan in surveyops/trunk/ops/tiles-specstatus.ecsv, which caused it to not be processed for Jura. I'm adding this ticket to the Kilimanjaro dashboard to fix before that run.
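
[Editor's note] A minimal check sketch (the relative surveyops checkout path is an assumption) for spotting any tiles whose EFFTIME_SPEC is NaN in that file:

import numpy as np
from astropy.table import Table

# List tiles with EFFTIME_SPEC = NaN in the specstatus file; adjust the path
# to wherever the surveyops checkout lives.
specstatus = Table.read("surveyops/trunk/ops/tiles-specstatus.ecsv")
efftime = np.asarray(specstatus["EFFTIME_SPEC"], dtype=float)
bad = specstatus[np.isnan(efftime)]
print(bad["TILEID", "EFFTIME_SPEC"])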