desihub / desispec

DESI spectral pipeline
BSD 3-Clause "New" or "Revised" License

raw data with incorrect NIGHT keyword causes pipelining problems #2248

Closed sbailey closed 3 months ago

sbailey commented 4 months ago

Dark exposure desi/spectro/data/20211205/00112406/desi-00112406.fits.fz has header keywords

NIGHT   =             20211204

i.e. the NIGHT in the header (20211204) does not match the night directory on disk (20211205).
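As a rough illustration of the check being described, the sketch below compares a header NIGHT value against the night directory in the raw file path. The function names `night_from_path` and `night_mismatch` are hypothetical and not part of desispec, and the header value is passed in as a plain integer rather than read from the FITS file.

```python
import re
from pathlib import PurePosixPath


def night_from_path(path):
    """Return the YYYYMMDD night directory that contains a raw exposure file.

    Assumes the layout .../NIGHT/EXPID/desi-EXPID.fits.fz, as in the path above.
    """
    parts = PurePosixPath(path).parts
    if len(parts) < 3 or not re.fullmatch(r"\d{8}", parts[-3]):
        raise ValueError(f"no night directory found in {path}")
    return int(parts[-3])


def night_mismatch(path, header_night):
    """True if the header NIGHT value disagrees with the directory on disk."""
    return int(header_night) != night_from_path(path)


raw = "desi/spectro/data/20211205/00112406/desi-00112406.fits.fz"
print(night_mismatch(raw, 20211204))  # header NIGHT quoted above; prints True
```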

exposure_tables/202112/exposure_table_20211205.csv recognizes this, and includes NIGHT=20211204 for that exposure:

EXPID,OBSTYPE,TILEID,...,NIGHT,HEADERERR,EXPFLAG,COMMENTS
112406,dark,-99,...,20211204,|,|,|
112407,dark,-99,...,20211205,|,|,|
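A minimal sketch of spotting such rows in an exposure table, assuming a cut-down table that keeps only the EXPID, OBSTYPE and NIGHT columns (the real table has more columns, elided above). The helper name `mismatched_rows` is hypothetical.

```python
import csv
import io


def mismatched_rows(table_text, table_night):
    """Return EXPIDs whose NIGHT column differs from the night the table is for."""
    reader = csv.DictReader(io.StringIO(table_text))
    return [int(row["EXPID"]) for row in reader if int(row["NIGHT"]) != table_night]


# Simplified stand-in for exposure_table_20211205.csv (assumed subset of columns).
table = """EXPID,OBSTYPE,NIGHT
112406,dark,20211204
112407,dark,20211205
"""
print(mismatched_rows(table, 20211205))  # prints [112406]
```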

However, when desi_proc_night goes to submit 20211205, it submits the ccdcalib job under the NIGHT of the first dark exposure, i.e. 20211204.

The first exposure of 20211204 (00112196) also had an incorrect NIGHT keyword, but a different dark with the correct header was selected as the ccdcalib dark for 20211204. That night therefore also wrote a ccdcalib-20211204-00112209-a0123456789.slurm script, and the two nights collided with each other.

@weaverba137 I feel like we've encountered this issue before, and maybe even patched some historical data. Could you investigate the history of this, what we may have done in the past, and make an inventory of cases where the NIGHT header of the first exposure of the night is listed as being the previous night, i.e. a mismatch with the directory?

@akremin in this particular case we have multiple darks on the night, so we could just flag the first dark with the bad NIGHT keyword as LASTSTEP=ignore and then resubmit. Or perhaps the calibration selection / pipeline logic should handle these cases so that jobs stay associated with the processing night even if the header keyword mismatches the directory, but that would take longer to develop and test.
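To make the two options concrete, here is a hedged sketch, not the actual desi_proc_night logic: it skips darks flagged LASTSTEP=ignore and keys the job to the processing night rather than to the exposure's header NIGHT. The function `choose_ccdcalib`, the dict-based exposure rows, and the default LASTSTEP value of "all" are assumptions made for illustration.

```python
def choose_ccdcalib(exposures, processing_night):
    """Return (night, expid) for the ccdcalib job, or None if no usable dark.

    exposures: list of dicts with EXPID, OBSTYPE, NIGHT and optionally LASTSTEP.
    """
    for exp in exposures:
        if exp["OBSTYPE"] == "dark" and exp.get("LASTSTEP", "all") != "ignore":
            # Key the job to the directory night, not exp["NIGHT"] from the header.
            return processing_night, exp["EXPID"]
    return None


exposures = [
    {"EXPID": 112406, "OBSTYPE": "dark", "NIGHT": 20211204, "LASTSTEP": "ignore"},
    {"EXPID": 112407, "OBSTYPE": "dark", "NIGHT": 20211205, "LASTSTEP": "all"},
]
print(choose_ccdcalib(exposures, 20211205))  # prints (20211205, 112407)
```

With the first dark flagged as ignored, the selection falls through to the next dark, and the returned night is always the one being processed.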

weaverba137 commented 4 months ago

Without investigating further, I can say that we've encountered this issue with raw data transfers, although I'm not aware of investigating this before in the context of spectroscopic processing.

Ultimately this is an upstream problem where the night associated with an exposure might be out of sync with the process that writes the file to disk, and/or the process that creates the symlink that is the signal to transfer an exposure to NERSC. I have occasionally had to log in to KPNO to remove or correct bad symlinks.

In all cases that I have ever seen, the exposure involved is exactly on the night rollover boundary, so it is either the very first or the very last exposure of a night. Other exposures are not affected.

akremin commented 4 months ago

These are rare enough that, from a spectro processing perspective, I think we can treat each case as it comes up.

In this case, since we have other darks to use, I think we mark it to be ignored and move on. Data at the night boundary aren't as representative of the night as data acquired closer to the science data anyway.

weaverba137 commented 4 months ago

Some really old desi-EXPID.fits.fz files don't actually have NIGHT defined as a header keyword, so that complicates things. Is there a range of nights that should be searched?

sbailey commented 4 months ago

> Is there a range of nights that should be searched?

20201214 and after.

I think it would be sufficient to check the first and last exposure of every night to flag nights of concern; I don't think we need to check every exposure in the middle of the nights.
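One way to sketch that reduced search, assuming the exposure IDs for each night are already available as a mapping from night to a list of EXPIDs (how that listing is built is not shown). The function name `boundary_exposures` and the toy numbers in the example are hypothetical.

```python
def boundary_exposures(expids_by_night, first_night=20201214):
    """Return {night: [first_expid, last_expid]} for nights on or after first_night."""
    result = {}
    for night, expids in sorted(expids_by_night.items()):
        if night < first_night or not expids:
            continue
        ordered = sorted(expids)
        # A night with a single exposure yields that exposure only once.
        result[night] = sorted({ordered[0], ordered[-1]})
    return result


# Toy listing, not real DESI exposure IDs.
listing = {20201213: [1, 2], 20201214: [3, 4, 5], 20201215: [6]}
print(boundary_exposures(listing))  # prints {20201214: [3, 5], 20201215: [6]}
```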

weaverba137 commented 4 months ago

Here is the initial list.

NIGHT mismatch for 20201216/00068160 (first exposure).
NIGHT mismatch for 20210104/00070811 (last exposure).
NIGHT mismatch for 20210105/00070812 (first exposure).
NIGHT mismatch for 20210125/00073192 (first exposure).
NIGHT mismatch for 20210128/00073330 (first exposure).
NIGHT mismatch for 20210327/00082373 (first exposure).
NIGHT mismatch for 20210406/00083620 (first exposure).
NIGHT mismatch for 20210411/00084263 (first exposure).
NIGHT mismatch for 20210616/00094802 (first exposure).
NIGHT mismatch for 20210629/00096513 (first exposure).
NIGHT mismatch for 20210816/00098702 (first exposure).
NIGHT mismatch for 20210823/00099136 (first exposure).
NIGHT mismatch for 20210909/00099314 (first exposure).
NIGHT mismatch for 20211019/00105096 (first exposure).
NIGHT mismatch for 20211105/00107440 (first exposure).
NIGHT mismatch for 20211107/00107713 (first exposure).
NIGHT mismatch for 20211109/00108049 (first exposure).
NIGHT mismatch for 20211116/00108976 (first exposure).
NIGHT mismatch for 20211204/00112196 (first exposure).
NIGHT mismatch for 20211205/00112406 (first exposure).
NIGHT mismatch for 20220113/00118403 (first exposure).
NIGHT mismatch for 20221004/00145825 (first exposure).
NIGHT mismatch for 20221109/00152475 (first exposure).
NIGHT mismatch for 20221230/00161209 (first exposure).
NIGHT mismatch for 20230511/00180072 (first exposure).
NIGHT mismatch for 20230623/00186525 (first exposure).
NIGHT mismatch for 20230718/00187460 (first exposure).
NIGHT mismatch for 20230730/00188221 (first exposure).
NIGHT mismatch for 20230823/00192200 (last exposure).
NIGHT mismatch for 20230922/00197450 (first exposure).
NIGHT mismatch for 20231213/00209383 (first exposure).
NIGHT mismatch for 20231214/00209664 (first exposure).
NIGHT mismatch for 20231217/00210182 (first exposure).
NIGHT mismatch for 20231221/00211611 (first exposure).
NIGHT mismatch for 20231222/00211971 (first exposure).
NIGHT mismatch for 20231223/00212748 (first exposure).
NIGHT mismatch for 20231224/00213243 (first exposure).
NIGHT mismatch for 20240312/00229939 (first exposure).
NIGHT mismatch for 20240315/00230511 (first exposure).
NIGHT mismatch for 20240326/00232443 (first exposure).
NIGHT mismatch for 20240404/00234183 (first exposure).
NIGHT mismatch for 20240406/00234577 (first exposure).
NIGHT mismatch for 20240511/00236130 (first exposure).
NIGHT mismatch for 20240512/00236252 (first exposure).

One of these produced an error message:

ERROR: Empty or corrupt FITS file [astropy.io.fits.scripts.fitsheader]

I will track down which one.

weaverba137 commented 4 months ago

20230823/00192200 is the bad file. It is zero bytes on disk.
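A zero-byte file like this can be screened out before any header is read. The sketch below is only an illustration using the standard library; the helper name `is_empty_file` and the temporary file name are hypothetical.

```python
import os
import tempfile


def is_empty_file(path):
    """True if the file exists and is zero bytes on disk."""
    return os.path.getsize(path) == 0


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a raw exposure that was written with no content.
    empty = os.path.join(tmp, "desi-00000000.fits.fz")
    open(empty, "wb").close()
    print(is_empty_file(empty))  # prints True
```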

sbailey commented 4 months ago

For the specific case of 20211205, I flagged the first exposure with the wrong NIGHT keyword as bad since it was tripping up the pipeline, and resubmitted the night. I'll leave this ticket open for when we have time to investigate/fix the impact of others.

sbailey commented 3 months ago

20211205 is the only night that caused problems for Jura. The exposure is now flagged as bad so it should also be fine for Kilimanjaro. The other cases are likely darks or test exposures that aren't used by the prod but were running during the day and spanned the rollover. I'm going to close this ticket since we don't need to take action for future prods.