Closed by sbailey 3 months ago
Without investigating further, I can say that we've encountered this issue with raw data transfers, although I don't recall investigating it before in the context of spectroscopic processing.
Ultimately this is an upstream problem where the night associated with an exposure might be out of sync with the process that writes the file to disk, and/or the process that creates the symlink that is the signal to transfer an exposure to NERSC. I have occasionally had to log in to KPNO to remove or correct bad symlinks.
In all cases that I have ever seen, the exposure involved is exactly on the night rollover boundary, so it is either the very first or the very last exposure of a night. Other exposures are not affected.
These are rare enough that, from a spectro processing perspective, I think we can treat each one as it comes up.
In this case, since we have other darks to use, I think we mark it to be ignored and move on. Data at the night boundary aren't as representative of the night as data acquired closer to the science data anyway.
Some really old desi-EXPID.fits.fz files don't actually have NIGHT defined as a header keyword, so that complicates things. Is there a range of nights that should be searched?
Is there a range of nights that should be searched?
20201214 and after.
I think it would be sufficient to check the first and last exposure of every night to flag nights of concern; I don't think we need to check every exposure in the middle of the nights.
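A minimal sketch of that boundary check, assuming the raw data layout NIGHT/EXPID/desi-EXPID.fits.fz seen elsewhere in this thread. The function names, the `read_night` hook, and the choice of HDU 1 for the NIGHT keyword are assumptions for illustration; this is not the script that produced the list below.

```python
import os


def read_night_astropy(path):
    """Return the NIGHT header value as a string, or None if unavailable.

    Assumption: NIGHT is stored in HDU 1 of desi-EXPID.fits.fz.
    """
    from astropy.io import fits  # lazy import: the scan itself needs only os

    try:
        return str(fits.getval(path, "NIGHT", ext=1))
    except (KeyError, OSError, IndexError):
        # Keyword missing (some very old files) or file empty/corrupt.
        return None


def boundary_mismatches(data_root, first_night="20201214",
                        read_night=read_night_astropy):
    """Compare header NIGHT to the directory night for the first and last
    exposure of every night at or after first_night.

    Returns a list of (night, expid, "first" | "last", header_night).
    """
    results = []
    nights = sorted(n for n in os.listdir(data_root)
                    if n.isdigit() and len(n) == 8 and n >= first_night)
    for night in nights:
        night_dir = os.path.join(data_root, night)

        def raw_file(expid):
            return os.path.join(night_dir, expid, f"desi-{expid}.fits.fz")

        # Only consider exposure directories that actually hold a raw file.
        expids = sorted(e for e in os.listdir(night_dir)
                        if e.isdigit() and os.path.isfile(raw_file(e)))
        if not expids:
            continue
        boundary = [(expids[0], "first")]
        if len(expids) > 1:
            boundary.append((expids[-1], "last"))
        for expid, which in boundary:
            header_night = read_night(raw_file(expid))
            if header_night != night:
                results.append((night, expid, which, header_night))
    return results
```

A `header_night` of `None` in the result means the keyword could not be read at all, which separates missing or unreadable headers from genuine mismatches.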
Here is the initial list.
NIGHT mismatch for 20201216/00068160 (first exposure).
NIGHT mismatch for 20210104/00070811 (last exposure).
NIGHT mismatch for 20210105/00070812 (first exposure).
NIGHT mismatch for 20210125/00073192 (first exposure).
NIGHT mismatch for 20210128/00073330 (first exposure).
NIGHT mismatch for 20210327/00082373 (first exposure).
NIGHT mismatch for 20210406/00083620 (first exposure).
NIGHT mismatch for 20210411/00084263 (first exposure).
NIGHT mismatch for 20210616/00094802 (first exposure).
NIGHT mismatch for 20210629/00096513 (first exposure).
NIGHT mismatch for 20210816/00098702 (first exposure).
NIGHT mismatch for 20210823/00099136 (first exposure).
NIGHT mismatch for 20210909/00099314 (first exposure).
NIGHT mismatch for 20211019/00105096 (first exposure).
NIGHT mismatch for 20211105/00107440 (first exposure).
NIGHT mismatch for 20211107/00107713 (first exposure).
NIGHT mismatch for 20211109/00108049 (first exposure).
NIGHT mismatch for 20211116/00108976 (first exposure).
NIGHT mismatch for 20211204/00112196 (first exposure).
NIGHT mismatch for 20211205/00112406 (first exposure).
NIGHT mismatch for 20220113/00118403 (first exposure).
NIGHT mismatch for 20221004/00145825 (first exposure).
NIGHT mismatch for 20221109/00152475 (first exposure).
NIGHT mismatch for 20221230/00161209 (first exposure).
NIGHT mismatch for 20230511/00180072 (first exposure).
NIGHT mismatch for 20230623/00186525 (first exposure).
NIGHT mismatch for 20230718/00187460 (first exposure).
NIGHT mismatch for 20230730/00188221 (first exposure).
NIGHT mismatch for 20230823/00192200 (last exposure).
NIGHT mismatch for 20230922/00197450 (first exposure).
NIGHT mismatch for 20231213/00209383 (first exposure).
NIGHT mismatch for 20231214/00209664 (first exposure).
NIGHT mismatch for 20231217/00210182 (first exposure).
NIGHT mismatch for 20231221/00211611 (first exposure).
NIGHT mismatch for 20231222/00211971 (first exposure).
NIGHT mismatch for 20231223/00212748 (first exposure).
NIGHT mismatch for 20231224/00213243 (first exposure).
NIGHT mismatch for 20240312/00229939 (first exposure).
NIGHT mismatch for 20240315/00230511 (first exposure).
NIGHT mismatch for 20240326/00232443 (first exposure).
NIGHT mismatch for 20240404/00234183 (first exposure).
NIGHT mismatch for 20240406/00234577 (first exposure).
NIGHT mismatch for 20240511/00236130 (first exposure).
NIGHT mismatch for 20240512/00236252 (first exposure).
One of these produced an error message:
ERROR: Empty or corrupt FITS file [astropy.io.fits.scripts.fitsheader]
I will track down which one.
20230823/00192200 is the bad file. It is zero bytes on disk.
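Checking file sizes before opening anything would catch this kind of file without relying on a FITS reader error. A small sketch; the helper name `zero_byte_files` is hypothetical:

```python
import os


def zero_byte_files(paths):
    """Return the paths that exist as regular files but contain no data."""
    return [p for p in paths
            if os.path.isfile(p) and os.path.getsize(p) == 0]
```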
For the specific case of 20211205, I flagged the first exposure with the wrong NIGHT keyword as bad since it was tripping up the pipeline, and resubmitted the night. I'll leave this ticket open for when we have time to investigate/fix the impact of others.
20211205 is the only night that caused problems for Jura. The exposure is now flagged as bad so it should also be fine for Kilimanjaro. The other cases are likely darks or test exposures that aren't used by the prod but were running during the day and spanned the rollover. I'm going to close this ticket since we don't need to take action for future prods.
Dark exposure desi/spectro/data/20211205/00112406/desi-00112406.fits.fz has header keywords
i.e. the NIGHT in the header (20211204) does not match the night directory on disk (20211205).
exposure_tables/202112/exposure_table_20211205.csv recognizes this, and includes NIGHT=20211204 for that exposure:
However, when desi_proc_night goes to submit 20211205, it submits the ccdcalib job associated with the NIGHT of the first dark exposure, i.e. 20211204.
The first exposure of 20211204/00112196 also had an incorrect NIGHT keyword, but a different dark with the correct header was selected as the ccdcalib dark for 20211204. That night therefore also wrote a ccdcalib-20211204-00112209-a0123456789.slurm script, and the two nights collided with each other.
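The collision can be illustrated with a short sketch: if ccdcalib jobs are keyed on the header NIGHT of the selected dark, two different processing nights can map to the same job night. The function name and the tuple layout are assumptions for illustration, not the pipeline's actual data structures.

```python
from collections import defaultdict


def colliding_job_nights(darks):
    """Find header nights claimed by more than one processing night.

    darks: iterable of (directory_night, expid, header_night), one entry per
    dark selected as the ccdcalib dark of a processing night.
    Returns {header_night: [directory_night, ...]} for the collisions only.
    """
    by_header = defaultdict(set)
    for dir_night, _expid, header_night in darks:
        by_header[header_night].add(dir_night)
    return {h: sorted(d) for h, d in by_header.items() if len(d) > 1}


# The two darks described in this issue:
selected = [
    ("20211204", "00112209", "20211204"),  # correct header
    ("20211205", "00112406", "20211204"),  # header one night behind
]
collisions = colliding_job_nights(selected)
```

Here both processing nights would submit a ccdcalib job under 20211204, which is the collision described above.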
@weaverba137 I feel like we've encountered this issue before, and maybe even patched some historical data. Could you investigate the history of this, what we may have done in the past, and make an inventory of cases where the NIGHT header of the first exposure of the night is listed as being the previous night, i.e. a mismatch with the directory?
@akremin in this particular case we have multiple darks on the night, so we could just flag the first dark with the bad NIGHT keyword as LASTSTEP=ignore and then resubmit. Or perhaps the calibration selection / pipeline logic should do something with these cases to keep the jobs associated with the processing night even if the header keyword does not match the directory, but that would take longer to develop and test.
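As a rough sketch of the second option, the job night could be taken from the directory rather than the header, with the mismatch reported instead of acted on. Both helper names are hypothetical and this is not existing pipeline code; it only assumes the .../NIGHT/EXPID/desi-EXPID.fits.fz path layout shown above.

```python
import re


def night_from_path(path):
    """Extract the 8-digit night directory from a raw-data path such as
    .../20211205/00112406/desi-00112406.fits.fz; return None if absent."""
    match = re.search(r"(?:^|/)(\d{8})/(\d{8})/[^/]+$", path)
    return match.group(1) if match else None


def job_night(path, header_night):
    """Return (night_to_use, mismatch_flag), preferring the directory night.

    Falls back to the header value when the path has no night directory.
    """
    dir_night = night_from_path(path)
    if dir_night is None:
        return str(header_night), False
    return dir_night, str(header_night) != dir_night
```

With the exposure from this issue, `job_night("desi/spectro/data/20211205/00112406/desi-00112406.fits.fz", 20211204)` keeps the job on 20211205 and sets the mismatch flag so the case can still be logged.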