sbailey opened this issue 1 month ago
More info: I have only seen this error when DistTargetsDESI is reading the 3D data of the resolution matrix. I have not seen it when other pipeline steps read the equivalent HDUs of other upstream files, and I have not seen it when Redrock reads other 2D HDUs. So an intermittent I/O problem smells like something on the NERSC side, but the fact that it is isolated to a specific code reading a specific HDU smells like something on our side. Or it could be a super corner case where the way this particular code reads that particular HDU triggers some I/O system bug.
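For context, the failing read is roughly of this shape. This is a minimal sketch, not the actual DistTargetsDESI code; the HDU name "RESOLUTION", the array layout, and the row-subset pattern are assumptions based on the description above:

```python
# Sketch of the kind of read that fails (assumptions: the resolution matrix
# lives in a 3D image HDU, here called "RESOLUTION", and the caller keeps
# only a subset of target rows).
from astropy.io import fits
import numpy as np

def read_resolution(filename, rows, hduname="RESOLUTION"):
    """Read a row subset of a 3D resolution-matrix HDU.

    filename : coadd file under /dvs_ro or /global/cfs
    rows     : indices of the targets to keep
    """
    with fits.open(filename, memmap=True) as hdul:
        data = hdul[hduname].data        # expected shape (ntarget, ndiag, nwave)
        subset = np.asarray(data[rows])  # fancy indexing forces the actual I/O here
    return subset
```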
I haven't heard of any other reports at NERSC that sound similar.
My initial suspicion would be a job dependency issue. For the main-dark-17625 case (job 26153196), it looks like the timestamp on the input file is more recent than the job end time (2024-05-29T07:38:58). Is that expected?
-rw-r----- 1 desi desi 1.2G May 29 10:25 /global/cfs/cdirs/desi/spectro/redux/jura/healpix/main/dark/176/17625/coadd-main-dark-17625.fits
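A quick way to double-check that dependency question is to compare the file mtime against the reported job end time. This is only a sketch: the job end time is copied from the comment above, and it assumes both timestamps are in the same local timezone as the listing.

```python
# Sketch of a timestamp sanity check: is the input file newer than the job end?
# (Job end time copied from the comment above; assumes the same local timezone.)
import os
from datetime import datetime

coadd = "/global/cfs/cdirs/desi/spectro/redux/jura/healpix/main/dark/176/17625/coadd-main-dark-17625.fits"
job_end = datetime.fromisoformat("2024-05-29T07:38:58")

mtime = datetime.fromtimestamp(os.path.getmtime(coadd))
print(f"file mtime : {mtime}")
print(f"job end    : {job_end}")
print("file modified AFTER job ended" if mtime > job_end else "file predates job end")
```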
During the Jura run, we have encountered multiple cases of I/O errors of the form:
The incorrect array size varies between jobs, and the same files/code work when resubmitted, though admittedly, due to checkpoint/restart, the resubmitted jobs resume from the previously failed step and don't exactly reproduce all prior history.
There are other examples as well (some failing when qso_qn calls redrock, some during the original redrock run).
The infiles are read from /dvs_ro/cfs/cdirs/desi/spectro/redux/jura/..., but it is unclear whether this is a CFS bug, an astropy installation bug, or (less likely?) some corner case with the rows slicing. Documenting it here for the search record.
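One way to narrow down the CFS-vs-astropy question when a failure occurs would be to re-read the same HDU through the /dvs_ro read-only mount and directly through /global/cfs and compare the results. A sketch only: the HDU name "RESOLUTION" is an assumption, and this is not part of the pipeline:

```python
# Diagnostic sketch: read the same HDU through the DVS read-only mount and
# through /global/cfs, then compare shapes and contents.
from astropy.io import fits
import numpy as np

def compare_mounts(relpath, hduname="RESOLUTION"):
    """relpath: path below cfs/cdirs, e.g. 'desi/spectro/redux/jura/.../coadd-....fits'"""
    a = fits.getdata("/dvs_ro/cfs/cdirs/" + relpath, extname=hduname)
    b = fits.getdata("/global/cfs/cdirs/" + relpath, extname=hduname)
    print("dvs_ro shape:", a.shape, " cfs shape:", b.shape)
    if a.shape == b.shape:
        print("identical data:", np.array_equal(a, b))
    else:
        print("SHAPE MISMATCH between mounts")
```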
@dmargala does this sound familiar from any other reports at NERSC? I could file a NERSC ticket too, but the DESI MPI+astropy combination is so specific that I'm not sure how useful that would be.