sbailey opened this issue 1 month ago
More info: I have only seen this error when DistTargetsDESI is reading the 3D data of the resolution matrix. I have not seen it when other pipeline steps read the equivalent HDUs of other upstream files, and I have not seen it when Redrock reads other 2D HDUs. So an intermittent I/O problem smells like something on the NERSC side, but the fact that it is isolated to a specific code reading a specific HDU smells like something on our side. Or it could be a super corner case where the way this particular code reads that particular HDU triggers some I/O system bug.
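For context, the failing read is roughly of this shape. This is a minimal sketch, not the actual DistTargetsDESI code; the HDU name "RESOLUTION", the array layout, and the row-subset pattern are assumptions based on the description above:

```python
# Sketch of the kind of read that fails (assumptions: the resolution matrix
# lives in a 3D image HDU, here called "RESOLUTION", and the caller keeps
# only a subset of target rows).
from astropy.io import fits
import numpy as np

def read_resolution(filename, rows, hduname="RESOLUTION"):
    """Read a row subset of a 3D resolution-matrix HDU.

    filename : coadd file under /dvs_ro or /global/cfs
    rows     : indices of the targets to keep
    """
    with fits.open(filename, memmap=True) as hdul:
        data = hdul[hduname].data        # expected shape (ntarget, ndiag, nwave)
        subset = np.asarray(data[rows])  # fancy indexing forces the actual I/O here
    return subset
```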
I haven't heard of any other reports at NERSC that sound similar.
My initial suspicion would be a job dependency issue. For the main-dark-17625 case (job 26153196), it looks like the timestamp on the input file is more recent than the job end time (2024-05-29T07:38:58). Is that expected?
-rw-r----- 1 desi desi 1.2G May 29 10:25 /global/cfs/cdirs/desi/spectro/redux/jura/healpix/main/dark/176/17625/coadd-main-dark-17625.fits
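A quick way to double-check that dependency question is to compare the file mtime against the reported job end time. This is only a sketch: the job end time is copied from the comment above, and it assumes both timestamps are in the same local timezone as the listing.

```python
# Sketch of a timestamp sanity check: is the input file newer than the job end?
# (Job end time copied from the comment above; assumes the same local timezone.)
import os
from datetime import datetime

coadd = "/global/cfs/cdirs/desi/spectro/redux/jura/healpix/main/dark/176/17625/coadd-main-dark-17625.fits"
job_end = datetime.fromisoformat("2024-05-29T07:38:58")

mtime = datetime.fromtimestamp(os.path.getmtime(coadd))
print(f"file mtime : {mtime}")
print(f"job end    : {job_end}")
print("file modified AFTER job ended" if mtime > job_end else "file predates job end")
```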
During the Jura run, we have encountered multiple cases of I/O errors of the form:
The incorrect array size varies between jobs, and the same files/code work when resubmitted, though admittedly, due to checkpoint/restart, the resubmitted jobs resume from the previously failed step and don't exactly reproduce all prior history.
There are other examples as well (some failing when qso_qn calls redrock, some during the original redrock run).
The infiles are read from /dvs_ro/cfs/cdirs/desi/spectro/redux/jura/..., but it is unclear whether this is a CFS bug, an astropy installation bug, or (less likely?) some corner case with the rows slicing. Documenting it here for the search record.
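One way to narrow down the CFS-vs-astropy question when a failure occurs would be to re-read the same HDU through the /dvs_ro read-only mount and directly through /global/cfs and compare the results. A sketch only: the HDU name "RESOLUTION" is an assumption, and this is not part of the pipeline:

```python
# Diagnostic sketch: read the same HDU through the DVS read-only mount and
# through /global/cfs, then compare shapes and contents.
from astropy.io import fits
import numpy as np

def compare_mounts(relpath, hduname="RESOLUTION"):
    """relpath: path below cfs/cdirs, e.g. 'desi/spectro/redux/jura/.../coadd-....fits'"""
    a = fits.getdata("/dvs_ro/cfs/cdirs/" + relpath, extname=hduname)
    b = fits.getdata("/global/cfs/cdirs/" + relpath, extname=hduname)
    print("dvs_ro shape:", a.shape, " cfs shape:", b.shape)
    if a.shape == b.shape:
        print("identical data:", np.array_equal(a, b))
    else:
        print("SHAPE MISMATCH between mounts")
```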
@dmargala does this sound familiar from any other reports at NERSC? I could file a NERSC ticket too, but the DESI MPI+astropy combination is so specific that I'm not sure how useful that would be.