desihub / nightwatch

DESI Nightwatch: online data quality assurance
BSD 3-Clause "New" or "Revised" License
2 stars 2 forks source link

Failed processing of expid 165869 on 20230201 #334

Open sybenzvi opened 1 year ago

sybenzvi commented 1 year ago

Science exposure 165869 failed to process at NERSC on 20230201 and needs to be processed manually.

sybenzvi commented 1 year ago

Attempted to process this exposure on cori using

nightwatch run --outdir $CFS/desi/spectro/nightwatch/nersc -n 20230201 -e 165869

and encountered the following errors:

07:30 Running QA analysis
here qa.runner /global/cfs/cdirs/desi/spectro/nightwatch/nersc/20230201/00165869
INFO:runner.py:89:run: 07:30 Running QAAmp PER_AMP
WARNING:runner.py:94:run: QAAmp failed on /global/cfs/cdirs/desi/spectro/nightwatch/nersc/20230201/00165869 because FITSIO status = 107: tried to move past end of file
ffopen could not interpret primary array header of file: 
/global/cfs/cdirs/desi/spectro/nightwatch/nersc/20230201/00165869/preproc-z6-001
65869_tmp8978.fits
; skipping
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/conda/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/conda/lib/python3.9/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/code/nightwatch/master/py/nightwatch/qa/amp.py", line 98, in get_dico
    hdr = fitsio.read_header(filename, 'IMAGE') #- for readnoise, bias
  File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/conda/lib/python3.9/site-packages/fitsio/fitslib.py", line 198, in read_header
    _fits = _fitsio_wrap.FITS(filename, READONLY, dont_create)
OSError: FITSIO status = 107: tried to move past end of file
ffopen could not interpret primary array header of file: 
/global/cfs/cdirs/desi/spectro/nightwatch/nersc/20230201/00165869/preproc-z6-001
65869_tmp8978.fits

"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/code/nightwatch/master/py/nightwatch/qa/runner.py", line 92, in run
    qa_results = qa.run(indir)
  File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/code/nightwatch/master/py/nightwatch/qa/amp.py", line 75, in run
    results = pool.starmap(get_dico, argslist)
  File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/conda/lib/python3.9/multiprocessing/pool.py", line 372, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/conda/lib/python3.9/multiprocessing/pool.py", line 771, in get
    raise self._value
  File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/conda/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/conda/lib/python3.9/multiprocessing/pool.py", line 51, in starmapstar
    return list(itertools.starmap(args[0], args[1]))
  File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/code/nightwatch/master/py/nightwatch/qa/amp.py", line 98, in get_dico
    hdr = fitsio.read_header(filename, 'IMAGE') #- for readnoise, bias
  File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/conda/lib/python3.9/site-packages/fitsio/fitslib.py", line 198, in read_header
    _fits = _fitsio_wrap.FITS(filename, READONLY, dont_create)
OSError: FITSIO status = 107: tried to move past end of file
ffopen could not interpret primary array header of file: 
/global/cfs/cdirs/desi/spectro/nightwatch/nersc/20230201/00165869/preproc-z6-001
65869_tmp8978.fits

Then during updates of the exposure tables, encountered an additional error as a consequence of the previous failed update of the QA tables:

07:31 Updating night/exposure summary tables
INFO:run.py:638:write_tables: Tabulating exposures in /global/cfs/cdirs/desi/spectro/nightwatch/nersc
...
exptime=0, chose threshold file /global/common/software/desi/cori/desiconda/20211217-2.0.0/code/nightwatch/master/py/nightwatch/threshold_files/DX-20220129.json
exptime=0, chose threshold file /global/common/software/desi/cori/desiconda/20211217-2.0.0/code/nightwatch/master/py/nightwatch/threshold_files/DY-20220129.json
exptime=0, chose threshold file /global/common/software/desi/cori/desiconda/20211217-2.0.0/code/nightwatch/master/py/nightwatch/threshold_files/XSIG-20190718.json
exptime=0, chose threshold file /global/common/software/desi/cori/desiconda/20211217-2.0.0/code/nightwatch/master/py/nightwatch/threshold_files/YSIG-20190718.json
Traceback (most recent call last):
  File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/code/nightwatch/master/bin/nightwatch", line 6, in <module>
    sys.exit(nightwatch.script.main())
  File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/code/nightwatch/master/py/nightwatch/script.py", line 56, in main
    main_run()
  File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/code/nightwatch/master/py/nightwatch/script.py", line 361, in main_run
    run.write_tables(args.outdir, args.outdir, expnights=[night,])
  File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/code/nightwatch/master/py/nightwatch/run.py", line 708, in write_tables
    web_tables.write_exposures_tables(indir, outdir, exposures, nights=expnights)
  File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/code/nightwatch/master/py/nightwatch/webpages/tables.py", line 205, in write_exposures_tables
    status = get_status(qadata, night)
  File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/code/nightwatch/master/py/nightwatch/qa/status.py", line 51, in get_status
    data = qadata['PER_AMP']
KeyError: 'PER_AMP'
sbailey commented 1 year ago

This smells like a NERSC transient I/O failure leaving behind corrupted files that are blocking reprocessing. To recover the exposure, it would likely work to completely remove any outputs from that exposure and try again.

If that is indeed the problem, it would be worth investigating why the failed processing left it in a not-easily-recoverable state. The normal philosophy is that all file writing should use an intermediate temp filename (which it looks like it is) and that any leftover tempfiles should be ignored when reprocessing (which it apparently isn't since it's still tripping on a tmp file from Feb 2).

jose-bermejo commented 1 year ago

I manually deleted the output and reprocessed this exposure as Stephen suggested and it seems to work.