Open sybenzvi opened 1 year ago
Attempted to process this exposure on cori using
nightwatch run --outdir $CFS/desi/spectro/nightwatch/nersc -n 20230201 -e 165869
and encountered the following errors:
07:30 Running QA analysis
here qa.runner /global/cfs/cdirs/desi/spectro/nightwatch/nersc/20230201/00165869
INFO:runner.py:89:run: 07:30 Running QAAmp PER_AMP
WARNING:runner.py:94:run: QAAmp failed on /global/cfs/cdirs/desi/spectro/nightwatch/nersc/20230201/00165869 because FITSIO status = 107: tried to move past end of file
ffopen could not interpret primary array header of file:
/global/cfs/cdirs/desi/spectro/nightwatch/nersc/20230201/00165869/preproc-z6-001
65869_tmp8978.fits
; skipping
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/conda/lib/python3.9/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/conda/lib/python3.9/multiprocessing/pool.py", line 51, in starmapstar
return list(itertools.starmap(args[0], args[1]))
File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/code/nightwatch/master/py/nightwatch/qa/amp.py", line 98, in get_dico
hdr = fitsio.read_header(filename, 'IMAGE') #- for readnoise, bias
File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/conda/lib/python3.9/site-packages/fitsio/fitslib.py", line 198, in read_header
_fits = _fitsio_wrap.FITS(filename, READONLY, dont_create)
OSError: FITSIO status = 107: tried to move past end of file
ffopen could not interpret primary array header of file:
/global/cfs/cdirs/desi/spectro/nightwatch/nersc/20230201/00165869/preproc-z6-001
65869_tmp8978.fits
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/code/nightwatch/master/py/nightwatch/qa/runner.py", line 92, in run
qa_results = qa.run(indir)
File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/code/nightwatch/master/py/nightwatch/qa/amp.py", line 75, in run
results = pool.starmap(get_dico, argslist)
File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/conda/lib/python3.9/multiprocessing/pool.py", line 372, in starmap
return self._map_async(func, iterable, starmapstar, chunksize).get()
File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/conda/lib/python3.9/multiprocessing/pool.py", line 771, in get
raise self._value
File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/conda/lib/python3.9/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/conda/lib/python3.9/multiprocessing/pool.py", line 51, in starmapstar
return list(itertools.starmap(args[0], args[1]))
File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/code/nightwatch/master/py/nightwatch/qa/amp.py", line 98, in get_dico
hdr = fitsio.read_header(filename, 'IMAGE') #- for readnoise, bias
File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/conda/lib/python3.9/site-packages/fitsio/fitslib.py", line 198, in read_header
_fits = _fitsio_wrap.FITS(filename, READONLY, dont_create)
OSError: FITSIO status = 107: tried to move past end of file
ffopen could not interpret primary array header of file:
/global/cfs/cdirs/desi/spectro/nightwatch/nersc/20230201/00165869/preproc-z6-001
65869_tmp8978.fits
Then during updates of the exposure tables, encountered an additional error as a consequence of the previous failed update of the QA tables:
07:31 Updating night/exposure summary tables
INFO:run.py:638:write_tables: Tabulating exposures in /global/cfs/cdirs/desi/spectro/nightwatch/nersc
...
exptime=0, chose threshold file /global/common/software/desi/cori/desiconda/20211217-2.0.0/code/nightwatch/master/py/nightwatch/threshold_files/DX-20220129.json
exptime=0, chose threshold file /global/common/software/desi/cori/desiconda/20211217-2.0.0/code/nightwatch/master/py/nightwatch/threshold_files/DY-20220129.json
exptime=0, chose threshold file /global/common/software/desi/cori/desiconda/20211217-2.0.0/code/nightwatch/master/py/nightwatch/threshold_files/XSIG-20190718.json
exptime=0, chose threshold file /global/common/software/desi/cori/desiconda/20211217-2.0.0/code/nightwatch/master/py/nightwatch/threshold_files/YSIG-20190718.json
Traceback (most recent call last):
File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/code/nightwatch/master/bin/nightwatch", line 6, in <module>
sys.exit(nightwatch.script.main())
File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/code/nightwatch/master/py/nightwatch/script.py", line 56, in main
main_run()
File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/code/nightwatch/master/py/nightwatch/script.py", line 361, in main_run
run.write_tables(args.outdir, args.outdir, expnights=[night,])
File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/code/nightwatch/master/py/nightwatch/run.py", line 708, in write_tables
web_tables.write_exposures_tables(indir, outdir, exposures, nights=expnights)
File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/code/nightwatch/master/py/nightwatch/webpages/tables.py", line 205, in write_exposures_tables
status = get_status(qadata, night)
File "/global/common/software/desi/cori/desiconda/20211217-2.0.0/code/nightwatch/master/py/nightwatch/qa/status.py", line 51, in get_status
data = qadata['PER_AMP']
KeyError: 'PER_AMP'
This smells like a NERSC transient I/O failure leaving behind corrupted files that are blocking reprocessing. To recover the exposure, it would likely work to completely remove any outputs from that exposure and try again.
If that is indeed the problem, it would be worth investigating why the failed processing left it in a not-easily-recoverable state. The normal philosophy is that all file writing should use an intermediate temp filename (which it looks like it is) and that any leftover tempfiles should be ignored when reprocessing (which it apparently isn't since it's still tripping on a tmp file from Feb 2).
I manually deleted the output and reprocessed this exposure as Stephen suggested and it seems to work.
Science exposure 165869 failed to process at NERSC on 20230201 and needs to be processed manually.