desihub / specex

DESI spectrograph PSF fitting
BSD 3-Clause "New" or "Revised" License

PSF writes file despite FATAL ERROR #56

Closed sbailey closed 2 years ago

sbailey commented 2 years ago

This command generates two "FATAL ERROR" messages and writes bundles 00 and 03 with 3x3 PSFs instead of 8x5 PSFs, before failing upon merging:

OUTDIR=/global/cfs/cdirs/desi/users/users/$USER/psftest
mkdir -p $OUTDIR
REDUXDIR=/global/cfs/cdirs/desi/spectro/redux/daily
export OMP_NUM_THREADS=3
srun -n 20 -c 3 desi_compute_psf_mpi \
   --input-image $REDUXDIR/preproc/20211005/00103082/preproc-z6-00103082.fits \
   --input-psf $REDUXDIR/exposures/20211005/00103082/shifted-input-psf-z6-00103082.fits \
   -o $OUTDIR/fit-psf-z6-00103082.fits &> $OUTDIR/psf-z6-00103082.log

In the logfile:

...
WARNING cholesky_solve failed with status 43
FATAL ERROR (other std) FitSeveralSpots failed for FLUX+TRACE (at line 2562 of file /tmp/pip-req-build-6u6z5d3_/src/specex_psf_fitter.cc)
...
WARNING problem with brent dchi2 = -1.38853e+07
FATAL ERROR (other std) FitSeveralSpots failed for FLUX+TRACE (at line 2562 of file /tmp/pip-req-build-6u6z5d3_/src/specex_psf_fitter.cc)
...
INFO:specex.py:364:merge_psf: Will merge 20 PSFs in /global/cfs/cdirs/desi/users/users/sjbailey/psftest/fit-psf-z6-00103082.fits
INFO:specex.py:369:merge_psf: merging /global/cfs/cdirs/desi/users/users/sjbailey/psftest/fit-psf-z6-00103082_01.fits into /global/cfs/cdirs/desi/users/users/sjbailey/psftest/fit-psf-z6-00103082_00.fits
srun: error: nid00015: task 0: Floating point exception
srun: launch/slurm: _step_signal: Terminating StepId=50602372.1
slurmstepd: error: *** STEP 50602372.1 ON nid00015 CANCELLED AT 2021-11-18T20:49:42 ***
srun: error: nid00015: tasks 1-19: Terminated
srun: Force Terminated StepId=50602372.1

Several problems / mysteries:

@marcelo-alvarez

marcelo-alvarez commented 2 years ago

Please see #57 for a solution to this (except for why the fit is failing in the first place, which others may be in a better position to investigate than I am). It turns out that #54 did not actually fix this problem when it came up last time, because my test did not include the catching of the "fatal error" exception (in fit_psf) that prevented it from being caught. See #53 for more details and the full context.
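
For readers following along, here is a minimal sketch of the failure mode described above: an exception that is caught and logged inside the fitting call never reaches the caller, so the caller keeps going and writes output. This is not the specex code. FatalFitError, fit_bundle, fit_psf_swallowing and fit_psf_propagating are hypothetical names used only for illustration, and the choice of bundles 0 and 3 simply mirrors the two bundles reported in this issue.

```python
class FatalFitError(RuntimeError):
    """Hypothetical stand-in for the "FATAL ERROR" reported by the fitter."""


def fit_bundle(bundle):
    # Hypothetical fit: pretend bundles 0 and 3 fail, as reported in this issue.
    if bundle in (0, 3):
        raise FatalFitError(f"FitSeveralSpots failed for bundle {bundle}")
    return f"psf-{bundle:02d}"


def fit_psf_swallowing(bundle):
    # The exception is caught and logged here, so the caller never sees it
    # and cannot tell that the fit failed unless it checks for None.
    try:
        return fit_bundle(bundle)
    except FatalFitError as err:
        print(f"FATAL ERROR {err}")
        return None


def fit_psf_propagating(bundle):
    # Log the failure, then re-raise so the caller can decide what to do
    # (for example, skip writing or merging the output).
    try:
        return fit_bundle(bundle)
    except FatalFitError as err:
        print(f"FATAL ERROR {err}")
        raise
```

With the first variant, a test that only wraps the outer call in try/except would pass without ever exercising the error path, which is one way a fix can appear to work when it does not.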

sbailey commented 2 years ago

Fixed in #57 (it still writes per-bundle output files, but provides the hooks to skip the merge into the final file).
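
As a rough illustration of what "hooks to skip the merge" could look like, the sketch below only merges when every bundle reports success. It is an assumption about the general pattern, not the change made in #57. merge_if_all_ok, the results layout and the merge callable are all hypothetical, and in an MPI run the per-bundle statuses would first have to be gathered on the rank that performs the merge.

```python
def merge_if_all_ok(results, merge):
    """Merge per-bundle files only if every bundle fit succeeded.

    results: dict mapping bundle number -> (path, ok), a hypothetical layout.
    merge:   callable taking the ordered list of paths (hypothetical).
    Returns (merged_result_or_None, sorted_list_of_failed_bundles).
    """
    failed = sorted(b for b, (_, ok) in results.items() if not ok)
    if failed:
        # Leave the per-bundle files in place for debugging; skip the merge.
        return None, failed
    paths = [results[b][0] for b in sorted(results)]
    return merge(paths), failed
```

A caller could then log the failed bundle numbers and exit with a non-zero status instead of attempting the merge.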