desihub / specex

DESI spectrograph PSF fitting
BSD 3-Clause "New" or "Revised" License

floating point exception in specex.py #32

Closed: marcelo-alvarez closed this issue 4 years ago

marcelo-alvarez commented 4 years ago

When desi_compute_psf_mpi merges the per-bundle FITS files on a single node, it crashes at the line

i=np.where(other_psf_hdulist["PSF"].data["PARAM"]=="STATUS")[0][0]

in specex.py due to a floating point exception:

export INDIR=/global/cfs/cdirs/desi/spectro/redux/andes
srun -N 1 -n 20 -c 2 desi_compute_psf_mpi --input-image $INDIR/preproc/20200315/00055705/preproc-r0-00055705.fits --input-psf $INDIR/exposures/20200315/00055705/shifted-input-psf-r0-00055705.fits --output-psf $SCRATCH/desi/psf/fit-psf-r0-00055705.fits --broken-fibers 367
...
INFO:specex.py:242:main: HACK: taking a 20 sec pause before merging
INFO:specex.py:281:merge_psf: Will merge 20 PSFs in /global/cscratch1/sd/malvarez/desi/psf/fit-psf-r0-00055705.fits
INFO:specex.py:286:merge_psf: merging /global/cscratch1/sd/malvarez/desi/psf/fit-psf-r0-00055705_01.fits into /global/cscratch1/sd/malvarez/desi/psf/fit-psf-r0-00055705_00.fits
srun: error: nid00228: task 0: Floating point exception
srun: Terminating job step 34226453.3
slurmstepd: error: *** STEP 34226453.3 ON nid00228 CANCELLED AT 2020-09-10T12:53:49 ***
srun: error: nid00228: tasks 1-19: Terminated
srun: Force Terminated job step 34226453.3
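
For context, the line that crashes is a plain lookup of the row whose PARAM entry equals "STATUS". Below is a minimal sketch of what it computes, run on a made-up table. Only the PARAM column name and the "STATUS" entry come from the code above; the other column and all values are hypothetical.

```python
import numpy as np

# Hypothetical stand-in for other_psf_hdulist["PSF"].data: one row per parameter.
# Only "PARAM" and "STATUS" are taken from specex.py; everything else is invented.
psf_table = np.array(
    [("PARAM_A", 1.0), ("PARAM_B", 1.1), ("STATUS", 0.0)],
    dtype=[("PARAM", "U8"), ("VALUE", "f8")],
)

# The elementwise string comparison yields a boolean mask. np.where returns a
# tuple holding one index array, so [0][0] selects the first matching row.
mask = psf_table["PARAM"] == "STATUS"
i = np.where(mask)[0][0]
print(i)  # 2
```

The lookup is only a string comparison followed by an index search.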

Removing the call feenableexcept (FE_INVALID|FE_DIVBYZERO|FE_OVERFLOW) from specex_desi_main.cc prevents the crash, but it also changes the intended behaviour of the code: floating point errors during the fit would no longer halt execution.

The proposed solution is adding the line:

fedisableexcept (FE_INVALID|FE_DIVBYZERO|FE_OVERFLOW);

before

return EXIT_SUCCESS; 

in specex_desi_main.cc. Floating point exceptions then still halt execution inside the C++ code, as originally intended, but trapping is switched off again before control returns to specex.py, so the Python code no longer crashes.

Since the floating point exception is raised during a routine Python operation that otherwise succeeds, its root cause is probably not worth investigating further at this point. If @sbailey and @julienguy agree, we can make the change above and close this issue.

sbailey commented 4 years ago

Thanks for documenting this. I agree that adding fedisableexcept in specex_desi_main.cc just before the return sounds like the right solution. As a double check, please confirm that the temporary per-bundle files that are being merged do not have NaNs in them.
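
One possible way to run that check is sketched below. It assumes numpy and astropy are available; the helper names count_nonfinite and check_fits are made up for illustration and are not part of specex or desispec.

```python
import numpy as np

def count_nonfinite(data):
    """Count NaN/Inf entries in float data.

    Returns {column_name: count} for table data, or {"": count} for an image
    array. Columns or arrays without bad values are left out.
    """
    if data is None:  # e.g. an empty primary HDU
        return {}
    names = data.dtype.names
    if names is None:
        columns = {"": np.asarray(data)}
    else:
        columns = {name: np.asarray(data[name]) for name in names}
    bad = {}
    for name, col in columns.items():
        # Only floating point columns can hold NaN/Inf; string, integer and
        # object (variable-length) columns are skipped.
        if np.issubdtype(col.dtype, np.floating):
            n_bad = int(np.count_nonzero(~np.isfinite(col)))
            if n_bad:
                bad[name] = n_bad
    return bad

def check_fits(path):
    """Return {HDU name: count_nonfinite result} for every HDU in a FITS file."""
    from astropy.io import fits  # assumption: astropy is installed
    with fits.open(path) as hdul:
        return {hdu.name: count_nonfinite(hdu.data) for hdu in hdul}
```

Calling check_fits on each temporary file (fit-psf-r0-00055705_00.fits, fit-psf-r0-00055705_01.fits, and so on, as named in the log above) and getting only empty dictionaries back would indicate that no float column contains NaN or Inf.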