desihub / specex

DESI spectrograph PSF fitting
BSD 3-Clause "New" or "Revised" License
0 stars 4 forks source link

Silent PSF fit fatal error #53

Closed julienguy closed 2 years ago

julienguy commented 3 years ago

Some (all?) fatal errors in the PSF fit are not reported as errors by the pipeline. Example: grep 'FATAL ERROR' /global/cfs/cdirs/desi/spectro/redux/daily/run/scripts/night/20210912/arc-20210912-00099806-a0123456789-46910109.log | nl shows 47 fatal errors but all the PSF are still saved so this is unnoticed ??

marcelo-alvarez commented 3 years ago

Note: the likely solution to this bug only requires modificaitons to specex, with no changes to desispec necessary. Line numbers below are for the tagged version of specex/0.7.0

Origin of the bug

SPECEX_ERROR() was called in the function FitEverything() on line 2884 of specex_psf_fitter.cc when FitSeveralSpots() returned false because cholesky_solve() returned status = 1:

ok = FitSeveralSpots(selected_spots,&chi2,&npix,&niter);
if(!ok) SPECEX_ERROR("FitSeveralSpots failed for FLUX+TRACE"); 

but an exception was not thrown and consequently it was neither trapped, translated nor raised in Python by the C++ exception handler included in pybind11.

Desired behavior

Likely solution

  1. Ensure that SPECEX_ERROR() throws an exception so it can be trapped, translated and raised in Python by the C++ exception handler included in pybind11, and subsequently caught upstream in desispec and logged.

Action planned

Test this solution to make sure it results in desired behaviour and open a specex pull request to merge the changes if so.

marcelo-alvarez commented 2 years ago

@julienguy @sbailey I have re-opened this issue because silent PSF fatal errors are still occurring.

The previous modifcation, #54, only partially fixed the problem by throwing an exception anytime SPECEX_ERROR is called. The test done to verify this fix (quoting from #54):

Verified this solution by hardcoding a call to SPECEX_ERROR when a specific camera string is in the arc preproc exposure image file name and verifying that the PSF for that camera is not written, an error is logged, the rest of the PSFs are written, and the job proceeds normally.

did not actually test the real-world situation because a return was added to the fit_psf before the end, where the exception thrown by SPECEX_ERROR would have been caught, preventing it from being caught in the calling functions, i.e. on the desispec side. This is now fixed in #57.