desihub / specex

DESI spectrograph PSF fitting
BSD 3-Clause "New" or "Revised" License
0 stars 4 forks source link

Use return value to signal a fatal error #57

Closed marcelo-alvarez closed 2 years ago

marcelo-alvarez commented 2 years ago

Modifies function run_specex to return the same value as fit_psf signaling fatal errors so that calling functions can handle them appropriately.

Currently, any call to SPECEX_ERROR during execution of fit_psf results in an exception that is caught in fit_psf, resulting in a non-zero return value of EXIT_FAILURE, but since run_specex does not use the return value, this information is lost.

Testing at NERSC with arc exposure 103082 (in which SPECEX_ERROR calls currently occur for camera z6) using the command

srun -N 1 -n 41 -c 1 desi_proc --traceshift --cameras z67 -n 20211005 -e 103082 --mpi

with this branch leads to the expected behaviour that the PSF for z6 is not written (while the PSF for z7 is) the error is reported, and the processes continue as normal. With the master branch, PSFs for both z6 and z7 are written, with erroneous results (e.g., inconsistent values of HSIZEX between bundles) in the z6 PSF.

This fixes #53 and subsequent silent PSF fitting failures not fixed by #54. See #53 for additional details. Please review and merge if appropriate.

julienguy commented 2 years ago

I am not sure I understand. In the code py/specex/specex.py , the PSF is still written if the return value is non zero.

sbailey commented 2 years ago

I'm still trying to understand by tests using this PR, but one comment:

specex reports messages like

FATAL ERROR (other std) FitSeveralSpots failed for FLUX+TRACE (at line 2562 of file /global/common/software/desi/users/sjbailey/specex/src/specex_psf_fitter.cc)

and the desispec wrapper reports

some bundles failed desi_psf_fit

but I'm having a hard time telling which bundles failed. The 0th order fix is not writing outputs that are wrong, but coming next will be having enough logging to figure out why it is failing and whether we need to update CCD masks or something to prevent it from failing in the first place.

marcelo-alvarez commented 2 years ago

I am not sure I understand. In the code py/specex/specex.py , the PSF is still written if the return value is non zero.

The bundle PSFs are written, which might be helpful e.g. for debugging after the fact (as they may contain additional information on top of what is reported in the logs and in the SPECEX_ERROR message). But the bundles are not merged, so no merged PSF is written.

julienguy commented 2 years ago

ok thanks. I am merging for now so we can run the prod.

julienguy commented 2 years ago

(I ran a successful test)