desihub / specex

DESI spectrograph PSF fitting
BSD 3-Clause "New" or "Revised" License

newly failing z7 PSFs #69

Closed sbailey closed 1 year ago

sbailey commented 1 year ago

The following PSFs failed the p1 test production (software environment 23.6 = specex/0.8.5 + desispec/0.59.1), but succeeded when they were initially run on daily:

20220913/00142170/fit-psf-z7-00142170.fits
20220922/00143900/fit-psf-z7-00143900.fits
20220924/00144161/fit-psf-z7-00144161.fits
20221004/00145881/fit-psf-z7-00145881.fits
20221013/00148067/fit-psf-z7-00148067.fits
20221016/00148718/fit-psf-z7-00148718.fits
20221020/00149258/fit-psf-z7-00149258.fits
20221021/00149387/fit-psf-z7-00149387.fits
20221021/00149389/fit-psf-z7-00149389.fits
20221031/00150967/fit-psf-z7-00150967.fits
20221031/00150969/fit-psf-z7-00150969.fits
20221105/00152007/fit-psf-z7-00152007.fits
20221110/00152780/fit-psf-z7-00152780.fits
20221111/00152933/fit-psf-z7-00152933.fits
20221115/00153520/fit-psf-z7-00153520.fits

It is suspicious that they are all z7. Although we expect some level of arc fitting failure, we don't expect daily successes to become failures in later productions.

From the logs:

ERROR:specex.py:199:main: desi_psf_fit on process 19 failed with return value 1 running desi_psf_fit -a /dvs_ro/cfs/cdirs/desi/spectro/redux/p1/preproc/20220913/00142170/preproc-z7-00142170.fits.gz --in-psf /dvs_ro/cfs/cdirs/desi/spectro/redux/p1/exposures/20220913/00142170/shifted-input-psf-z7-00142170.fits --out-psf /global/cfs/cdirs/desi/spectro/redux/p1/exposures/20220913/00142170/fit-psf-z7-00142170_19.fits --lamp-lines /global/common/software/desi/perlmutter/desiconda/20230111-2.1.0/code/specex/0.8.5/lib/python3.10/site-packages/specex/data/specex_linelist_desi.txt --first-bundle 19 --last-bundle 19 --first-fiber 475 --last-fiber 499 --legendre-deg-wave 3 --fit-continuum --broken-fibers 20,87,134,198,199,256,257,304,319,320,414

That output file /global/cfs/cdirs/desi/spectro/redux/p1/exposures/20220913/00142170/fit-psf-z7-00142170_19.fits exists, but rerunning the command with --out-psf $SCRATCH/fit-psf-z7-00142170_19.fits instead writes the file and then crashes with

INFO z7-00142170-19: trace fit: scaling down parameter step by 0.105085
FATAL ERROR (other std) ERROR z7-00142170-19: problem with brent dchi2 = -214.265 (at line 1583 of file /global/common/software/desi/perlmutter/desiconda/20230111-2.1.0/code/specex/specex-0.8.5/src/specex_psf_fitter.cc)
INFO z7-00142170-19: prepare XTRACE
INFO z7-00142170-19: write YTRACE
INFO z7-00142170-19: load PSF
INFO z7-00142170-19: loaded [x,y]trace and PSF

Possibly related: z7 is one of the CCDs that sometimes has wavelength solutions that extend slightly off the edge of the CCD (which led to EFFTIME_SPEC NaN in desihub/desispec#1978). If that's the case here, a slight consistency prior could help.

Grepping the logs for "using DARK" and "Using MASK", it appears that the same calibrations were used in both cases (daily vs. p1), so this might be a Cori vs. Perlmutter edge case rather than a change in the underlying calibrations (e.g. something no longer being masked).

@marcelo-alvarez please debug why these are newly failing and whether we can be more robust to these cases. Thanks.

Heads up @julienguy @akremin .

marcelo-alvarez commented 1 year ago

@sbailey this may not be ultimately relevant to the resolution of this issue but, for the record, the first three arc failures listed above used pre-#1818 dark models from DESI_SPECTRO_CALIB:

20220913 00142170
   daily -- /global/cfs/cdirs/desi/spectro/desi_spectro_calib/trunk/ccd/dark-sm8-z7-20211124.fits.gz
      p1 -- /global/cfs/cdirs/desi/spectro/desi_spectro_dark/v2209/dark_frames/dark-sm8-z7-20220913.fits.gz
20220922 00143900
   daily -- /global/cfs/cdirs/desi/spectro/desi_spectro_calib/trunk/ccd/dark-sm8-z7-20211124.fits.gz
      p1 -- /global/cfs/cdirs/desi/spectro/desi_spectro_dark/v2209/dark_frames/dark-sm8-z7-20220913.fits.gz
20220924 00144161
   daily -- /global/cfs/cdirs/desi/spectro/desi_spectro_calib/trunk/ccd/dark-sm8-z7-20211124.fits.gz
      p1 -- /global/cfs/cdirs/desi/spectro/desi_spectro_dark/v2209/dark_frames/dark-sm8-z7-20220913.fits.gz

since the changes in #1818 were merged on Oct 3 2022.

marcelo-alvarez commented 1 year ago

@sbailey @julienguy I ran desi_psf_fit for camera z7 of the arc exposures listed in the first comment above, using the main desi environment on Perlmutter but with the input files preproc-z7-EXPID.fits.gz and shifted-input-psf-z7-EXPID.fits each from either daily or p1. This is a total of 4 runs of desi_psf_fit for each exposure.

For example, when using the preproc from daily and the shifted-input-psf from p1, for exposure 153520, the commands run were:

source /global/common/software/desi/desi_environment.sh main
desi_psf_fit \
       -a /dvs_ro/cfs/cdirs/desi/spectro/redux/daily/preproc/20221115/00153520/preproc-z7-00153520.fits.gz \
 --in-psf /dvs_ro/cfs/cdirs/desi/spectro/redux/p1/exposures/20221115/00153520/shifted-input-psf-z7-00153520.fits \
 --out-psf /pscratch/sd/m/malvarez/pre_daily-psf_p1-fit-psf-z7-00153520_19.fits \
 --lamp-lines /global/common/software/desi/cori/desiconda/20211217-2.0.0/code/specex/main/py/specex/data/specex_linelist_desi.txt --first-bundle 19 --last-bundle 19 --first-fiber 475 --last-fiber 499 --legendre-deg-wave 3 --fit-continuum --broken-fibers 20,87,134,198,199,256,257,304,319,320,414

Here are the results:

   night     expid   preproc  inputpsf     error    status
========  ========  ========  ========  ========  ========
20220913  00142170     daily     daily         -         0
20220913  00142170        p1     daily         -         0
20220913  00142170     daily        p1         -         0
20220913  00142170        p1        p1     brent         1
========  ========  ========  ========  ========  ========
20220922  00143900     daily     daily         -         0
20220922  00143900        p1     daily         -         0
20220922  00143900     daily        p1  cholesky         1
20220922  00143900        p1        p1  cholesky         1
========  ========  ========  ========  ========  ========
20220924  00144161     daily     daily         -         0
20220924  00144161        p1     daily         -         0
20220924  00144161     daily        p1  cholesky         1
20220924  00144161        p1        p1  cholesky         1
========  ========  ========  ========  ========  ========
20221004  00145881     daily     daily         -         0
20221004  00145881        p1     daily         -         0
20221004  00145881     daily        p1  cholesky         1
20221004  00145881        p1        p1  cholesky         1
========  ========  ========  ========  ========  ========
20221013  00148067     daily     daily         -         0
20221013  00148067        p1     daily         -         0
20221013  00148067     daily        p1  cholesky         1
20221013  00148067        p1        p1  cholesky         1
========  ========  ========  ========  ========  ========
20221016  00148718     daily     daily         -         0
20221016  00148718        p1     daily         -         0
20221016  00148718     daily        p1  cholesky         1
20221016  00148718        p1        p1  cholesky         1
========  ========  ========  ========  ========  ========
20221020  00149258     daily     daily         -         0
20221020  00149258        p1     daily         -         0
20221020  00149258     daily        p1         -         0
20221020  00149258        p1        p1     brent         1
========  ========  ========  ========  ========  ========
20221021  00149387     daily     daily         -         0
20221021  00149387        p1     daily         -         0
20221021  00149387     daily        p1         -         0
20221021  00149387        p1        p1  cholesky         1
========  ========  ========  ========  ========  ========
20221021  00149389     daily     daily         -         0
20221021  00149389        p1     daily         -         0
20221021  00149389     daily        p1         -         0
20221021  00149389        p1        p1  cholesky         1
========  ========  ========  ========  ========  ========
20221031  00150967     daily     daily         -         0
20221031  00150967        p1     daily         -         0
20221031  00150967     daily        p1  cholesky         1
20221031  00150967        p1        p1  cholesky         1
========  ========  ========  ========  ========  ========
20221031  00150969     daily     daily         -         0
20221031  00150969        p1     daily         -         0
20221031  00150969     daily        p1         -         0
20221031  00150969        p1        p1  cholesky         1
========  ========  ========  ========  ========  ========
20221105  00152007     daily     daily         -         0
20221105  00152007        p1     daily         -         0
20221105  00152007     daily        p1  cholesky         1
20221105  00152007        p1        p1  cholesky         1
========  ========  ========  ========  ========  ========
20221110  00152780     daily     daily         -         0
20221110  00152780        p1     daily         -         0
20221110  00152780     daily        p1         -         0
20221110  00152780        p1        p1  cholesky         1
========  ========  ========  ========  ========  ========
20221111  00152933     daily     daily         -         0
20221111  00152933        p1     daily         -         0
20221111  00152933     daily        p1         -         0
20221111  00152933        p1        p1  cholesky         1
========  ========  ========  ========  ========  ========
20221115  00153520     daily     daily         -         0
20221115  00153520        p1     daily         -         0
20221115  00153520     daily        p1     brent         1
20221115  00153520        p1        p1  cholesky         1

The stdout and stderr corresponding to each row above is in

$CFS/desi/users/malvarez/testscripts/pre_[daily,p1]-psf_[daily,p1]-fit-psf-z7-[expid].log

Note that when the input psf from daily is used, specex never crashes. This indicates that the differences that are causing the specex failures in p1 and not daily for the arc exposures listed in the first comment above are very likely upstream from specex, in the processing that generated the input PSF to specex, and not in specex itself.
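The 2x2 test grid described above (preproc from daily or p1, crossed with input PSF from daily or p1) can be sketched as a small command generator. This is only an illustration: the function name `psf_fit_commands` and the `out_dir` default are hypothetical, and the remaining options from the example command (--lamp-lines, bundle and fiber ranges, --broken-fibers, and so on) are left out for brevity.

```python
from itertools import product

REDUX = "/dvs_ro/cfs/cdirs/desi/spectro/redux"

def psf_fit_commands(night, expid, camera="z7", out_dir="$SCRATCH"):
    """Return {(preproc_prod, psf_prod): command string} for the 2x2 grid.

    Hypothetical helper: it only builds the -a/--in-psf/--out-psf part of
    the desi_psf_fit command and does not run anything.
    """
    exp = f"{night}/{expid:08d}"
    commands = {}
    for pre, psf in product(("daily", "p1"), repeat=2):
        commands[(pre, psf)] = " ".join([
            "desi_psf_fit",
            f"-a {REDUX}/{pre}/preproc/{exp}/preproc-{camera}-{expid:08d}.fits.gz",
            f"--in-psf {REDUX}/{psf}/exposures/{exp}/shifted-input-psf-{camera}-{expid:08d}.fits",
            f"--out-psf {out_dir}/pre_{pre}-psf_{psf}-fit-psf-{camera}-{expid:08d}_19.fits",
        ])
    return commands

if __name__ == "__main__":
    for key, cmd in psf_fit_commands(20221115, 153520).items():
        print(key, cmd)
```

Keying the commands by (preproc, inputpsf) makes it easy to line the exit statuses up against the rows of the table above.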

julienguy commented 1 year ago

One can see easily with plot_fiber_traces -i /global/cfs/cdirs/desi/spectro/redux/p1/exposures/20220913/00142170/shifted-input-psf-z7-00142170.fits that fiber 3987 (#487) has a bad input trace so all PSF fits that use this input trace will fail. This happens because this fiber has a very low transmission. It is masked out as a bad fiber. The code knows about this and does not try to fit it. The issue is that this incorrect fiber trace overlaps with others and that's a problem.

The config from DESI_SPECTRO_CALIB is version V20220907, valid only from 20220907 to 20221120. Its input PSF is $DESI_SPECTRO_CALIB/spec/sm8/psf-sm8-z7-20220907-20220925.fits, which has the incorrect fiber trace.

The next calibration config, V20221121, has an updated PSF in which this has been fixed: $DESI_SPECTRO_CALIB/spec/sm8/psf-z7-20220907-average.fits. I suggest we simply update the default PSF in the config file for V20220907, because the optics are the same as for V20221121.

I made this change and updated DESI_SPECTRO_CALIB at NERSC. Can you please verify that this solves the problem?

marcelo-alvarez commented 1 year ago

@julienguy thanks for finding the solution. It would have taken me quite some time on my own to (re-)discover the usefulness of plot_fiber_traces to diagnose this problem.

I have verified that desi_psf_fit succeeds when using the shifted-input-psf file output from desi_compute_trace_shifts when psf-z7-20220907-average.fits is provided as an input to desi_compute_trace_shifts (config=current in the table below), but fails when psf-sm8-z7-20220907-20220925.fits is provided instead (config=pre-Jul7 in the table below).

     night      expid    preproc     config      error     status
=================================================================
  20220913   00142170         p1   pre-Jul7      brent          1
  20220913   00142170         p1    current          -          0
-----------------------------------------------------------------
  20220922   00143900         p1   pre-Jul7   cholesky          1
  20220922   00143900         p1    current          -          0
-----------------------------------------------------------------
  20220924   00144161         p1   pre-Jul7   cholesky          1
  20220924   00144161         p1    current          -          0
-----------------------------------------------------------------
  20221004   00145881         p1   pre-Jul7   cholesky          1
  20221004   00145881         p1    current          -          0
-----------------------------------------------------------------
  20221013   00148067         p1   pre-Jul7   cholesky          1
  20221013   00148067         p1    current          -          0
-----------------------------------------------------------------
  20221016   00148718         p1   pre-Jul7   cholesky          1
  20221016   00148718         p1    current          -          0
-----------------------------------------------------------------
  20221020   00149258         p1   pre-Jul7      brent          1
  20221020   00149258         p1    current          -          0
-----------------------------------------------------------------
  20221021   00149387         p1   pre-Jul7   cholesky          1
  20221021   00149387         p1    current          -          0
-----------------------------------------------------------------
  20221021   00149389         p1   pre-Jul7   cholesky          1
  20221021   00149389         p1    current          -          0
-----------------------------------------------------------------
  20221031   00150967         p1   pre-Jul7   cholesky          1
  20221031   00150967         p1    current          -          0
-----------------------------------------------------------------
  20221031   00150969         p1   pre-Jul7   cholesky          1
  20221031   00150969         p1    current          -          0
-----------------------------------------------------------------
  20221105   00152007         p1   pre-Jul7   cholesky          1
  20221105   00152007         p1    current          -          0
-----------------------------------------------------------------
  20221110   00152780         p1   pre-Jul7   cholesky          1
  20221110   00152780         p1    current          -          0
-----------------------------------------------------------------
  20221111   00152933         p1   pre-Jul7   cholesky          1
  20221111   00152933         p1    current          -          0
-----------------------------------------------------------------
  20221115   00153520         p1   pre-Jul7   cholesky          1
  20221115   00153520         p1    current          -          0
-----------------------------------------------------------------

For example, first running

source /global/common/software/desi/desi_environment.sh main
desi_compute_trace_shifts \
                      -i /dvs_ro/cfs/cdirs/desi/spectro/redux/p1/preproc/20221115/00153520/preproc-z7-00153520.fits.gz \
                   --psf /dvs_ro/cfs/cdirs/desi/spectro/desi_spectro_calib/0.4.0/spec/sm8/psf-z7-20220907-average.fits \
                --outpsf $SCRATCH/shifted-input-psf.fits \
                 --degxx 0 \
                 --degxy 0 \
                 --degyx 0 \
                 --degyy 0 \
             --arc-lamps

and then

desi_psf_fit        -a /dvs_ro/cfs/cdirs/desi/spectro/redux/p1/preproc/20221115/00153520/preproc-z7-00153520.fits.gz \
                --in-psf $SCRATCH/shifted-input-psf.fits \
               --out-psf $SCRATCH/fit-psf.fits \
            --lamp-lines /global/common/software/desi/perlmutter/desiconda/20230111-2.1.0/code/specex/0.8.5/lib/python3.10/site-packages/specex/data/specex_linelist_desi.txt \
          --first-bundle 19 \
           --last-bundle 19 \
           --first-fiber 475 \
            --last-fiber 499 \
     --legendre-deg-wave 3 --fit-continuum --broken-fibers 20,87,134,198,199,256,257,304,319,320,414

results in desi_psf_fit completing successfully and returning status=0 (which can be verified by running echo $? immediately following desi_psf_fit and corresponds to the row with preproc=p1 and config=current in the table above), but first running

source /global/common/software/desi/desi_environment.sh main
desi_compute_trace_shifts \
                      -i /dvs_ro/cfs/cdirs/desi/spectro/redux/p1/preproc/20221115/00153520/preproc-z7-00153520.fits.gz \
                   --psf /dvs_ro/cfs/cdirs/desi/spectro/desi_spectro_calib/0.4.0/spec/sm8/psf-sm8-z7-20220907-20220925.fits \
                --outpsf $SCRATCH/shifted-input-psf.fits \
                 --degxx 0 \
                 --degxy 0 \
                 --degyx 0 \
                 --degyy 0 \
             --arc-lamps

results in desi_psf_fit reporting a FATAL ERROR instead and returning status=1 (which can be verified by running echo $? immediately following desi_psf_fit and corresponds to the row with preproc=p1 and config=pre-Jul7 in the table above).
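The status and error columns in the table could be collected with a small wrapper along these lines. It is a rough sketch, not the script actually used: `run_and_classify` is a hypothetical name, and it assumes the failure output contains the word "brent" or "cholesky" (only the brent message is quoted in this thread; the cholesky wording is an assumption).

```python
import subprocess

def run_and_classify(cmd):
    """Run cmd (a list of arguments); return (exit status, error label).

    The label is "-" on success, "brent" or "cholesky" if that word appears
    anywhere in stdout/stderr of a failed run, and "other" otherwise.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode == 0:
        return 0, "-"
    output = (proc.stdout + proc.stderr).lower()
    for label in ("brent", "cholesky"):
        if label in output:
            return proc.returncode, label
    return proc.returncode, "other"
```

The returned status is the same value that echo $? would report immediately after the command.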

@sbailey currently for software environment 23.6

DESI_SPECTRO_CALIB=/global/cfs/cdirs/desi/spectro/desi_spectro_calib/0.5.0

and thus if 23.6 continued to be used as-is for p1 then this issue would be resolved by updating the configuration for z7 over the time interval from 20220907 to 20221120 in /global/cfs/cdirs/desi/spectro/desi_spectro_calib/0.5.0 in the same way that @julienguy has already done for /global/cfs/cdirs/desi/spectro/desi_spectro_calib/trunk, as described in the comment preceding this one.

Generally, in order for these PSFs to succeed for any future processing of p1, the update to the configuration for z7 over the time interval from 20220907 to 20221120 made in

/global/cfs/cdirs/desi/spectro/desi_spectro_calib/trunk

would have to be reflected in the particular location pointed to by DESI_SPECTRO_CALIB at the time of processing.

marcelo-alvarez commented 1 year ago

In case it helps, this is why these arcs currently fail for p1 using 23.6 with desi_spectro_calib/0.5.0 while they would succeed (as of the changes @julienguy made earlier today) for desi_spectro_calib/trunk instead:

% diff /global/cfs/cdirs/desi/spectro/desi_spectro_calib/0.5.0/spec/sm8/sm8-z.yaml \
       /global/cfs/cdirs/desi/spectro/desi_spectro_calib/trunk/spec/sm8/sm8-z.yaml
140c140
<   PSF: spec/sm8/psf-sm8-z7-20220907-20220925.fits
---
>   PSF: spec/sm8/psf-z7-20220907-average.fits

marcelo-alvarez commented 1 year ago

> One can see easily with plot_fiber_traces -i /global/cfs/cdirs/desi/spectro/redux/p1/exposures/20220913/00142170/shifted-input-psf-z7-00142170.fits that fiber 3987 (#487) has a bad input trace so all PSF fits that use this input trace will fail. This happens because this fiber has a very low transmission. It is masked out as a bad fiber. The code knows about this and does not try to fit it. The issue is that this incorrect fiber trace overlaps with others and that's a problem.

I don't see fiber 3987 listed in any of the entries in

/global/cfs/cdirs/desi/spectro/desi_spectro_calib/trunk/spec/sm8/sm8-z.yaml

For V20220907 (valid from 20220907 to 20221120), the fibers listed are:

  BROKENFIBERS: 3520,3587,3634,3914
  BADCOLUMNFIBERS: 3698,3699,3756,3757,3804,3819,3820

and these correspond to those passed to specex in the p1 production, e.g.

% grep desi_psf /global/cfs/cdirs/desi/spectro/redux/p1/run/scripts/night/20220922/arc-20220922-00143901-a0123456789-10557679.log | grep psf\-z7 | grep "last-bundle 19" | awk '{print $(NF-1)" "$NF}'
--broken-fibers 20,87,134,198,199,256,257,304,319,320,414

The output does not contain 487 in the list of broken fibers.
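The correspondence between the global fiber numbers in the YAML file and the local numbers passed to --broken-fibers can be checked with a short sketch. It assumes 500 fibers per spectrograph, so that the local index is the global fiber number modulo 500; that assumption is consistent with fiber 3987 being written as #487 above, but it is not taken from the specex or desispec code.

```python
FIBERS_PER_SPECTROGRAPH = 500  # assumption, consistent with 3987 -> #487

def local_fiber(global_fiber):
    """Map a global fiber number to its index within one spectrograph."""
    return global_fiber % FIBERS_PER_SPECTROGRAPH

# Values copied from the V20220907 entry quoted above.
broken_fibers = [3520, 3587, 3634, 3914]
bad_column_fibers = [3698, 3699, 3756, 3757, 3804, 3819, 3820]

passed = sorted(local_fiber(f) for f in broken_fibers + bad_column_fibers)
print(",".join(str(f) for f in passed))
# prints 20,87,134,198,199,256,257,304,319,320,414

print(local_fiber(3987) in passed)
# prints False
```

The merged, sorted list reproduces the --broken-fibers argument from the p1 log, and 487 is not in it.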

@sbailey @julienguy is there some way other than the --broken-fibers option by which specex knows that fiber 3987 (#487) is a bad fiber, or is this information external to specex?

julienguy commented 1 year ago

Looking at the daily fiberflat for z7, several fibers lost transmission during the 2021 summer shutdown. See for instance:

ds9 $DESI_SPECTRO_REDUX/iron/calibnight/{20210709,20210917}/fiberflatnight-z7-*.fits

(You can see the same variation in b and r, so it's a fiber issue, not a CCD issue.) Fiber 487 is the worst, with a transmission of ~0.1 after the shutdown, and should be added to the list of 'broken' fibers even though it is not broken.

I also see fibers with a sharp reduction in transmission in other cameras!

julienguy commented 1 year ago

Actually this had been studied and reported at the time: see https://desi.lbl.gov/DocDB/cgi-bin/private/ShowDocument?docid=6460 . I am adding fiber 3987 to the list of 'broken' fibers because of its low transmission (0.1 of the average).

sbailey commented 1 year ago

I believe this ticket has been addressed through a combination of updated calibration files and PR #70. The originally reported failures have been regenerated. Closing.