desihub / desispec

DESI spectral pipeline
BSD 3-Clause "New" or "Revised" License
36 stars 24 forks source link

exposures-jura contains some NaN values #2333

Closed weaverba137 closed 2 months ago

weaverba137 commented 2 months ago

While working on #2251, we discovered that the top-level exposures-jura files, specifically the exposure tables in those files contained some NaN values.

This list of exposures should be all of those that contain NaN. You can see that some of the columns are masked.

 NIGHT   EXPID  TILEID  TILERA TILEDEC ...   SKY_MAG_Z_SPEC        EFFTIME_GFA          EFFTIME_DARK_GFA    EFFTIME_BRIGHT_GFA EFFTIME_BACKUP_GFA
 int32   int32  int32  float64 float64 ...      float64              float64                float64              float64            float64      
-------- ------ ------ ------- ------- ... ------------------ ---------------------- ---------------------- ------------------ ------------------
20210202  74307  80679   111.0    41.5 ... 19.993767541215583                     --                     --                 --                 --
20210308  79769  80731  192.86   27.13 ... 19.575426801000127                     --                     --                 --                 --
20210404  83420  80697   145.0  32.375 ...  19.65692757225773 1.0075802546981036e-05 1.0075802546981036e-05                 --                 --
20211125 110852  42262   5.991   4.792 ... 19.157658679585744                     --                     --                 --                 --
20230815 190752  40203 320.573  26.421 ... 19.386191940942705                     --                     --                 --                 --

This is the list of columns in exposures-jura that contain NaN values:

TRANSPARENCY_GFA
SEEING_GFA
FIBER_FRACFLUX_GFA
FIBER_FRACFLUX_ELG_GFA
FIBER_FRACFLUX_BGS_GFA
FIBERFAC_GFA
FIBERFAC_ELG_GFA
FIBERFAC_BGS_GFA
EFFTIME_GFA
EFFTIME_DARK_GFA
EFFTIME_BRIGHT_GFA
EFFTIME_BACKUP_GFA

The origin of these invalid values should be investigated at high priority in preparation for kibo. As part of the investigation, we need to look into the tsnr_afterburner to see where these columns come from.

We also need to check the FRAMES table for invalid values (part of exposures-jura.fits).

Attn: @sbailey, @araichoor, @akremin et al..

araichoor commented 2 months ago

thanks for initiating that. some complementary informations:

in short: these five exposures are poor-quality data from sv1 or mainbackup), where the gfa pipeline returns NaN's for some reason. I ve not investigated why, not sure it s worth it. can t we leave NaN's? if not, we could patch with values like -99, which clearly is a non-physical value for those quantities?

in details:

about the expids: these five exposures are poor-quality data; from jura:

EXPID   NIGHT    FAFLAVOR       EXPTIME            EFFTIME_SPEC    
------ -------- ---------- ------------------ ---------------------
 74307 20210202  sv1elgqso    566.04931640625    1.3787630796432495
 79769 20210308     sv1ssv 300.06988525390625 5.455439168144949e-05
 83420 20210404     sv1elg   80.4530029296875 2.655843309185002e-05
110852 20211125 mainbackup  601.6654663085938    17.180953979492188
190752 20230815 mainbackup  603.0149536132812    0.6969366669654846

about the columns: those all are gfa columns. I believe that these columns are propagated from reading the gfa files (here code version 0.63.0, the one from jura): https://github.com/desihub/desispec/blob/9101ccc361d431625aca833336bb835a6292d881/bin/desi_tsnr_afterburner#L1115-L1154 the EFFTIME_* columns come from some computation based on those columns, hence getting NaN's: https://github.com/desihub/desispec/blob/0.63.0/py/desispec/efftime.py#L13

for completeness, reading the gfa_table with the same routine, and looking for those columns for those expids:

EXPID   NIGHT       TRANSPARENCY         FWHM_ASEC         FIBER_FRACFLUX    FIBER_FRACFLUX_ELG  FIBER_FRACFLUX_BGS         FIBERFAC            FIBERFAC_ELG          FIBERFAC_BGS     
------ -------- -------------------- ------------------ ------------------- ------------------- -------------------- --------------------- --------------------- ----------------------
 74307 20210202 0.059126717922222294 0.9789517068362219  0.6320627694913908 0.44970044302469636  0.20504407421980816                   nan                   nan                    nan
 79769 20210308                  nan 1.8761397170576326  0.2613052900585784 0.21108856454109742  0.10411062155477979                   nan                   nan                    nan
 83420 20210404                  nan                nan                 nan                 nan                  nan 0.0007895831162312318 0.0006804336788210475 0.00030002771194605096
110852 20211125   0.8780836223539111   3.20778365614237 0.11836103479224255 0.10716889904754712 0.057902263978654445                   nan                   nan                    nan
190752 20230815  0.04454787890451631 1.0542636639223057   0.572109104821019  0.4136182060095519  0.19035298877013784                   nan                   nan                    nan
araichoor commented 2 months ago

and about the FRAMES extension of exposures-jura.fits: I don t find any NaN's in there.

I did the following check:

d = Table(fitsio.read("/global/cfs/cdirs/desi/spectro/redux/jura/exposures-jura.fits", "FRAMES"))
for key in d.colnames:
     if isinstance(d[key][0], str):
        continue
     sel = ~np.isfinite(d[key])
     if sel.sum() > 0:
        print("{}\t{}".format(key, d["EXPID"][sel].tolist()))
sbailey commented 2 months ago

IIUC all NaN values come from the GFA pipeline when its outputs are merged into to the exposures table. I don't think any of these are coming from the spectro pipeline itself. Options:

  1. fix upstream and rerun so that the GFA pipeline doesn't produce NaNs. This is best if possible, but might not be viable for either technical or timescale reasons for Kibo.
  2. Intercept in the tsnr_afterburner and replace with dummy values (e.g. -99); be careful to also update any exposures -> tiles summary statistics to filter out these dummy values so that they don't mess up e.g. sum(EFFTIME_GFA).
  3. Leave the NaNs as is in the files, and intercept and replace with dummy values when loading the DB; arguably a NaN is a valid value for missing data.

In practice these GFA quantities are mostly informative and used for operations debugging since the GFAs are used for the exposure time calculator. i.e. if we're got a quick and solid fix, ok, but I don't think this should delay Kibo. Also, if we confirm that all the NaNs are coming from GFA pipeline inputs being merged into the exposures table, that could be fixed and rerun while Kibo is running, since the exposures table isn't generated until after all of the tile-based processing is done.

weaverba137 commented 2 months ago

Thank you for contributing to this discussion. This is getting interesting. Here are some preliminary findings:

  1. For patching these values in jura, the best value to use is zero (0.0). Why?

    • The NaNs are masked and thus the sum over them treats them as zero. This was probably not the intended behavior of that code snippet, but that is what actually happened in jura, so we should just use that value.
    • Zero is perfectly valid for loading the database.
  2. What are we actually patching?

    • The idea was to patch missing values in daily with values in jura, then it was discovered that some values in jura are bad.
    • I do not propose changing any existing files in jura, only applying a patch to daily.
  3. Why didn't we find this before?

    • The exposures listed above appear in both fuji and iron. However, in both fuji and iron the GFA quantities associated with all of the exposures for the corresponding tiles are set to zero (another reason zero is a good patch value!).
    • Somehow the interpretation of the GFA files changed between iron and jura. The GFA file itself is the same: desi/survey/GFA/offline_matched_coadd_ccds_SV1-thru_20210928.fits which was frozen in fuji. There are definitely NaN values present in that file. Somehow, the interpretation of bad GFA values changed between desispec/0.56.5 (iron) and desispec/0.63.x (jura).
  4. What should we do for kibo?

    • Determine what changed in interpreting GFA data between iron and jura and decide (in progress). I vote for replacing NaN with zero explicitly--with logging and documentation--in kibo.
    • Issue a warning in compute_tile_completeness_table() when it sums over masked values, so we can at least know that is happening.
  5. What else?

    • desi_tiles_completeness effectively does the same thing as desi_tsnr_afterburner --tile-completeness, but does not share any code. These could diverge in the future. It is not obvious why desi_tiles_completeness is even needed. That should be a separate ticket, of course, and is not necessarily needed for kibo.
    • The processing of desi_tsnr_afterburner should be standardized. The outputs and log files are literally all over the place, which is not good practice.
weaverba137 commented 2 months ago

Follow up results:

araichoor commented 2 months ago

thanks for the investigation. if useful, a minor precision about the sv1, sv3 gfa files: the namings are a bit misleading; I believe that the sv3 file contains all exposures taken since the start of sv3 (20210405), which do include some sv1 exposures (approx. half of the sv1 exposures). actually the offline_matched_coadd_ccds_SV3-thru_20240409.fits file probably contains all exposures since 20210405.

>>> d1 = Table(fitsio.read("/global/cfs/cdirs/desi/survey/GFA/offline_matched_coadd_ccds_SV1-thru_20210928.fits", 2))
>>> d3 = Table(fitsio.read("/global/cfs/cdirs/desi/survey/GFA/offline_matched_coadd_ccds_SV3-thru_20240409.fits", 2))

>>> d1["NIGHT"].min(), d1["NIGHT"].max()
(20201214, 20210928)
>>> d3["NIGHT"].min(), d3["NIGHT"].max()
(20210405, 20240409)

>>> np.in1d(d1["EXPID"], d3["EXPID"]).mean()
0.48369036027263873
>>> np.all(~np.in1d(d1["EXPID"][d1["NIGHT"] < d3["NIGHT"].min()], d3["EXPID"]))
True
>>> np.all(np.in1d(d1["EXPID"][d1["NIGHT"] >= d3["NIGHT"].min()], d3["EXPID"]))
True
weaverba137 commented 2 months ago

OK, I should have said "many early" SV1 exposures rather than "all" SV1 exposures, but the important thing here is that two GFA summary files must be read for completeness, and one of them will continue to contain the label "SV1".

weaverba137 commented 2 months ago

We also now know the reason for the difference between iron & jura:

When iron was run, it used the directory that is now called desi/survey/GFA.KPNO, but at the time, it was just desi/survey/GFA. That directory did not (and does not) contain a copy of a SV1 GFA summary file, so all of the exposures that appear only in the SV1-labelled GFA summary files had their values set to zero. This had nothing to do with the presence of NaN in GFA summary files at all. The necessary file just wasn't there. This also happened for fuji.

jura used desi/survey/GFA.NERSC (which is the current desi/survey/GFA via a symlink), which does contain a SV1 GFA summary file, so we then "discovered" the NaNs for exposures that had previously just been set to zero.