desihub / desispec

DESI spectral pipeline
BSD 3-Clause "New" or "Revised" License
36 stars 24 forks source link

Refactor read_gfa_data function #2252

Closed weaverba137 closed 4 months ago

weaverba137 commented 4 months ago

This PR closes #2250.

weaverba137 commented 4 months ago

@sbailey, please take a look at desispec.tilecompleteness.number_of_good_redrock. I don't think that function ever would have worked, in part due to an import error, but also it is not used by anything at all in desispec.

araichoor commented 4 months ago

sorry I m late here; I don t have time right now to dive in the details of the PR and the related issues, neither to dig into the data. if my comment is relevant, apologies in advance, just discard it!

my remark, with just reading the PR comment:

for the daily operations, it shouldn t be a problem, as earlier (including SV1, SV2) GFA info already is in the exposures-daily.csv file, so the code will just look for GFA info for the latest exposures, which will be in the offline_matched_coadd_ccds_SV3-thru_{NIGHT}.fits.

but for a production, e.g. jura, which will reprocess all exposures since SV1, it sounds to me that the code won t find the relevant GFA info files for pre-SV3 exposures.

I see those two "latest" files:

ls -l /global/cfs/cdirs/desi/survey/GFA_orig/offline_matched_coadd_ccds_SV1-thru_* | tail -n 1
-r--r--r-- 1 ameisner desi 53864640 Sep 29  2021 /global/cfs/cdirs/desi/survey/GFA_orig/offline_matched_coadd_ccds_SV1-thru_20210928.fits

ls -l /global/cfs/cdirs/desi/survey/GFA_orig/offline_matched_coadd_ccds_SV2-thru_* | tail -n 1
-r--r--r-- 1 ameisner desi 29888640 Sep 29  2021 /global/cfs/cdirs/desi/survey/GFA_orig/offline_matched_coadd_ccds_SV2-thru_20210928.fits

looks like the SV1 file contains all exposures from the SV2 file, ie we may just need /global/cfs/cdirs/desi/survey/GFA_orig/offline_matched_coadd_ccds_SV1-thru_20210928.fits for pre-SV3 exposures. as this exposure list static, I guess it s fine to hard-code this path (unless we do at some point a whole reprocessing of the GFA data).

weaverba137 commented 4 months ago

Do we know specifically how this was done for fuji, guadalupe and iron? In other words, did these read from /global/cfs/cdirs/desi/survey/GFA_orig or /global/cfs/cdirs/desi/survey/GFA?

sbailey commented 4 months ago

Iron used /global/cfs/cdirs/desi/survey/GFAon Feb 8 2023. That path is currently a symlink to GFA.NERSC, created on Feb 16 2024, i.e. created a year after Iron was run. GFA_orig is a symlink, also created on Feb 16 2024, to ../users/ameisner/GFA/conditions. I was expecting GFA_orig to point to what we used for Iron, but that ameisner/GFA/conditions directory only goes though offline_matched_coadd_ccds_SV3-thru_20211013.fits, which is different from what we cached in /global/cfs/cdirs/desi/public/dr1/survey/GFA/offline_matched_coadd_ccds_SV3-thru_20220613.fits.

i.e. I'm not sure what we actually used for iron and prior.

weaverba137 commented 4 months ago
========= ================================================= ===== ====== =================
Location  Filename                                          First Last   Comment
========= ================================================= ===== ====== =================
GFA_orig  offline_matched_coadd_ccds_SV1-thru_20210928.fits 67678 102108 Last SV1 file
GFA_orig  offline_matched_coadd_ccds_SV2-thru_20210928.fits 81831 102108 Last SV2 file
EDR       offline_matched_coadd_ccds_SV3-thru_20210927.fits 83522 101996 Official EDR file
DR1       offline_matched_coadd_ccds_SV3-thru_20220613.fits 83522 139653 Official DR1 file
GFA.NERSC offline_matched_coadd_ccds_SV3-thru_20240409.fits 83522 235204 Most recent file
========= ================================================= ===== ====== =================

Now for some set theory:

So EDR < DR1 < Most recent, in the set theory sense. But some exposures from the SV1 file are missing. The very slight difference in Last Expid between SV1 & SV3 may simply be a matter of the one night difference 20210928 versus 20210927.

Before we consider anything else, we may want to add offline_matched_coadd_ccds_SV1-thru_20210928.fits or offline_matched_coadd_ccds_SV1-thru_20210927.fits to EDR and DR1 (i.e. align the last exposures), so that we have GFA files that cover all exposures in those releases.

weaverba137 commented 4 months ago

A possible way to rewrite the function read_gfa_data() would be:

  1. Read the SV1 file. There will only ever be one of them at this point.
  2. Concatenate the most recent GFA summary file, ignoring any label, considering only the last night.
sbailey commented 4 months ago

Thanks @weaverba137 for digging into the files and the set theory. My understanding of the situation now:

Options:

weaverba137 commented 4 months ago

Re-add offline_matched_coadd_ccds_SV1-thru_20210928.fits immediately, then in this PR revert read_gfadata() to the "stack the latest SV1,SV2,SV3-thru????????.fits file" method, while keeping the additional unit tests of this PR (updated to match the logic if needed).

I think my proposal above is a simplified version of that. Would that be acceptable?

sbailey commented 4 months ago

Yes, I think that would be fine, with the inclusion of actually adding the SV1 file back to the GFA directory as well. This PR already has the code for sorting by night regardless of what future SURVEY prefixes may exist, so the only change is to also read the latest SV1 file as a special case and concatenate the two. Good.

weaverba137 commented 4 months ago

OK, I'll probably have to update that on Monday at this point, and I'll take care of adding in the SV1 file.

weaverba137 commented 4 months ago

I can also add some of your explanations above to the documentation.

weaverba137 commented 4 months ago

For the SV1 file: offline_matched_coadd_ccds_SV1-thru_20210405.fits would have the minimal overlap with any current file. The last EXPID is 83543, and the first expid in SV3 and later is 83522. However, given the intermingling of SV1 & SV3, perhaps we really need to take the last SV1 file, and just accept the overlap. Thoughts?

weaverba137 commented 4 months ago

Based on previous versions of this function, which would have been used for fuji and iron, the file offline_matched_coadd_ccds_SV1-thru_20210928.fits would have been selected, so we should retain that same file, even if there is overlap with later files.

weaverba137 commented 4 months ago

offline_matched_coadd_ccds_SV1-thru_20210928.fits is now visible in EDR, DR1 and the GFA directory for daily processing.

weaverba137 commented 4 months ago

I think this is ready for testing and review now.

weaverba137 commented 4 months ago

@sbailey, just a reminder to test this once we start getting new GFA data.

sbailey commented 4 months ago

Summarizing a chat with @weaverba137

To resolve this, I have made a new branch "jura" which we can use for any last minute Jura updates + tags, thus opening up "main" for daily ops again, including this PR. Until we are completely done with Jura, we will need to carefully consider each PR for whether it should be to main or jura. The basic workflow rule is that we can merge PR -> jura -> main, or PR -> main, but not PR -> main -> jura because that could bring in other changes on main that we don't want in jura tags.