desihub / desispec

DESI spectral pipeline
BSD 3-Clause "New" or "Revised" License

zcatalog column for which spectrum is best for each TARGETID #1355

Closed: sbailey closed this issue 1 year ago

sbailey commented 3 years ago

For redshift catalogs where the same TARGETID appears more than once, add a column identifying which one is "best". Even for catalogs with only one entry per TARGETID, consider adding the column for format consistency.

Unfortunately we're not converging on getting this done for Everest, so will revisit for Fuji.

sbailey commented 3 years ago

Email notes from @djschlegel :

Here was the logic for the general-use case of “what’s the best spectrum” in the SDSS roll-up catalog (not for use in clustering analyses):

The SPECPRIMARY output element is used to select a unique set of objects in the case of duplicate observations. Any object observed multiple times will have SPECPRIMARY=1 for one instance only, and =0 for all other instances. The criteria, in order of importance, are as follows:

  1. Prefer observations with positive SN_MEDIAN in r-band.
  2. Prefer PLATEQUALITY='good' over any other plate quality.
  3. Prefer observations with ZWARNING=0.
  4. Prefer objects with larger SN_MEDIAN in r-band.

For DESI, we could retain the same SPECPRIMARY column name. Here are suggested criteria in order of importance:

  1. Prefer a good observation, where "good" means the same bits considered by MTL updates.
  2. Prefer observations with ZWARN=0.
  3. Prefer observations with higher TSNR, always using TSNR2_ELG; if we want to use different TSNRs, they would have to be chosen per target, e.g. TSNR2_BGS for targets that are explicitly BGS targets.
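
As a minimal sketch, these criteria amount to a lexicographic sort key. The field names below (is_good, zwarn, tsnr2) are illustrative stand-ins, not actual catalog columns:

```python
# A sketch of the proposed ranking, assuming hypothetical per-row fields
# is_good, zwarn, and tsnr2 (illustrative names only).
def rank_key(row):
    # Tuples compare element by element, encoding the priority order:
    # good observation first, then ZWARN == 0, then larger template S/N.
    return (row['is_good'], row['zwarn'] == 0, row['tsnr2'])

# Among the rows sharing one TARGETID, the primary is the maximum:
# best = max(rows_for_targetid, key=rank_key)
```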

I’ll note that I don’t currently see any sources appearing multiple times with different TARGETIDs in the file desi/spectro/redux/denali/zcatalog-denali-cumulative.fits. That will happen in the future with some secondary targets that are coincident with main survey targets.

stephjuneau commented 3 years ago

About combining redshift catalogs, or at least adding a SPECPRIMARY column: do others agree that we should do this either per-tile or per-healpix? I was first thinking it would be useful to generate both sets of combined redshift catalogs, but I can also see that it could introduce confusion if the column had the same name but corresponded to slightly different spectra. Also, in the email from David S copied above, he proposed using a single TSNR value. In light of the email thread [desi-data 5646] "EFFTIME_SPEC from TSNR2_LRG instead of TSNR2_ELG", should we (can we) use TSNR2_LRG for all target types?

andreufont commented 3 years ago

argh, I accidentally closed the issue, reopening now

andreufont commented 3 years ago

I'm a very clumsy typist; sorry, I accidentally closed the issue.

What I wanted to say is that I believe there are very few cases where someone would want a "TILE-PRIMARY" flag to identify the best tile within a survey. If you don't want to use the per-healpix coadds and instead want to look tile by tile (because you are studying quasar variability, say), you probably don't want to restrict yourself to a single favored tile.

However, I could imagine someone wanting a "SURVEY-PRIMARY" flag to identify which survey (sv1 / sv2 / sv3 / main, but also bright / dark) has the best observation of a particular target. In that case, I'd imagine that using some SNR to make the decision (or the decision logic described by David S) should be enough. The catalog I'm describing would not have columns like "NIGHT" or "TILEID", since these would make no sense (unless they were concatenated strings), but it would have the column SURVEY (or whatever keyword we use to distinguish sv1-dark from main-bright).

This would not be necessary if we had cross-survey coadds, but I could imagine how this could become a data model nightmare.

Ragadeepika-Pucha commented 3 years ago

@sbailey @djschlegel -- I am trying to create a master catalog of all SV with a SPECPRIMARY column. Initially, I was taking the primary spectrum to be the "dark" spectrum (in the case of dark vs. bright) and the SV1 spectrum (in the case of the same TARGETID appearing in both SV1 and SV3). However, @stephjuneau suggested that I follow the above-mentioned logic. I have a few questions regarding that:

  1. What are MTL updates? What are the bits used there? Are these different from the FIBERSTATUS?
  2. Why are we considering ZWARN before TSNR?
  3. Where can I find more information about TSNR2_*?

Thank you

djschlegel commented 3 years ago

The MTL updates now use information fully captured in the ZWARN bits, so this should simplify the logic. I'd suggest always selecting an observation with ZWARN=0 over an observation with ZWARN!=0, then selecting the observation with the largest value of TSNR2_LRG. The TSNR values are the predicted S/N for each fiber given the seeing, sky brightness, transparency, and fiber throughput. See DESI-6110 and DESI-6455.
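
A sketch of that two-step selection with plain numpy, assuming simple array columns (not the eventual desispec implementation):

```python
import numpy as np

def select_primary(targetid, zwarn, tsnr2_lrg):
    """Return a boolean mask flagging one best row per TARGETID.

    Ranks rows by (ZWARN == 0, TSNR2_LRG): any ZWARN=0 observation beats
    a flagged one, and ties are broken by the larger TSNR2_LRG.
    """
    targetid = np.asarray(targetid)
    zwarn = np.asarray(zwarn)
    tsnr2_lrg = np.asarray(tsnr2_lrg)
    # lexsort uses the last key as the primary key, so this groups rows by
    # TARGETID and puts the best row of each group last within the group.
    order = np.lexsort((tsnr2_lrg, zwarn == 0, targetid))
    # Mark the last sorted row of each TARGETID group.
    tid = targetid[order]
    last_of_group = np.r_[tid[1:] != tid[:-1], True]
    primary = np.zeros(len(targetid), dtype=bool)
    primary[order[last_of_group]] = True
    return primary
```

Because the sort uses the boolean ZWARN == 0 rather than the raw ZWARN value, two flagged spectra are compared by TSNR2_LRG alone rather than by their bit patterns.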

geordie666 commented 3 years ago

I agree with David in principle. You probably don't want to follow the exact choices MTL makes, but if it's useful to know the specific bits in ZWARN that MTL ignores, they are NODATA, BAD_SPECQA, and BAD_PETALQA.
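
For illustration, an "effectively clean" ZWARN check along those lines might look like the sketch below. The bit positions here are placeholders; the authoritative definitions live in desispec's ZWarningMask and should be looked up there rather than hard-coded:

```python
# Placeholder bit positions -- look these up in desispec's ZWarningMask
# rather than hard-coding them as done here.
NODATA = 2**9
BAD_SPECQA = 2**10
BAD_PETALQA = 2**11
MTL_IGNORED = NODATA | BAD_SPECQA | BAD_PETALQA

def effectively_clean(zwarn):
    """True if ZWARN is zero once the MTL-ignored bits are masked out."""
    return (zwarn & ~MTL_IGNORED) == 0
```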

sbailey commented 2 years ago

@stephjuneau @Ragadeepika-Pucha please open a pull request to add your specprimary selection function to a new file py/desispec/zcatalog.py. It should take a table as input and return a boolean array of which rows should be considered the best. Please document exactly which columns are required in the input table.

Detail: the input should support astropy Tables, numpy structured arrays, and optionally pandas DataFrames or even dicts, i.e. anything that supports syntax like zcat['ZWARN']. If you need features like astropy Table join syntax, explicitly cast to that type within the function rather than requiring it to be the input type.
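
A skeleton of that interface; the function name and required columns are illustrative, and the ranking is delegated to the select_primary sketch shown earlier in this thread:

```python
from astropy.table import Table

def select_primary_rows(zcat):
    """Return a boolean array marking the best row per TARGETID.

    Required columns (illustrative set): TARGETID, ZWARN, TSNR2_LRG.
    """
    # Accept an astropy Table, numpy structured array, or dict of columns;
    # cast once here instead of constraining the caller's input type.
    t = Table(zcat)
    # Rank rows per TARGETID as in the select_primary sketch above.
    return select_primary(t['TARGETID'], t['ZWARN'], t['TSNR2_LRG'])
```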

Followup from a discussion with @stephjuneau, I think we will want two flags (names TBD):

* SPECPRIMARY: the best spectrum of each TARGETID across the entire data release, regardless of survey or program.
* PROGPRIMARY: the best spectrum of each TARGETID within a given SURVEY+PROGRAM.

Broadly speaking, I think astronomy-oriented users interested in individual objects will want SPECPRIMARY from a single rollup catalog, while cosmology-oriented users will be more interested in PROGPRIMARY, because sample homogeneity is more important for those analyses. E.g. a BGS clustering analysis might not want some portion of the BGS targets treated differently just because they happened to pass LRG cuts and some subset of those got deeper DARK-time exposures.

Operationally, desi_zcatalog currently takes a list of tiles pre-split by SURVEY+PROGRAM, or a healpix/SURVEY/PROGRAM base directory, to make files like zpix-main-bright.fits, zpix-sv3-dark.fits, ztile-sv1-backup-cumulative.fits, sv3-dark-pernight.fits, etc. The challenge is that at the time it is making zpix-sv3-bright.fits, it hasn't read in all the dark tiles. That's fine for PROGPRIMARY, but doesn't work for SPECPRIMARY, since it doesn't know whether a better spectrum is out there. To implement SPECPRIMARY in the final files, I think this implies that we have to load all the tiles at the start, determine SPECPRIMARY, and only then filter by SURVEY and PROGRAM to select PROGPRIMARY and write out the per-SURVEY-PROGRAM catalogs.
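
To make that flow concrete, here is a rough two-pass sketch. read_zcat and all_survey_programs are hypothetical placeholders, and select_primary_rows is the interface sketched earlier; none of this is the actual desi_zcatalog code:

```python
from astropy.table import vstack

# Hypothetical loader returning one redshift catalog per SURVEY+PROGRAM.
cats = {(s, p): read_zcat(s, p) for (s, p) in all_survey_programs}

# Pass 1: SPECPRIMARY needs every catalog in memory at once.
full = vstack(list(cats.values()))
full['SPECPRIMARY'] = select_primary_rows(full)

# Pass 2: PROGPRIMARY only ever compares rows within one SURVEY+PROGRAM.
for (s, p) in cats:
    sub = full[(full['SURVEY'] == s) & (full['PROGRAM'] == p)]
    sub['PROGPRIMARY'] = select_primary_rows(sub)
    sub.write(f'zpix-{s}-{p}.fits', overwrite=True)
```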

Further complicating the situation: what about tile-based pernight vs. cumulative? Do we mean best spectrum within that particular grouping, or by fiat declare the cumulative is always better than pernight (even when they are actually the same) and thus never set SPECPRIMARY=True for pernight? Or something else?

This reminds me of the on-pause work of defining a unique SPECID, where TARGETID would identify the unique input target and SPECID would be a unique identifier for a particular processed spectrum, taking into account the many flavors of coadds (e.g. a cross-tile healpix coadd of a given TARGETID would get a different SPECID than a per-tile per-night coadd of that same TARGETID). Defining that turned out to be messy...

andreufont commented 2 years ago

Hi @sbailey - in terms of Lya BAO (KP6), I expect that none of this would matter much. We will always use the per-healpix coadds, and we will only use dark-main data.

The situation is different for ongoing analyses of SV Lya with Everest / Fuji, where we will want to combine sv1, sv2, sv3, and main. I do not expect we would ever use bright-time data, unless someone shows that it is worth the trouble (I doubt it). I do not expect anyone to want per-tile quasar catalogs, so I think we can focus on dark-time catalogs in the healpix data format.

In this sense, all we need is a function / logic to decide which of the per-survey coadds (dark-sv1, dark-sv2, dark-sv3, or dark-main) is best whenever a given target has been observed in more than one dark survey. This could be a column added to some of the files, but in our workflow we could also use a documented function instead (this might not be the case in Stephanie's workflow).

Ragadeepika-Pucha commented 2 years ago

I created the zcatalog.py file here: https://github.com/desihub/desispec/blob/raga_zcat/py/desispec/zcatalog.py (I have not created the pull request yet). The function find_primary_spectra takes a table as input and returns nspec and spec_primary, which give the number of spectra of each source and which of those spectra is the best. Ideally, it can be used for both the SPECPRIMARY and PROGPRIMARY flags described by @sbailey. Please test the function and let me know if any changes/improvements need to be made before I create the pull request. Thank you.
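
For reference, a usage sketch of that function, assuming the ranking defaults to TSNR2_LRG per the discussion above:

```python
from astropy.table import Table
from desispec.zcatalog import find_primary_spectra

zcat = Table()
zcat['TARGETID'] = [10, 10, 20]
zcat['ZWARN'] = [0, 0, 4]
zcat['TSNR2_LRG'] = [5.0, 50.0, 1.0]

nspec, spec_primary = find_primary_spectra(zcat)
# Expected: nspec = [2, 2, 1] and spec_primary = [False, True, True],
# i.e. the higher-TSNR2 spectrum of TARGETID 10 wins, and the single
# (flagged) spectrum of TARGETID 20 is primary by default.
```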

sbailey commented 2 years ago

@Ragadeepika-Pucha thanks for this work and apologies for the delayed reply. I tested that the code is fast enough. Please make the following minor updates and then proceed with a pull request:

Algorithmically:

Maintainability:

Ideally this would also come with a set of unit tests that create a fake zcat, pass it through find_primary_spectra, and confirm that you get the expected results for a variety of cases (a TARGETID appearing multiple times, targets with various values of ZWARN and TSNR2_LRG, etc.). See examples in py/desispec/test/test*.py, e.g. test_frame.py. This is primarily about protecting someone from accidentally breaking the code in the future, though it was the process of thinking about unit tests that made me realize that (I think...) this code would pick ZWARN=4, TSNR2=1 as better than ZWARN=8, TSNR2=100, when we'd rather have the TSNR2=100 target given that both have a ZWARN flag (the bits aren't ranked by badness). Unit tests are a good way to check cases like that.
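
A sketch of the kind of unit test described above, in the unittest style of py/desispec/test/test_*.py; it encodes the desired behavior rather than what the branch code necessarily does today:

```python
import unittest
from astropy.table import Table
from desispec.zcatalog import find_primary_spectra

class TestPrimarySpectra(unittest.TestCase):
    def test_zwarn_bits_not_ranked_by_value(self):
        # Two flagged spectra of the same target: the ZWARN bit values are
        # not a badness ranking, so the higher-TSNR2 spectrum should win.
        zcat = Table()
        zcat['TARGETID'] = [1, 1]
        zcat['ZWARN'] = [4, 8]
        zcat['TSNR2_LRG'] = [1.0, 100.0]
        nspec, primary = find_primary_spectra(zcat)
        self.assertTrue(primary[1])

if __name__ == '__main__':
    unittest.main()
```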

FYI @stephjuneau

sbailey commented 2 years ago

Update: PR #1609 provided desispec.zcatalog.find_primary_spectra. Now we need to actually use it when generating redshift catalogs. That is "easy" for flagging the best spectrum within a single catalog, but requires more work for the full vision of having the flag apply across all targets in a data release, since that requires assembling multiple zcatalogs at the same time (e.g. sv1, sv2, sv3 / bright, dark, ...).

sbailey commented 1 year ago

Implemented in desispec.zcatalog.create_summary_catalog. Closing ticket.