Target selection for the first MWS Stream Program (GD1)

geordie666 commented 6 months ago

This PR adds code for the first MWS stream program (GD1). It builds on the work provided by @segasai's as part of Issue #812.

As a side-effort, this PR also updates any units that are written to files produced by desitarget to be FITS-compliant.

This is still work in progress, but the code to create the actual target files is in place.

Remaining work before merging this PR includes:

Incorporating the external dependency on gaia_zeropoint mentioned in Issue #812.
Determining how to merge the targets with the existing targets in the MTL ledgers.
Adding final priorities, and numbers of observations, for the GD1 targets (once they're determined).

Possible additional work might include:

Adding units for some of the new Gaia quantities that haven't previously been in the DESI data model.
Adding code to read the LS sweeps (and match to Gaia DR3) per-sweep rather than per-stream. This would be useful if we extend targeting to multiple stream programs in the future.

@segasai and/or @sazabi4. I posted this PR before completely finishing my implementation in the hope that you could check my working. Could you please:

Check that the cache in /global/cfs/cdirs/desi/target/catalogs/streamcache/gd1-dr9-cache.fits is in agreement with @segasai's database-derived cache.
- In particular, that the Gaia quantities seem correctly matched to the Legacy Surveys sources.
Check that the targets in /pscratch/sd/a/adamyers/dr9/2.7.0.dev5481/streamtargets/main/resolve/bright/streamtargets-gd1-bright.fits seem reasonable. The different classes are stored as bit-masks in the SCND_TARGET column, with the following interpretations:
- Bit-nameGD1_BRIGHT_PM, bit-number 2**55, "Bright targets with Gaia in GD1 stream"
- Bit-name GD1_FAINT_NO_PM, bit-number 2**56, "Faint targets without Gaia in GD1 stream"
- Bit-name GD1_FILLER, bit-number 2**57, "Filler targets in GD1 stream"
Try to determine whether we can make the cache smaller. Currently, the read_data() function provided with Issue #812 produces ~113M sources. This is manageable, but large.

coveralls commented 6 months ago

coverage: 53.312% (-2.6%) from 55.87% when pulling 6b7b7df1606e3df262909967866c47e5f63d856e on ADM-stream-MWS into 9e3bcd17c630608ae4173a4671e51325b2b8db50 on main.

segasai commented 5 months ago

Hi @geordie666,

I've just started looking at the gd1-cache file and seeing a large (81 mill) number of sources with RA,DEC=0 (without looking in detail of the code, that is probably not right ? Based on table content these look like sources without Gaia info, so probably for those source their RA,DEC was not substituted by LS RA, LS DEC?

Also the cache file includes the stars with dec<-30. in my original query I was cutting at dec=-20. Presumably we should do the same thing here ?

Thanks!

segasai commented 5 months ago

Doing some further checks of the cache catalogue.
Currently the catalogue has sources with some gaia magnitudes set to zero i.e. G=0 and not other RP=20.8. From my previous experience with various LS/inputs catalogs setting values to zero seems to be standard way of dealing with Nulls, but I personally find this extremely error-prone and inviting for bugs (I.e. having PMRA=0 as null value ... ) . So I am wondering if maybe at least for the purposes of this xmatch, we could have a catalog where sources without PM or G mag has it as NaN. In that case we can be confident that there will be no sources that by mistake satisfy selection criteria. (the way I found this issue is I applied G<20 selection to both my local database gaia-LS xmatch and saw the discrepancy in numbers with the cache file here).

(It is possible that there are fundamental reasons why this change is impossible, that I am missing)

geordie666 commented 5 months ago

Doing some further checks of the cache catalogue. Currently the catalogue has sources with some gaia magnitudes set to zero i.e. G=0 and not other RP=20.8. From my previous experience with various LS/inputs catalogs setting values to zero seems to be standard way of dealing with Nulls, but I personally find this extremely error-prone and inviting for bugs (I.e. having PMRA=0 as null value ... ) . So I am wondering if maybe at least for the purposes of this xmatch, we could have a catalog where sources without PM or G mag has it as NaN. In that case we can be confident that there will be no sources that by mistake satisfy selection criteria. (the way I found this issue is I applied G<20 selection to both my local database gaia-LS xmatch and saw the discrepancy in numbers with the cache file here).

(It is possible that there are fundamental reasons why this change is impossible, that I am missing)

I think the way we've always accomplished this in the past is to include an IVAR. And if the IVAR is 0 that means the source is unmatched, because the error is infinite. In the case of magnitudes, we just never want to target anything anywhere as bright as 0.

I'd need an official opinion from @sbailey, but I think we try very hard not to have NaN values in any of our catalogs. And PM would definitely propagate downstream as NaN.

segasai commented 5 months ago

I understand that is the way LS is done. I also understand that it may be too late/impossible to do about it at this point. My main concern is that it is done to Gaia data, directly contradicting the official Gaia data. But IMO that is a valid concern, since any selection I will do on the official gaia data will give something different here.

geordie666 commented 5 months ago

Perhaps I could allow NaNs only in the cache file? Would that work? When I write it out to an actual target file I could change the NaNs back to the Data Systems standard.

geordie666 commented 5 months ago

Actually, even the Gaia files to which I'm cross-matching already have the NaNs removed. I could, though, substitute values of NaN wherever there is a value of PMRA_ERROR or PMDEC_ERROR = 0. I could also set values of PHOT_G_MEAN_MAG to something other than 0 for the purposes of the cache. Let me know if that's critical and I can back-engineer that.

segasai commented 5 months ago

Thanks @geordie666 (& sorry I didn't mean to create more work for you).

IMO the best solution is to have the Gaia files closer to 'original' (i.e. with nans in place), and only substitute the nans with zeros when writing the end product.
the second best is to do substitution that you suggested. I personally think it is important to have the selections that would work with original gaia data.

geordie666 commented 5 months ago

I've updated my creation of the cache file to cut at dec > -20o and to correctly use LS information from LS and Gaia information from Gaia (i.e. there are no RA, DEC = 0 objects anymore). I also corrected the pm12_sel_func call as suggested.

There's an updated cache file at /global/cfs/cdirs/desi/target/catalogs/streamcache/gd1-dr9-cache.fits and an updated file of targets at /pscratch/sd/a/adamyers/blat/dr9/2.7.0.dev5481/streamtargets/main/resolve/bright/streamtargets-gd1-bright.fits

Something is still wrong, though, as there are now overlapping targets in the bright_pm and faint_no_pm target categories. And I recover a lot of targets (10 million).

@segasai: I'll work on doing the substitutions for NaNs into the Gaia cache. To help me out with debugging the target selection in general, though, could you please:

Put a FITS version of your cache file somewhere at NERSC so I can try to debug my cache.
Try running your cache file through the desitarget.streams.cuts.is_in_GD1(objs) function (where objs is the result of reading in your cache) to see if I have a wider bug in the selection. You might have to do a little manipulation as my cache expects IVARs instead of errors. I'll also fix that when I'm rewriting the cache file to resemble yours.

segasai commented 5 months ago

Thanks @geordie666 Here's my cache file: /pscratch/sd/k/koposov/gd1cache/gd1cache_tonersc.fits

I'll try to do 2) In the previous iteration -- I've found the proper motion selection bug, because the number of targets didn't look at all like my selection -- here's my output selection file (formatted like a standard tertiary program file) /pscratch/sd/k/koposov/gd1cache/gd1whole-240109.fits so you can also check against that.

segasai commented 5 months ago

while trying 2) I noticed that the streams and zeropoint folders lack init.py ( I can't import after install without those)

When comparing selections I did this (where X is my cache table) :

import desitarget.streams.cuts
X1={
   'PARALLAX_IVAR':1./X['parallax_error']**2,
   'PMRA_IVAR':1./X['pmra_error']**2,
   'PMDEC_IVAR':1./X['pmdec_error']**2,
    'TYPE':np.array(['PSF']*len(X['ra']))
   }

for k in ['astrometric_params_solved',
          'phot_g_mean_mag',
          'nu_eff_used_in_astrometry',
          'pseudocolour',
          'ecl_lat','ebv','flux_g','flux_r','flux_z',
         'parallax','pmra','pmdec','ra','dec'] :
    X1[k.upper()]=X[k]

xsel=desitarget.streams.cuts.is_in_GD1(atpy.Table(X1))

And here's the comparison for my selections ( left is the current desitatarget code, right is my selection, rightmost overlap)

print(xsel[0].sum(), cat1_sel.sum(), (xsel[0]&cat1_sel).sum())
print(xsel[1].sum(), cat2_sel.sum(), (xsel[1]&cat2_sel).sum())
print(xsel[2].sum(), cat3_sel.sum(), (xsel[2]&cat3_sel).sum())

BRIGHT_PM 54068 58411 53594 FAINT_PM 0 81473 0 FILLER 4799044 1286806 1286598

So the main selection is close (but not quite right), FAINT_PM is not, FILLER has too many things.

segasai commented 5 months ago

Something is still wrong, though, as there are now overlapping targets in the bright_pm and faint_no_pm target categories. And I recover a lot of targets (10 million).

Regarding this -- this is a consequence of the '0+/- inf' pm convention. The selection in pms i.e. gaia_astrom_sel is supposed to only select objects with pm measurements. But as it's written it will select objects with out pm as well as they satisfy the criteria of |0-pmtrack|<10^8 A quick short term fix is to add the PMRA!=0 to gaia_astrom_sel.

geordie666 commented 5 months ago

To let people know where I am on this: While @segasai was tracking down the selection bug (thanks!) I've written the code to add the new stream targets to the MTLs. That was the bulk of my recent push.

I'm now going to go back to the actual selection of the stream targets to try to align my selections as closely as possible with @segasai's.

geordie666 commented 4 months ago

@segasai: I think I'm pretty close. I substituted NaNs back into the cache and modified the proper motion cuts to match your original code, and the targets look very reasonable.

Our caches:

You (/pscratch/sd/k/koposov/gd1cache/gd1cache_tonersc.fits) Me (/global/cfs/cdirs/desi/target/catalogs/streamcache/gd1-dr9-cache.fits)

agree to ~1%. I'm not sure why there are differences. I noticed that your version of the cache contains some duplicate source_id values, but that only comprises ~200,000 sources, so is insufficient to explain all of the differences.

Our targets:

You (/pscratch/sd/k/koposov/gd1cache/gd1whole-240109.fits) Me (/pscratch/sd/a/adamyers/blat/dr9/2.7.0.dev5481/streamtargets/main/resolve/bright/streamtargets-gd1-bright.fits)

Agree to within ~100 objects.

If you're willing to have another go at tracking down differences, please do so. But, maybe you're happy that this is all close enough?

segasai commented 4 months ago

Thanks @geordie666 I don't think the differences are big enough to worry too much, but I still prefer not to have something unexplained. Looking at the sources that are in my cache catalogue but not in yours shows a lot of differences near the edges of the selection/footprint: here's the ra,dec histogram for sources in my catalogue but not in yours:

I don't quite know what could be the cause of that. Presumably that's the edge of the legacy footprint.

Another issue I think is in the gaia matching. The code here: https://github.com/desihub/desitarget/blob/d5e864372504b9af461de32e3a95696a578a2e44/py/desitarget/gaiamatch.py#L1499

seems to find all possible pairs with <1" separation, but I think it needs to select the closest one. A common thread I see are disappearing objects that have a neighbor -- i.e. 151.80888 26.4963 is in my 'cache' but not yours.

geordie666 commented 4 months ago

@segasai: Your example case (151.80888, 26.4963) and, it seems, ~90% of the discrepancies between our caches, were caused by me imposing FLUX_G > 0 and FLUX_Z > 0 on objects in my cache.

I think this was, again, to avoid NaNs on my part when calculating MAG = 22.5 - 2.5*np.log10(FLUX).

I turned this off for the purposes of constructing the cache and our caches now agree to ~0.1%. The new version of my cache is again at /global/cfs/cdirs/desi/target/catalogs/streamcache/gd1-dr9-cache.fits

I'll keep digging into the other discrepancies.

geordie666 commented 4 months ago

@segasai: I think the final ~0.14% of discrepancies might be caused by this bug in the Legacy Surveys. I'm not 100% sure yet, but I thought I'd report back. If that is the case then our caches are both correct in some sense.

segasai commented 4 months ago

Thanks Adam. I think the cache files seem to have an identical number of sources and each source in my cat has a counterpart in yours, so I think the cache file is probably good. I haven't rechecked the actual selection.

geordie666 commented 4 months ago

The caches are now down to 17 total differences (out of 75M objects). These all appear to be due to floating-point precision issues.

Five of the differences are sources in the Legacy Surveys that the code in this PR deems to be (just) > 1 arcsecond from a Gaia source:

RA           Dec          REF_ID             Separation (")
129.51538014 25.32521806  702621263391967232 1.0014
168.12146167 50.09423204  838069207464032512 1.0288
146.65786853  19.5503847  627505240320862592 1.0670
263.2149366  43.75084479 1348697938003590016 1.0763
309.9488796    7.0563114 1748273485880947328 1.0066

Twelve of the differences are sources where the code in this PR (desitarget) finds a Gaia source that seems to be slightly closer to a Legacy Surveys source than for @segasai's code (the Sep here are separations from the Legacy Surveys coordinates in arcseconds):

     Legacy Surveys                     Match found by desitarget                           Match found by Sergey
RA           Dec         REF_ID              RA           Dec         Sep   REF_ID              RA           Dec         Sep 
140.25928713 23.99527373  639467235341444224 140.25923463 23.99525658 0.183  639467239635997184 140.25933539 23.99530400 0.193
145.81741152 25.36438824  646364750955477632 145.81733809 25.36441438 0.257  646364750955144448 145.81749552 25.36438190 0.274
153.95372618 32.49832218  746076196303063552 153.95364789 32.49821260 0.461  746076196303585024 153.95357288 32.49833868 0.469
312.91067273 -9.49986290 6902718094915517952 312.91056826 -9.49989348 0.387 6902718094914436096 312.91077170 -9.49981150 0.397
254.42544433 43.47463253 1358073267495512064 254.42544347 43.47458774 0.161 1358073267496824704 254.42543889 43.47467777 0.163
262.90505614 36.80445748 1336428723972968576 262.90504550 36.80437986 0.281 1336428728268867712 262.90506782 36.80453546 0.283
272.48468842 34.12077737 4605254704337803648 272.48472504 34.12080262 0.142 4605254704334578944 272.48465522 34.12074902 0.142
274.94303824 36.32641950 4605773673826916992 274.94304024 36.32638227 0.134 4605773669527696640 274.94303325 36.32645735 0.137
152.73643870 47.56873175  822537570633137408 152.73640099 47.56869718 0.155  822537574929143808 152.73647141 47.56878150 0.196
124.56700149  8.46278715 3098115117838471936 124.56701581  8.46271501 0.265 3098115113543498240 124.56699317  8.46286133 0.269
125.79839230  9.95574292  600378501758441088 125.79840695  9.95579900 0.208  600378501756996480 125.79839229  9.95567994 0.227
132.82343218 10.33717773  598561764951278336 132.82342648 10.33713803 0.144  598561764950747520 132.82346136 10.33720937 0.154

So, basically, the correspondence between caches should be as good as it reasonably can be, now.

segasai commented 4 months ago

Thanks for the updates and fixing the Gaia-xmatch bug @geordie666 With the cache files essentially identical, are the stream selections identical as well ?

geordie666 commented 4 months ago

I still see some differences in the stream selection. I'll dig into that next. It's likely related to choosing to leave NaNs in the cache file, now.

geordie666 commented 4 months ago

@segasai: There's a new version of the target file here:

/pscratch/sd/a/adamyers/blat/dr9/2.7.0.dev5481/streamtargets/main/resolve/bright/streamtargets-gd1-bright.fits

It's very close, now, to your version:

The faint sample is identical.
The bright sample has on extra target in my list, which is one of the rounding-error/matching cases where an object is ~1.00 arcseconds from a Gaia object.
Your version of the filler sample has an extra 16 targets that are, I think, removed by the desitarget checks on target brightness.

Let me know if you think this is good enough to proceed.

segasai commented 4 months ago

Thank you @geordie666 ! I think those numbers look good enough for me!

geordie666 commented 3 months ago

I think I've now finished the code for the end-to-end process of adding the GD1 stream targets. Previously I had:

Written a reasonable cache.
Written reasonable targeting files.

I've now also written code to add the new targets to the MTL ledgers. The overall process is as described in this thread on Slack. New targets are added to the secondary ledgers, and existing targets are merged with primary/secondary targets adopting the state of the highest-priority target.

An example of how adding the new targets would look after running on the current MTL ledgers is here:

/pscratch/sd/a/adamyers/mtlsandbox/mtl

And an example of "merging" a new and old target is:

grep 39627660212571777 /pscratch/sd/a/adamyers/mtlsandbox/mtl/main/bright/mtl-bright-hp-10232.ecsv | awk '{print $7, $8, $9, $10, $11, $15, $16, $24, $27}'

39627660212571777 2305843017803628544 0 1280 0.24369827786837128 0 2 MWS_MAIN_BLUE|UNOBS 1500
39627660212571777 6917529036231016448 0 1280 0.2004516496327986 36028797018963968 4 GD1_BRIGHT_PM|UNOBS 1520

where the printed columns are:

TARGETID DESI_TARGET BGS_TARGET MWS_TARGET SUBPRIORITY SCND_TARGET NUMOBS_MORE TARGET_STATE PRIORITY

I think I'm basically ready to merge and tag the code and process the new targets. Before I do that, though, we should finalize whether these are the correct priorities and numbers of observations for the GD1 stream targets:

https://github.com/desihub/desitarget/blob/ADM-stream-MWS/py/desitarget/data/targetmask.yaml#L420-L422 https://github.com/desihub/desitarget/blob/ADM-stream-MWS/py/desitarget/data/targetmask.yaml#L574-L576

geordie666 commented 3 months ago

This PR is now almost ready to merge. I'm waiting on exact priorities for the new target classes from the Milky Way Group, then I'll proceed.

desihub / desitarget

Target selection for the first MWS Stream Program (GD1) #814