LSSTDESC / DC2-production

Configuration, production, validation specifications and tools for the DC2 Data Set.
BSD 3-Clause "New" or "Revised" License

add scripts to repartition truth catalogs in tracts and match with object catalogs #403

Closed yymao closed 3 years ago

yymao commented 3 years ago

As part of the DC2 public data release, we want to repartition the truth catalogs into tracts, so that end users can access them more easily and do their own matching if they want to. Our truth catalog will include a basic nearest-neighbor match.

There are two scripts added in this PR: (1) repartition_into_tracts.py, which repartitions a (truth) catalog into tracts for a given sky map, and (2) merge_truth_per_tract.py, which merges (truth) catalogs in a given (tract) directory and then runs nearest-neighbor matching against another (object) catalog.
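The repartitioning step can be sketched roughly as follows. This is a toy illustration, not the actual script: `find_tract` stands in for the LSST sky-map lookup (`skymap.findTract` on ra/dec), and the real script writes per-tract parquet files rather than returning DataFrames.

```python
import pandas as pd

def repartition_into_tracts(catalog, find_tract):
    """Group catalog rows by the tract containing each (ra, dec).

    `find_tract` is a hypothetical stand-in for the sky-map lookup:
    any callable mapping (ra, dec) -> tract id.  Returns a dict of
    tract id -> sub-catalog; the real script would instead write each
    group out as a per-tract parquet file.
    """
    tracts = [find_tract(ra, dec) for ra, dec in zip(catalog["ra"], catalog["dec"])]
    return {
        tract: group.drop(columns="_tract").reset_index(drop=True)
        for tract, group in catalog.assign(_tract=tracts).groupby("_tract")
    }
```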

I have run the scripts to generate one tract of the final truth catalog data product. Some verification and inspection will follow.

yymao commented 3 years ago

Data model for the truth-match catalog:

The rows are in the same order as the object catalog, with unmatched truth entries in the same tract appended right after. So one can easily join this table with the object catalog by

object_cat = pd.concat([object_cat, truth_cat[:len(object_cat)]], axis=1)

Columns in the truth-match table are the ones in the original truth table plus tract, patch, cosmodc2_hp, cosmodc2_id, is_sn, match_sep, match_objectId, is_unique_truth_entry.
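A minimal sketch of this data model, with toy stand-in tables (the real tables carry the full column set listed above; the `-1` sentinel for unmatched rows is only illustrative):

```python
import pandas as pd

# Toy stand-ins for one tract's object catalog and truth-match table.
object_cat = pd.DataFrame({"objectId": [1, 2, 3], "mag_r": [20.1, 21.5, 22.3]})
truth_cat = pd.DataFrame({
    "truth_id": [101, 102, 103, 104, 105],
    "match_objectId": [1, 2, 3, -1, -1],  # -1 marks unmatched (hypothetical sentinel)
})

# The first len(object_cat) rows of the truth-match table line up with
# the object catalog, so a column-wise concat performs the join.
joined = pd.concat(
    [object_cat, truth_cat.iloc[:len(object_cat)].reset_index(drop=True)],
    axis=1,
)

# The remaining rows are truth entries in this tract with no matched object.
unmatched_truth = truth_cat.iloc[len(object_cat):]
```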

Matching scheme

Simple nearest-neighbor matching using astropy's match_to_catalog_sky, with no maximum separation imposed. The match is done within each tract, and we did not take the boundary effect into account.

fjaviersanchez commented 3 years ago

@yymao Thanks a lot! This is great. My only comment is that, if you pre-filter object_cat using a GCRQuery, then you would have to change the matching method (and do it by objectId), so adding that to the documentation would be good.

This for sure makes matching faster!
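For a pre-filtered object catalog, where the positional concat no longer lines up, an objectId-based join might look like this (toy tables; column names follow the data model above):

```python
import pandas as pd

# object_cat has been filtered (e.g. via a GCRQuery), so row order no
# longer matches the truth-match table; join on objectId instead.
object_cat = pd.DataFrame({"objectId": [2, 3], "mag_r": [21.5, 22.3]})
truth_cat = pd.DataFrame({
    "truth_id": [101, 102, 103],
    "match_objectId": [1, 2, 3],
})

joined = object_cat.merge(
    truth_cat, left_on="objectId", right_on="match_objectId", how="left"
)
```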

jchiang87 commented 3 years ago

This seems fine to me too. I assume the boundary effect you mention is the ambiguity at the very edges of the tracts: using the truth catalog coordinates may put an object in a different tract than the corresponding object would be assigned using the measured coordinates. I don't personally think this is a big deal, but we should make it clear that this is an issue.

yymao commented 3 years ago

Thank you for the feedback!

@fjaviersanchez: Good point about the issue with using filters with GCR. I am thinking that we may be able to implement a reader option (say, for_composite) in the new truth-match catalog reader which, when set, removes the extra unmatched rows so that the GCR composite would just work as usual. When that reader option is not set, all rows are returned. People who want to access the files directly will of course need to be careful, and we should note that in our release note.
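A hedged sketch of what that proposed reader option could do (the option name and function are hypothetical; the real reader would know the object-catalog length from its metadata):

```python
import pandas as pd

def load_truth_match(truth_match_df, n_objects, for_composite=False):
    """Sketch of the proposed for_composite reader option (names hypothetical).

    The first n_objects rows of the truth-match table are aligned with
    the object catalog; the rest are unmatched truth entries.  With
    for_composite=True the extra rows are dropped so a row-wise GCR
    composite works as usual; otherwise all rows are returned.
    """
    if for_composite:
        return truth_match_df.iloc[:n_objects].reset_index(drop=True)
    return truth_match_df
```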

@jchiang87: Yes, that is the boundary effect I was talking about (I should have been more specific; thank you for the addition!). I too think the fraction of objects affected is very small. I'll see if I can obtain an estimate. In any case we will certainly document this in the release note.

fjaviersanchez commented 3 years ago

Thanks! That sounds good to me. Also, re: the edge issue, I thought the tracts themselves overlap, so that would solve the problem as long as we leave enough margin. The only issue would arise if we use the coordinates of the object-catalog sources to define the boundaries for the truth-catalog tracts (in the sense that in the object catalogs we remove the tract overlaps by selecting detect_isPrimary). Is that right? Or is there any instance in which we would lose a match?

yymao commented 3 years ago

We indeed removed duplicates in the object catalogs (by filtering on detect_isPrimary). In the truth catalog case, however, each unique truth entry is assigned to a unique tract based on skymap.findTract, and hence there would not be a built-in overlap between tracts.

Hence, if a truth entry is really close to the tract boundary and its detected source happens to be primary in the neighboring tract, then this match won't be established in the current scheme. I bet this happens very rarely, but I will try to quantify it.

yymao commented 3 years ago

I'm continuing to validate this truth-match catalog, and I found a potential issue with the underlying truth catalog this morning.

Some galaxy truth entries that appear in Javi's matched catalog (which matches to cosmoDC2 + truth stars) do not appear in this new catalog (which uses truth galaxies + stars + SNe). These missing galaxies are not due to the boundary effect that we discussed earlier, but are in fact missing from the underlying truth-galaxies parquet files.

Here's an example: galaxy 10379529968, which should be in healpix 10196. I verified that it indeed appears in cosmoDC2, but not in the truth galaxies.

(image attached)

This is just to report the issue. Next step is to figure out why. cc @JoanneBogart.

yymao commented 3 years ago

I believe I found the cause. The truth catalog generation script applies a magnitude cut on cosmoDC2 mag_r_lsst < 29 (see here and here).

Below I plot all objects that are in Javi's matched catalog but not in the new truth-match catalog; they are all galaxies. The orange points all have mag_r_lsst >= 29, while the red points have mag_r_lsst < 29. All the red points are indeed near the tract boundary, as we expected.

(image attached)

This exercise also helps quantify the boundary effect. In this tract, there are 22 objects that have matches in neighboring tracts, which corresponds to about 2 in every 1e5 objects being affected by the tract split.
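The comparison described above can be sketched like this (toy tables; the ids and magnitudes are illustrative, only the mag_r_lsst < 29 cut is from the thread):

```python
import pandas as pd

# Objects present in the earlier matched catalog but absent from the
# new truth-match catalog, split by the cut applied at truth-catalog
# generation.  All values here are toy stand-ins.
old_matched = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "mag_r_lsst": [24.0, 29.5, 30.1, 25.0],
})
new_truth = pd.DataFrame({"id": [1]})

missing = old_matched[~old_matched["id"].isin(new_truth["id"])]
cut_by_mag = missing[missing["mag_r_lsst"] >= 29]     # removed by the magnitude cut
boundary_cases = missing[missing["mag_r_lsst"] < 29]  # candidates for the boundary effect
```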

yymao commented 3 years ago

I believe that the mag_r_lsst < 29 cut in the truth catalogs is consistent with what went into the image simulation, and hence it is correct that the truth-match catalog we are producing here inherits this cut.

@jchiang87 @JoanneBogart can you confirm? Thanks!

cwwalter commented 3 years ago

Hi Yao,

I have never seen that flavor='spark' option before. Will that cause problems using it without spark or with Dask? What does it actually do?

yymao commented 3 years ago

@cwwalter Spark has more constraints on the schema than others. Setting flavor="spark" makes sure the schema is sanitized to be compatible with Spark. The resulting parquet file is still a valid parquet file and works with dask and others.

On a side note, to use dask on the parquet files generated by this script, one also needs to install python-snappy.

cwwalter commented 3 years ago

Ah great OK thanks. Yes, I have already had to use python-snappy for my dask work too.

yymao commented 3 years ago

I am now happy with the verification and validation of the truth-match catalog. I plan to merge this PR tomorrow (Dec 7) afternoon (US Eastern time), and welcome feedback and further reviews on this PR and/or the truth-match catalog, which can be found in /global/cscratch1/sd/yymao/desc/truth_run2.2_merged on NERSC.

Once this PR is merged, I think we are ready to proceed to production.

JoanneBogart commented 3 years ago

The code changes since my last review all look fine.

There are a couple of typos and awkward wordings in the notebook headings:

Suggest How are those with different matches distributed spatially?

Suggest Are all of those having different matches present in the new truth catalog?

JoanneBogart commented 3 years ago

I don't know what is standard for validation notebooks, but some explanation under the heading "What about those bright unmatched objects?" of what the test is about and what one should expect to see would be welcome.

yymao commented 3 years ago

Thank you @JoanneBogart! I have edited the headings per your suggestions and also added some explanations under "What about those bright unmatched objects?".

Just as a further reference, a relevant slack thread is here.