Closed yymao closed 3 years ago
The rows are in the same order of as the object catalogs, with unmatched truth entries in the same tract appended right after. So one can easily join this table with object catalog by
object_cat = pd.concat([object_cat, truth_cat[:len(object_cat)]], 1)
Columns in the truth-match table are the ones in the original truth table plus tract
, patch
, cosmodc2_hp
, cosmodc2_id
, is_sn
, match_sep
, match_objectId
, is_unique_truth_entry
.
Simple nearest-neighbor matching using astropy's match_to_catalog_sky
, with no limitation imposed. The match is done within each tract, and we did not take into account the boundary effect.
@yymao Thanks a lot! This is great. My only comment is that, if you pre-filter object_cat using a GCRQuery, then you would have to change the matching method (and do it by objectId
) so adding that in the documentation would be good.
This for sure makes matching faster!
This seems fine to me too. I assume the boundary effect you mention is the ambiguity at the very edges of the tracts: using the truth catalog coordinates may put an object in a different tract than the corresponding object would be assigned using the measured coordinates. I don't personally think this is a big deal, but we should make it clear that this is an issue.
Thank you for the feedback!
@fjaviersanchez: Good point about the issue with using filters
with GCR. I am thinking that we may be able to implement a reader option (say for_composite
) in the new truth-match catalog reader, which when set, it removes the extra unmatched rows such that the GCR composite would just work as usual. And when that reader option is not set, it returns all rows. For people who want to access the files directly they of course need to be careful, and we should note that in our release note.
@jchiang87: Yes, that is the boundary effect I was talking about (I should have been more specific; thank you for the addition!). I too think the fraction of objects affected is very small. I'll see if I can obtain an estimate. In any case we will certainly document this in the release note.
Thanks! that sounds good to me! Also re: edge-issue, I thought that the tracts themselves overlap so that would solve the problem, as long as we leave enough margin. The only issue would come if we use the coordinates in the object catalog sources to define the boundaries for the truth catalog tracts (in the sense that in the object catalogs we remove the tract overlaps by selecting detect_is_Primary
). Is that right? Or is there any instance in which we would lose a match?
We indeed removed duplicates in the object catalogs (by filtering on detect_isPrimary
). In the truth catalog case, however, each unique truth entry is assigned to a unique tract based on skymap.findTract
, and hence there would not be a built-in overlap between tracts.
And hence if there's a truth entry that is really close to the tract boundary and its detected source happens to have be primary in the neighboring tract, then this match won't be establish in the current scheme. I bet this happens very rarely, but will try to quantify.
I'm continuing to validate of this truth-match catalog, and found a potential issue with the underlying truth catalog this morning.
Some galaxy truth entries that appear Javi's matched catalog (which matches to cosmoDC2 + truth stars) do not appear in this new catalog (which uses truth galaxies+stars+SNe). These missing galaxies are not due to the boundary effect that we discussed earlier, but are in fact missing in the underlying truth galaxies parquet files.
Here's a example of galaxy 10379529968
that should be in healpix 10196
. I verified that it indeed appears in cosmoDC2, but not in truth galaxies.
This is just to report the issue. Next step is to figure out why. cc @JoanneBogart.
I believe I found the cause. The truth catalog generation script applies a magnitude cut on cosmoDC2 mag_r_lsst < 29
(see here and here).
Below I plot all objects that are in Javi's matched catalog but not in the new truth-match catalog, they are all galaxies. The orange points all have mag_r_lsst >= 29
, while the red points have mag_r_lsst < 29
. All the red points are indeed near the tract boundary, as we expected.
This exercise also help quantify the boundary effect. In this tract, there are 22 objects that have matches in neighboring tracts. This number correspondings to about 2 in every 1e5 objects affected by the tract split.
I believe that the mag_r_lsst < 29
cut in the truth catalogs is consistent with what went to to the image simulation, and hence it is correct that the truth-match catalog we are producing here should inherit this cut.
@jchiang87 @JoanneBogart can you confirm? Thanks!
Hi Yao,
I have never seen that flavor='spark' option before. Will that cause problems using it without spark or with Dask? What does it actually do?
@cwwalter Spark has more constraints on the schema than others. Setting flavor="spark"
makes sure the schema is sanitized to be compatible with Spark. The resulting parquet file is still a valid parquet file and works with dask and others.
On a side note, to use dask on the parquet files generated by this script, one also needs to install python-snappy
.
Ah great OK thanks. Yes, I have already had to use python-snappy for my dask work too.
I am now happy with the verification and validation of the truth-match catalog. I plan to merge this PR tomorrow (Dec 7) afternoon (US Eastern time), and welcome feedback and further reviews on this PR and/or the truth-match catalog, which can be found in /global/cscratch1/sd/yymao/desc/truth_run2.2_merged
on NERSC.
Once this PR is merged, I think we are ready to proceed to production.
The code changes since my last review all look fine.
There are a couple typos, awkward wordings in notebook headings:
Suggest How are those with different matches distributed spatially?
Suggest Are all of those having different matches present in the new truth catalog?
I don't know what is standard for validation notebooks, but some explanation under the heading "What about those bright unmatched objects?" of what the test is about and what one should expect to see would be welcome.
As part of the DC2 public data release, we want repartition the truth catalogs in tracts, so that end users can access them more easily and do their own matching if they want to. Our truth catalog will include a basic nearest neighbor matching.
There are two scripts added in this PR: (1)
repartition_into_tracts.py
which repartitions a (truth) catalog in to tracts for a given sky map, and (2)merge_truth_per_tract.py
which merges (truth) catalogs in a given (tract) directory, and then run nearest neighbor matching against another (object) catalog.I have run the scripts to generate one tract of the final truth catalog data product. Some verification and inspection will follow.