LSSTDESC / DC2-production

Configuration, production, validation specifications and tools for the DC2 Data Set.
BSD 3-Clause "New" or "Revised" License

Issues/326 - Add DPDD Source Catalog data product #330

Closed wmwv closed 5 years ago

wmwv commented 5 years ago

Source Catalog files are now generated and installed at NERSC using the new merge_source_cat.py script and the associated SLURM job execution. 129,512,365 Source Ids.

Produces one Parquet file per visit. Each file is 10-20 MB, with the entire set of 1,995 visits totaling 25 GB. Reading them all in is slow, taking 5-15 minutes; the variance is likely due to load or memory pressure on the JupyterLab node.
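As a rough sanity check on these figures, the arithmetic can be sketched as below. This is only a back-of-the-envelope calculation: it assumes 1 GB = 1000 MB, and that the quoted Source Id count is the total number of rows across all visit files. The variable names are illustrative and do not come from the PR.

```python
# Back-of-the-envelope check of the numbers quoted in this comment.
# Assumptions: 1 GB = 1000 MB; the Source Id count is the total row count.
N_VISITS = 1_995
N_SOURCES = 129_512_365
TOTAL_GB = 25
READ_MINUTES = (5, 15)  # fastest and slowest reported read times

# Average size of one per-visit Parquet file, in MB.
mean_file_mb = TOTAL_GB * 1000 / N_VISITS

# Average number of sources stored per visit file.
mean_sources_per_visit = N_SOURCES / N_VISITS

# Implied aggregate read throughput (MB/s) for the fastest and slowest reads.
throughput_mb_s = tuple(TOTAL_GB * 1000 / (m * 60) for m in READ_MINUTES)

print(f"mean file size: {mean_file_mb:.1f} MB")
print(f"mean sources per visit: {mean_sources_per_visit:,.0f}")
print(f"implied throughput: {throughput_mb_s[1]:.0f}-{throughput_mb_s[0]:.0f} MB/s")
```

The mean file size falls inside the quoted 10-20 MB range, so the totals are self-consistent. The implied throughput of a few tens of MB/s suggests why reading every file is slow.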

Files were processed with an 8-node TaskFarmer SLURM job. The job took 4 hours to run (preceded by 36 hours waiting in the queue).

There is an updated scripts/README.md that details what was done to produce these.

There is a Notebook/verify_source_table.ipynb to test simple properties.

There is a reader in the issues/274 branch of gcr-catalogs. The above Notebook shows how to use it if you've checked out a local copy of `gcr-catalogs`.

Future work should optimize performance for specific use cases. The current performance will not scale to Run 2.1.

wmwv commented 5 years ago

This can be run either matching to the DPDD Object Table (if passed a --reader) or individually reading and matching against the coadd merged detection reference catalogs via the butler (if no --reader is passed). The resulting Object IDs are the same in both modes; I have verified this.
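The two modes described above could be selected along the following lines. This is a hypothetical sketch, not the actual argument handling in merge_source_cat.py: only the `--reader` flag comes from the comment, while the function name `choose_matching_mode` and the mode labels are invented for illustration.

```python
import argparse


def choose_matching_mode(argv):
    """Pick the matching mode from command-line arguments.

    Hypothetical sketch: only the --reader flag is taken from the PR
    discussion; the returned mode labels are illustrative.
    """
    parser = argparse.ArgumentParser(description="Sketch of mode selection")
    parser.add_argument(
        "--reader",
        default=None,
        help="Name of a reader for the DPDD Object Table (optional)",
    )
    args = parser.parse_args(argv)

    if args.reader is not None:
        # A reader was given: match against the DPDD Object Table.
        return ("object_table", args.reader)
    # No reader: match against the coadd merged detection reference
    # catalogs, read individually via the butler.
    return ("butler_reference_catalogs", None)


print(choose_matching_mode(["--reader", "some_reader"]))
print(choose_matching_mode([]))
```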

wmwv commented 5 years ago

@jiwoncpark Thanks for the review! Minor comments resolved. Small function added to ensure visit list is unique.

Could you be kind enough to take another look?
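For readers without the diff at hand, a function that ensures a visit list is unique might look like the sketch below. The name `unique_visits` and the choice to preserve first-seen order are assumptions; the function actually added in the PR may differ.

```python
def unique_visits(visits):
    """Return the visits with duplicates removed, keeping first-seen order.

    Hypothetical helper: dict keys are unique and preserve insertion
    order (Python 3.7+), which gives an order-preserving de-duplication.
    """
    return list(dict.fromkeys(visits))


# Example with made-up visit numbers containing repeats.
print(unique_visits([3, 1, 3, 2, 1]))
```

Preserving order keeps the processing sequence predictable. If order does not matter, `sorted(set(visits))` would be an equally simple alternative.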

wmwv commented 5 years ago

@yymao Can you formally approve this (in your role as blessed approver) so that I can merge it? @jiwoncpark is happy with it.