LSSTDESC / DC2-production

Configuration, production, validation specifications and tools for the DC2 Data Set.
BSD 3-Clause "New" or "Revised" License

Rewrite write_gcr_to_parquet.py #395

Closed yymao closed 4 years ago

yymao commented 4 years ago

This PR rewrites the script write_gcr_to_parquet.py that takes a catalog in GCRCatalogs and writes it to a parquet file.

The original script is not memory-efficient and does not use some native pyarrow features (at the time it was written, we were supporting both pyarrow and fastparquet). Now that we have settled on pyarrow, much of the script should be rewritten to take advantage of those features.

The PR also now allows users to specify config_overwrite to be passed to GCRCatalogs. In particular, this enables users to load only one tract at a time.

@johannct can you test this? I was able to use this new script to generate one tract of dc2_object_run2.2i_dr3 in a Jupyter notebook at NERSC in just a few seconds. On your machine it might even work with all tracts at once. The call signature is

   python write_gcr_to_parquet.py dc2_object_run2.2i_dr3 [--tract tract]
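As a rough sketch of how a command line with this signature could be parsed and turned into a `config_overwrite` dictionary, consider the following. It is only an illustration under stated assumptions: the function names are invented here, and the `"tracts"` key is a hypothetical placeholder, since the thread does not say which configuration key the script actually overrides.

```python
import argparse


def parse_args(argv=None):
    """Parse a call of the form: <catalog_name> [--tract TRACT]."""
    parser = argparse.ArgumentParser(
        description="Write a GCRCatalogs catalog to a parquet file (sketch)."
    )
    parser.add_argument("catalog_name", help="name of the catalog to load")
    parser.add_argument(
        "--tract", type=int, default=None,
        help="if given, load only this tract",
    )
    return parser.parse_args(argv)


def build_config_overwrite(tract=None):
    """Build the dict that would be passed on as config_overwrite.

    NOTE: the "tracts" key is an assumption made for this example;
    the real key depends on the catalog reader's configuration.
    """
    if tract is None:
        return {}
    return {"tracts": [tract]}


if __name__ == "__main__":
    args = parse_args()
    print(args.catalog_name, build_config_overwrite(args.tract))
```

The resulting dictionary would then be handed to `GCRCatalogs.load_catalog(args.catalog_name, config_overwrite=...)`, which accepts a `config_overwrite` argument.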
johannct commented 4 years ago

fixed with LSSTDESC/gcr-catalogs/#437