This PR rewrites the script write_gcr_to_parquet.py that takes a catalog in GCRCatalogs and writes it to a parquet file.
The original script is not memory-efficient and does not use some of pyarrow's native features (at the time, we were supporting both pyarrow and fastparquet). Now that we have settled on pyarrow, it makes sense to rewrite much of this script to take advantage of those features.
The PR also now allows users to specify config_overwrite to be passed to GCRCatalogs. In particular, this enables users to load only one tract at a time.
@johannct can you test this? I was able to use this new script to generate one tract of dc2_object_run2.2i_dr3 using NERSC's Jupyter notebook in just a few seconds. On your machine it might even work with all tracts at once. The call signature is