Closed: yymao closed this 3 years ago
Thanks @wmwv -- Files from this can be found on NERSC at
/global/cscratch1/sd/yymao/desc/dc2_object_run2.2i_dr6_wfd_v1.new2
The following code snippet now works on this new data set:
```python
import pyarrow.parquet as pq

dir_path = "/global/cscratch1/sd/yymao/desc/dc2_object_run2.2i_dr6_wfd_v1.new2"
pq.ParquetDataset(dir_path)
```
This PR updates `write_gcr_to_parquet.py` to ensure the parquet schema is consistent across files generated from one GCR catalog. The issue was first noticed by @wmwv in https://github.com/LSSTDESC/gcr-catalogs/pull/515#issuecomment-744687756.

The cause is that `get_quantities` returns a Python dictionary. While a Python dictionary is always ordered for Python 3.7+, the implementation of `get_quantities` neither relies on that ordering nor makes an effort to keep it consistent. Hence, the returned dictionary can have its keys in a different order every time.

This PR makes two changes to ensure schema consistency:
1. In `_chunk_data_generator`, where an arrow table is created from the dictionary returned by `get_quantities`, we now iterate over sorted column names instead of dictionary keys.
2. `_write_one_parquet_file` now accepts a schema as an optional argument, so multiple calls to `_write_one_parquet_file` can share the same schema, ensuring the resulting files have identical schemas.

Note: this PR is made with respect to the branch in #407, so that the new changes can be more easily identified. This PR should be merged into master after #407 is merged.