LSSTDESC / DC2-production

Configuration, production, validation specifications and tools for the DC2 Data Set.
BSD 3-Clause "New" or "Revised" License

Ensure schema is consistent across files generated by write_gcr_to_parquet.py #411

Closed yymao closed 3 years ago

yymao commented 3 years ago

This PR updates write_gcr_to_parquet.py to ensure the parquet schema is consistent across files generated within one GCR catalog.

The issue was first noticed by @wmwv in https://github.com/LSSTDESC/gcr-catalogs/pull/515#issuecomment-744687756. The cause is that get_quantities returns a Python dictionary. While Python dictionaries preserve insertion order as of Python 3.7, the implementation of get_quantities neither relies on that ordering nor makes an effort to keep the keys in a fixed order. Hence, the returned dictionary can have keys in a different order each time it is called.
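A minimal illustration of the underlying behavior (the column names here are hypothetical, not taken from get_quantities): two dictionaries holding the same columns, assembled in different orders, iterate in different orders, and sorting the keys restores a deterministic order.

```python
# Dicts preserve insertion order (Python 3.7+), so two calls that
# assemble the same columns in different orders yield different
# key orders when iterated.
d1 = {}
d1["ra"] = [0.1]
d1["dec"] = [0.2]

d2 = {}
d2["dec"] = [0.2]
d2["ra"] = [0.1]

print(list(d1))      # ['ra', 'dec']
print(list(d2))      # ['dec', 'ra']
print(sorted(d1))    # ['dec', 'ra'] -- sorting gives a fixed order
```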

This PR makes two changes to ensure schema consistency:

  1. First, in _chunk_data_generator, where an Arrow table is created from the dictionary returned by get_quantities, we now iterate over sorted column names instead of dictionary keys.
  2. _write_one_parquet_file now accepts a schema as an optional argument, so multiple calls to _write_one_parquet_file can share the same schema, ensuring the resulting files have identical schemas.

Note: this PR is made with respect to the branch in #407, so that the new changes can be more easily identified. This PR should be merged into master after #407 is merged.

yymao commented 3 years ago

Thanks @wmwv -- Files from this can be found on NERSC at

/global/cscratch1/sd/yymao/desc/dc2_object_run2.2i_dr6_wfd_v1.new2

yymao commented 3 years ago

The following code snippet now works on this new data set:

import pyarrow.parquet as pq
dir_path = "/global/cscratch1/sd/yymao/desc/dc2_object_run2.2i_dr6_wfd_v1.new2"
pq.ParquetDataset(dir_path)