dask / fastparquet

Python implementation of the Parquet columnar file format.
Apache License 2.0

Should `partition_on` columns be included in the pandas_metadata? #260

Open TomAugspurger opened 6 years ago

TomAugspurger commented 6 years ago

I think they should be. Just checking if there was a reason they weren't @martindurant

In [20]: import pandas as pd

In [21]: import fastparquet as fp

In [22]: import json

In [23]: df = pd.DataFrame({"A": [1, 2], 'B': [3, 4]}, index=pd.Index(['a', 'b'], name='C'))

In [24]: fp.write("foo.parq", df, partition_on=['B'], file_scheme='hive')

In [25]: json.loads(fp.ParquetFile("foo.parq").fmd.key_value_metadata[0].value)
Out[25]:
{'columns': [{'metadata': None,
   'name': 'C',
   'numpy_type': 'object',
   'pandas_type': 'unicode'},
  {'metadata': None,
   'name': 'A',
   'numpy_type': 'int64',
   'pandas_type': 'int64'}],
 'index_columns': ['C'],
 'pandas_version': '0.22.0.dev0+131.g63e8527d3'}
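The partition column `B` is absent from the `columns` list above. For illustration, here is what the metadata would look like with an entry for `B` added alongside `C` and `A` — a sketch only; the exact `numpy_type`/`pandas_type` strings fastparquet would write for a partition column are an assumption:

```python
import json

# The pandas metadata as currently written: only the index 'C' and the
# data column 'A' appear; the partition column 'B' is missing.
current = {
    "columns": [
        {"metadata": None, "name": "C", "numpy_type": "object", "pandas_type": "unicode"},
        {"metadata": None, "name": "A", "numpy_type": "int64", "pandas_type": "int64"},
    ],
    "index_columns": ["C"],
}

# Hypothetical entry for the partition column; the dtype strings are a guess
# based on B's values ([3, 4]) in the example above.
partition_entry = {"metadata": None, "name": "B",
                   "numpy_type": "int64", "pandas_type": "int64"}

proposed = dict(current, columns=current["columns"] + [partition_entry])
print(json.dumps(proposed["columns"], indent=1))
```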
martindurant commented 6 years ago

I suppose they should be in the global metadata, but not in the individual data files. Is it acceptable to have the metadata differ between the two places? You could have them in the data files only if you explicitly ignore them on load.

xhochy commented 6 years ago

I would also include this in the _common_metadata and _metadata files but not in the individual files. The individual files themselves should contain exactly the information needed to load them standalone. The *metadata files, by contrast, describe the schema of the whole dataset/table.
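One way to picture that distinction: the dataset-level schema is the per-file schema plus the columns encoded in the directory structure. A minimal sketch (the function name is illustrative, not fastparquet API):

```python
def dataset_columns(file_columns, partition_columns):
    """Columns a reader should present for the whole dataset:
    everything stored inside the data files, plus the columns
    recovered from partition directories like B=3/."""
    return list(file_columns) + [c for c in partition_columns
                                 if c not in file_columns]

# Per the example above: the data files hold A (and index C),
# while B only exists in the paths of the hive layout.
print(dataset_columns(["C", "A"], ["B"]))  # ['C', 'A', 'B']
```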