apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[Python] parquet.write_to_dataset is memory-hungry on large DataFrames #19025

Open asfimport opened 6 years ago

asfimport commented 6 years ago

See discussion in https://github.com/apache/arrow/issues/1749. We should consider strategies for writing very large tables to a partitioned directory scheme.

Reporter: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-2628. Please see the migration documentation for further details.

asfimport commented 5 years ago

Joris Van den Bossche / @jorisvandenbossche: Similar report in ARROW-2709, which had a linked PR: https://github.com/apache/arrow/pull/3344. That PR was closed without being merged, but it contains some relevant discussion.

The current implementation of pyarrow.parquet.write_to_dataset converts the pyarrow Table to a pandas DataFrame and then uses pandas' groupby to split it into multiple DataFrames (after dropping the partition columns from the DataFrame, which makes yet another data copy). Each pandas subset is then converted back to a pyarrow Table. In addition, when this functionality is used through pandas' to_parquet, there is an extra initial conversion of the pandas DataFrame to an Arrow Table.
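For illustration, a minimal sketch of the pandas round-trip described above (not the actual pyarrow source); the partition column name "part", the helper name, and the output filename are made up:

```python
import os

import pyarrow as pa
import pyarrow.parquet as pq


def write_to_dataset_via_pandas(table: pa.Table, root: str, part_col: str = "part") -> None:
    """Roughly the pandas-based path: every step below copies data."""
    df = table.to_pandas()                        # copy: Arrow Table -> pandas DataFrame
    for value, subset in df.groupby(part_col):    # pandas groupby performs the split
        subset = subset.drop(columns=[part_col])  # copy: drop the partition column
        subtable = pa.Table.from_pandas(subset)   # copy (potentially): pandas -> Arrow
        out_dir = os.path.join(root, f"{part_col}={value}")
        os.makedirs(out_dir, exist_ok=True)
        pq.write_table(subtable, os.path.join(out_dir, "part-0.parquet"))
```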

This is clearly less than optimal. Some of the copies might be avoidable (e.g. it is not clear to me whether Table.from_pandas always copies data). The closed PR tried to circumvent this by using Arrow's dictionary encoding instead of pandas' groupby and then reconstructing the subset Tables from the resulting indices. Ideally, though, a more Arrow-native solution would be used instead of such workarounds.

To quote @wesm from the PR (GitHub comment):

I want the proper C++ work completed instead. Here are the steps:

  • Struct hash: ARROW-3978
  • Integer argsort: part of ARROW-1566
  • Take function: ARROW-772

These are the three things you need to do the groupby-split operation natively against an Arrow table. One slight complication is implementing "take" against chunked arrays; this could be mitigated by doing the groupby-split at the contiguous record batch level.
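For illustration only, a minimal sketch of that groupby-split using compute kernels available in today's pyarrow (dictionary_encode, equal, indices_nonzero, take), operating per contiguous record batch as suggested above to sidestep the chunked-array complication; this is not the hash/argsort pipeline listed in the quoted steps, and the helper names and partition column are hypothetical:

```python
import pyarrow as pa
import pyarrow.compute as pc


def split_batch(batch: pa.RecordBatch, part_col: str):
    """Yield (partition_value, sub_batch) pairs for one contiguous record batch."""
    # Dictionary-encode the partition column: the dictionary holds the unique
    # keys, and the indices give each row's group id.
    part = batch.column(batch.schema.get_field_index(part_col))
    encoded = part.dictionary_encode()
    keys, codes = encoded.dictionary, encoded.indices
    for group_id in range(len(keys)):
        row_ids = pc.indices_nonzero(pc.equal(codes, group_id))  # row positions of this group
        yield keys[group_id].as_py(), batch.take(row_ids)         # "take" performs the split


def split_table(table: pa.Table, part_col: str):
    """Do the groupby-split at the contiguous record batch level."""
    # Dropping the partition column before writing is omitted for brevity.
    for batch in table.to_batches():
        yield from split_batch(batch, part_col)
```

A caller writing a partitioned dataset would put each yielded sub-batch in its own file under the corresponding part_col=value directory, so a partition value that occurs in several record batches simply produces several files in that directory.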

asfimport commented 4 years ago

Wes McKinney / @wesm: cc @jorisvandenbossche – this will be an important dataset use case