apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
14.31k stars 3.48k forks source link

[Python] write_to_dataset poor performance when splitting #19099

Closed asfimport closed 5 years ago

asfimport commented 6 years ago

Hello,

Posting this from github (master @wesm asked for it :) )

https://github.com/apache/arrow/issues/2138

 


import pandas as pd 
import numpy as np 
import pyarrow.parquet as pq 
import pyarrow as pa 

idx = pd.date_range('2017-01-01 12:00:00.000', '2017-03-01 12:00:00.000', freq = 'T') 
dataframe = pd.DataFrame({'numeric_col' : np.random.rand(len(idx)), 
                          'string_col' : pd.util.testing.rands_array(8,len(idx))}, 
                         index = idx)

 


df["dt"] = df.index 
df["dt"] = df["dt"].dt.date 
table = pa.Table.from_pandas(df) 
pq.write_to_dataset(table, root_path='dataset_name', partition_cols=['dt'], flavor='spark')

 

this works but is inefficient memory-wise. The arrow table is a copy of the large pandas daframe and quickly saturates the RAM.

 

Thanks!

Reporter: Olaf / @randomgambit

Related issues:

Note: This issue was originally created as ARROW-2709. Please see the migration documentation for further details.

asfimport commented 5 years ago

Lee June Woo: Hello,

May I ask you simple question about the improvement? I think that It seem to be more efficient to split the pandas dataframe base on "dt" column before converting dataframe to arrow table.

Would you have any plan to implement group-by operation of arrow table or improve write_to_dataset function?

asfimport commented 5 years ago

Wes McKinney / @wesm: We do plan to implement group-by operations on Arrow tables eventually. If you would like to propose some improvements in the meantime, please go right ahead

asfimport commented 5 years ago

Joris Van den Bossche / @jorisvandenbossche: This seems a duplicate of ARROW-2628, so closing this issue (both are about the (memory) performance issues due to the usage of pandas' groupby functionality). I will update the other issue with some of the discussion in the closed PR.

asfimport commented 5 years ago

Wes McKinney / @wesm: Thanks. I hope to see the group-splitting implemented natively against Arrow tables at some point