conda-forge / pyarrow-feedstock

A conda-smithy repository for pyarrow.
BSD 3-Clause "New" or "Revised" License

Fastparquet vs pyarrow file size #41

Closed rpanai closed 6 years ago

rpanai commented 6 years ago

I moved from fastparquet to pyarrow after this post. I'm wondering why saving a dataframe to .parq with snappy compression leads to a bigger file when using pyarrow.

I generated a dataframe using the function generate_data from the linked post and saved it to files:

import fastparquet, pyarrow as pa, pyarrow.parquet as pq

pq.write_table(pa.Table.from_pandas(df), 'csv/df_pa.parq', compression='SNAPPY')
fastparquet.write("csv/df_fp.parq", df, compression='SNAPPY')

Using pyarrow the wall time is twice as long as with fastparquet (I guess the culprit is pa.Table.from_pandas), and df_pa.parq's size is 513.7 MB vs 490.2 MB for df_fp.parq. With other dataframes I found an even bigger difference. Is there a way to get better compression?
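To check whether pa.Table.from_pandas really is the culprit, the conversion and the write could be timed separately. A minimal sketch, assuming df is the dataframe produced by generate_data from the linked post and that the csv/ directory exists:

import time
import pyarrow as pa
import pyarrow.parquet as pq

start = time.perf_counter()
table = pa.Table.from_pandas(df)  # pandas -> Arrow conversion
t_convert = time.perf_counter() - start

start = time.perf_counter()
pq.write_table(table, 'csv/df_pa.parq', compression='SNAPPY')  # Parquet write
t_write = time.perf_counter() - start

print(f"from_pandas: {t_convert:.1f}s  write_table: {t_write:.1f}s")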

wesm commented 6 years ago

fastparquet has some non-standard default options compared with other Parquet writers. I would suggest passing use_dictionary=False to disable dictionary encoding; with lots of text data without repetitions, dictionary encoding does not always save space and results in longer encoding times. We have spent much less time optimizing writes than reads, so I would be happy to investigate further if you have a dataset I could look at.
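For example, disabling dictionary encoding on the pyarrow side would look roughly like this (a sketch; df and the output path are placeholders):

import pyarrow as pa
import pyarrow.parquet as pq

# use_dictionary=False turns off dictionary encoding for all columns; for
# high-cardinality text data this can shrink the file and speed up the write.
pq.write_table(
    pa.Table.from_pandas(df),
    'csv/df_pa_nodict.parq',
    compression='SNAPPY',
    use_dictionary=False,
)

use_dictionary also accepts a list of column names if only some columns should keep dictionary encoding.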

wesm commented 6 years ago

Closing this issue. If you would like to debug performance issues further, could you open an issue on the ASF JIRA? https://issues.apache.org/jira/browse/ARROW