rpanai closed this issue 6 years ago.
fastparquet has some non-standard default options compared with other Parquet writers. I would suggest passing `use_dictionary=False` to disable dictionary encoding; with lots of text data without repetitions, dictionary encoding does not always save space, and it results in longer encoding times. We have spent much less time optimizing writes vs. reads, so I would be happy to investigate further if you have a dataset I could look at.
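A minimal sketch of what that suggestion looks like in practice (`use_dictionary` is a real keyword of `pyarrow.parquet.write_table`; the table here is just a placeholder standing in for the user's data):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder table; any pandas DataFrame converted via
# pa.Table.from_pandas would work the same way.
table = pa.table({"text": ["alpha", "beta", "gamma"] * 1000})

# Disable dictionary encoding for all columns. For mostly-unique text,
# the dictionary adds overhead instead of saving space.
pq.write_table(table, "example.parquet", use_dictionary=False)

# use_dictionary also accepts a list of column names, enabling
# dictionary encoding only for those columns.
pq.write_table(table, "example_mixed.parquet", use_dictionary=["text"])
```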
Closing this issue. If you would like to debug performance issues further, could you open an issue on the ASF JIRA? https://issues.apache.org/jira/browse/ARROW
I moved from fastparquet to pyarrow after this post. I'm wondering why saving a dataframe to `.parq` with `snappy` compression leads to a bigger file when using pyarrow.

I generated a dataframe using the function `generate_data` from the linked post and saved it to files. Using `pyarrow`, the wall time is twice as long as with `fastparquet` (I guess the culprit is `pa.Table.from_pandas`), and `df_pa.parq` is 513.7 MB vs 490.2 MB for `df_fp.parq`. With other dataframes I found an even bigger difference. Is there a way to get better compression?
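For reference, a hedged sketch of the two write paths being compared. The DataFrame below is a stand-in for the post's `generate_data` (not reproduced here); the file names mirror the ones above:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from fastparquet import write as fp_write

# Stand-in for the post's generate_data(); substitute the real function.
df = pd.DataFrame({"text": ["some fairly long string %d" % i for i in range(100_000)]})

# pyarrow path: convert to an Arrow table, then write with snappy compression.
table = pa.Table.from_pandas(df)
pq.write_table(table, "df_pa.parq", compression="snappy")

# fastparquet path: write the DataFrame directly with snappy compression.
fp_write("df_fp.parq", df, compression="SNAPPY")
```

In IPython, prefixing each write with `%time` reproduces the wall-time comparison, and the resulting file sizes can be checked with `os.path.getsize`.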