apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

Adding a Bloom filter to a small Parquet file bloats its size ~1,700× #2667

Open asfimport opened 2 years ago

asfimport commented 2 years ago

Converting a small CSV file (14 rows, 1 string column) to Parquet without a Bloom filter yields a 600 B file; adding '.withBloomFilterEnabled(true)' to the ParquetWriter builder yields a 1,049,197 B file.

It isn't clear what the extra space is used for.

The CSV and the bloated Parquet file are attached.
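A minimal sketch of the reported setup, for context. Assumptions: parquet-hadoop and its Hadoop dependencies are on the classpath, and the single string column is named `name` (the real column name is in the attached CSV, which isn't shown here). This is an illustration of the report, not the reporter's exact code.

```java
import org.apache.hadoop.fs.Path;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.SimpleGroupFactory;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.example.ExampleParquetWriter;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class BloomBloatRepro {
  public static void main(String[] args) throws Exception {
    // Hypothetical schema standing in for the attached 1-column CSV.
    MessageType schema = MessageTypeParser.parseMessageType(
        "message csv { required binary name (UTF8); }");

    try (ParquetWriter<Group> writer =
        ExampleParquetWriter.builder(new Path("out.parquet"))
            .withType(schema)
            .withBloomFilterEnabled(true) // this is the flag that triggers the ~1 MB growth
            .build()) {
      SimpleGroupFactory factory = new SimpleGroupFactory(schema);
      for (int i = 0; i < 14; i++) {
        writer.write(factory.newGroup().append("name", "row-" + i));
      }
    }
  }
}
```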

Reporter: Ze'ev Maor

Original Issue Attachments:

Note: This issue was originally created as PARQUET-2122. Please see the migration documentation for further details.

asfimport commented 2 years ago

Xinli Shang / @shangxinli: @chenjunjiedada Do you know why?

asfimport commented 2 years ago

Junjie Chen / @chenjunjiedada: That's the default maximum size of the Bloom filter. Please configure 'parquet.bloom.filter.max.bytes' to fit your data.
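The suggestion above can be applied either through the Hadoop configuration key or through the writer builder. A sketch, assuming parquet-hadoop on the classpath; the 1024-byte cap, the column name `name`, and the NDV of 14 are illustrative values for this issue's tiny file:

```java
import org.apache.hadoop.conf.Configuration;

// Option 1: via Hadoop configuration.
Configuration conf = new Configuration();
conf.setInt("parquet.bloom.filter.max.bytes", 1024); // cap the per-column filter size

// Option 2: via the ParquetWriter builder (same writer as in the report):
// builder.withMaxBloomFilterBytes(1024)
//        .withBloomFilterNDV("name", 14); // hint the expected distinct-value count
```

Hinting the expected NDV per column lets the writer size the filter for the actual cardinality instead of falling back to the maximum.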

asfimport commented 2 years ago

Ze'ev Maor: @chenjunjiedada thanks, that worked, though it does seem odd that a MAX size of 1 MB for the Bloom filter actually results in 1 MB being used by a filter on a column with a cardinality of just 14, doesn't it?

asfimport commented 2 years ago

Micah Kornfield / @emkornfield: I believe the answer is that the Bloom filter implementation isn't adaptive, so it simply preallocates all the bytes up front. It would certainly be a nice option to have more adaptive data structures that can scale down for smaller files, but building consensus around that is probably a decent amount of work.
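To put numbers on how far off the preallocated size is for this file: the classic Bloom filter sizing formula gives the bits needed for a target false-positive rate. (Parquet's split-block Bloom filter uses a variant of this formula, so treat the result as an order-of-magnitude estimate; the 1% false-positive rate here is an illustrative choice.)

```java
public class BloomSizing {
  // Classic Bloom filter sizing: m = -n * ln(p) / (ln 2)^2 bits
  // for n distinct values at false-positive probability p.
  static long optimalBits(long n, double p) {
    return (long) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
  }

  public static void main(String[] args) {
    long bits = optimalBits(14, 0.01);   // 14 distinct values, 1% false positives
    long bytes = (bits + 7) / 8;         // round up to whole bytes
    System.out.println(bits + " bits, ~" + bytes + " bytes needed");
    System.out.println("preallocated default: " + (1024 * 1024) + " bytes");
  }
}
```

For 14 distinct values this comes out to roughly 17 bytes, versus the ~1 MiB that gets preallocated under the default cap, which matches the ~1,700× bloat in the report.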