Xinli Shang / @shangxinli: @chenjunjiedada Do you know why?
Junjie Chen / @chenjunjiedada: That's the default size of the bloom filter. Please configure `parquet.bloom.filter.max.bytes` to a smaller value that fits your data.
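For anyone landing here later, here is a minimal sketch of the relevant writer knobs, assuming the parquet-mr 1.12+ builder API and the Avro writer; the column name and values are made up. Supplying an expected NDV, or lowering the max bytes as suggested above, should keep a low-cardinality column from getting a filter of the full default maximum size.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class BloomFilterSizingSketch {
  public static void main(String[] args) throws Exception {
    // One string column, roughly the shape of the reported CSV (column name is hypothetical).
    Schema schema = SchemaBuilder.record("row").fields().requiredString("name").endRecord();

    try (ParquetWriter<GenericRecord> writer =
        AvroParquetWriter.<GenericRecord>builder(new Path("out.parquet"))
            .withSchema(schema)
            .withBloomFilterEnabled("name", true)  // enable only for the column that needs it
            .withBloomFilterNDV("name", 14)        // expected distinct values, so the filter can be sized to the data
            .withMaxBloomFilterBytes(64 * 1024)    // builder-side equivalent of parquet.bloom.filter.max.bytes
            .build()) {
      for (int i = 0; i < 14; i++) {
        GenericRecord rec = new GenericData.Record(schema);
        rec.put("name", "value-" + i);
        writer.write(rec);
      }
    }
  }
}
```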
Ze'ev Maor: @chenjunjiedada thanks, that worked. It does seem odd, though, that a *maximum* bloom filter size of 1MB actually results in 1MB being used for a bloom filter on a column with a cardinality of just 14, doesn't it?
Micah Kornfield / @emkornfield: I believe the answer is that the bloom filter implementation isn't adaptive, so it simply preallocates all of the bytes up front. It would certainly be nice to have more adaptive data structures that can scale down for smaller files, but it would probably take a decent amount of work to build consensus around this.
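(Rough arithmetic for this case, using the textbook bloom filter sizing formula rather than anything taken from the Parquet implementation: m = -n·ln(p)/(ln 2)² bits, so for n = 14 distinct values at a 1% false-positive probability that is about 14 × 9.6 ≈ 134 bits, i.e. roughly 17 bytes. A filter sized to the column's NDV would therefore be orders of magnitude smaller than the 1 MiB default maximum the writer falls back to when no NDV is supplied.)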
Converting a small CSV file (14 rows, 1 string column) to Parquet without a bloom filter yields a 600 B file; adding `.withBloomFilterEnabled(true)` to the ParquetWriter then yields a 1,049,197 B file.
It isn't clear what the extra space is used for.
The CSV and the bloated Parquet file are attached.
Reporter: Ze'ev Maor
Note: This issue was originally created as PARQUET-2122. Please see the migration documentation for further details.