Open asfimport opened 1 year ago
Gang Wu / @wgtmac: Apache ORC supports compressing bloom filters. It would be nice if we could do something similar. However, I think there is a prerequisite (or at least a highly relevant issue): https://issues.apache.org/jira/browse/PARQUET-2257
Gabor Szadovszky / @gszadovszky: @mapleFU, would you mind doing some investigation before this update? Let's take the binary data of one of the 2M bloom filters you mention and compress it with a few codecs to see the gain. If the ratio is good, it might be worth adding this feature. It is also worth mentioning that compressing the bloom filter might hurt filtering performance.
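A rough way to get a first feel for the experiment suggested above, before pulling a bloom filter out of a real file, is to compress a synthetic bitset. The sketch below is only an approximation under stated assumptions: `synthetic_filter` and `compression_ratio` are hypothetical helper names, each value simply sets 8 uniformly random bits (not the actual split-block layout or hash function), and `zlib` stands in for whichever codecs would really be evaluated.

```python
import random
import zlib


def synthetic_filter(num_bytes, num_values, bits_per_value=8, seed=0):
    """Build a stand-in bitset: every value sets `bits_per_value` random bits.

    Assumption: uniform random bit positions, which only approximates the
    bit density of a real bloom filter, not its exact structure.
    """
    rng = random.Random(seed)
    bitset = bytearray(num_bytes)
    total_bits = num_bytes * 8
    for _ in range(num_values * bits_per_value):
        pos = rng.randrange(total_bits)
        bitset[pos >> 3] |= 1 << (pos & 7)
    return bytes(bitset)


def compression_ratio(data, level=6):
    """Return compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data, level)) / len(data)
```

Comparing `compression_ratio(synthetic_filter(size, few_values))` with `compression_ratio(synthetic_filter(size, many_values))` shows how much the gain depends on how full the filter is: a filter sized for a guessed ndv but holding few distinct values is mostly zero bytes, while a well-filled one looks close to random data. Real measurements on actual filter bytes are still needed to decide anything.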
In current Parquet implementations, if the ndv is not set for a BloomFilter, most implementations will guess 1M as the ndv and use it together with the fpp to size the filter. So, if the fpp is 0.01, the BloomFilter size may grow to 2M for each column, which is really huge. Should we support compression for BloomFilter, like:
Reporter: Xuwei Fu / @mapleFU Assignee: Xuwei Fu / @mapleFU
Note: This issue was originally created as PARQUET-2256. Please see the migration documentation for further details.
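For reference, the 2M figure in the description can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming the usual sizing approximation for split-block bloom filters (8 bits set per value) followed by rounding the bit count up to a power of two. The function names are hypothetical, and any minimum or maximum size limits that a particular implementation applies are ignored here.

```python
import math


def optimal_num_of_bits(ndv, fpp):
    """Approximate bit count for a filter that sets 8 bits per value.

    Assumption: m = -8 * ndv / ln(1 - fpp^(1/8)).
    """
    return math.ceil(-8.0 * ndv / math.log(1.0 - fpp ** (1.0 / 8.0)))


def round_up_to_power_of_two(x):
    """Smallest power of two that is >= x (for x >= 1)."""
    return 1 << (x - 1).bit_length()


def bloom_filter_bytes(ndv, fpp):
    """Filter size in bytes after rounding the bit count to a power of two."""
    return round_up_to_power_of_two(optimal_num_of_bits(ndv, fpp)) // 8
```

With `ndv = 1_000_000` and `fpp = 0.01` the formula asks for roughly 9.7 million bits, which rounds up to 2^24 bits, that is 2 MiB per column.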