apache / parquet-format

Apache Parquet Format
https://parquet.apache.org/
Apache License 2.0

Adding Compression for BloomFilter #408

Open asfimport opened 1 year ago

asfimport commented 1 year ago

In current Parquet implementations, if the ndv (number of distinct values) is not set for a BloomFilter, most implementations assume an ndv of 1M and size the filter from that value together with the fpp (false-positive probability). So, if fpp is 0.01, the BloomFilter size may grow to 2M for each column, which is really huge. Should we support compression for BloomFilter, like:

 


/**
 * The compression used in the Bloom filter.
 **/
struct Uncompressed {}
union BloomFilterCompression {
  1: Uncompressed UNCOMPRESSED;
+ 2: CompressionCodec COMPRESSION;
}
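
For context, the 2M figure in the description can be roughly reproduced with a short calculation. This is a sketch under assumptions that are not stated in the issue itself: it assumes a split-block filter that sets 8 bits per value, a sizing rule of the form `m = -8 * ndv / ln(1 - fpp^(1/8))`, and rounding the result up to a power of two. The function names are illustrative and do not come from any Parquet library; actual implementations may size filters differently.

```python
import math


def optimal_num_bits(ndv: int, fpp: float) -> int:
    """Estimate filter size in bits (assumed sizing rule, 8 bits set per value)."""
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be strictly between 0 and 1")
    return math.ceil(-8 * ndv / math.log(1 - fpp ** (1 / 8)))


def round_up_to_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (assumed allocation policy)."""
    return 1 << (n - 1).bit_length()


ndv, fpp = 1_000_000, 0.01  # the default guess and fpp mentioned in the issue
raw_bits = optimal_num_bits(ndv, fpp)
alloc_bytes = round_up_to_power_of_two(raw_bits) // 8

print(f"unrounded size: {raw_bits / 8 / 2**20:.2f} MiB")
print(f"allocated size: {alloc_bytes / 2**20:.2f} MiB")
```

Under these assumptions the unrounded size falls between 1 MiB and 2 MiB, so the power-of-two rounding yields a 2 MiB allocation per column, which matches the size quoted above.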

Reporter: Xuwei Fu / @mapleFU
Assignee: Xuwei Fu / @mapleFU

Note: This issue was originally created as PARQUET-2256. Please see the migration documentation for further details.

asfimport commented 1 year ago

Gang Wu / @wgtmac: Apache ORC supports compression of bloom filters. It would be nice if we could do something similar. However, I think there is a prerequisite (or at least a highly relevant issue): https://issues.apache.org/jira/browse/PARQUET-2257

asfimport commented 1 year ago

Gabor Szadovszky / @gszadovszky: @mapleFU, would you mind doing some investigation before this update? Let's take the binary data of one of the mentioned 2M bloom filters and compress it with a few codecs to see the gain. If the ratio is good, it might be worth adding this feature. It is also worth mentioning that compressing the bloom filter might hurt filtering performance.
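
The kind of investigation suggested here could be prototyped along the following lines. This is only a sketch: `fake_bloom_filter` is a hypothetical, simplified stand-in that sets 8 pseudo-random bits per inserted value rather than implementing Parquet's actual split-block algorithm, the filter is kept small (64 KiB) so the script runs quickly, and `zlib` stands in for whichever codecs would really be evaluated. A real measurement should use filter bytes taken from actual Parquet files.

```python
import random
import zlib


def fake_bloom_filter(num_bytes: int, num_values: int, seed: int = 0) -> bytes:
    """Simplified stand-in filter: set 8 pseudo-random bits per inserted value."""
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    bits = bytearray(num_bytes)
    total_bits = num_bytes * 8
    for _ in range(num_values * 8):
        pos = rng.randrange(total_bits)
        bits[pos >> 3] |= 1 << (pos & 7)
    return bytes(bits)


def compression_ratio(data: bytes, level: int = 6) -> float:
    """Compressed size divided by original size (lower means better compression)."""
    return len(zlib.compress(data, level)) / len(data)


filter_bytes = 1 << 16  # 64 KiB, scaled down from the 2M case for speed
for num_values in (100, 4096, 32768):
    data = fake_bloom_filter(filter_bytes, num_values)
    print(f"{num_values:>6} values: ratio {compression_ratio(data):.3f}")
```

In this toy setup a filter sized for far more values than it actually holds is mostly zero bits and compresses well, while a well-filled filter looks close to random and barely compresses. That is the case the issue is concerned with: a filter sized from a guessed ndv of 1M but holding far fewer distinct values.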

asfimport commented 1 year ago

Xuwei Fu / @mapleFU: @gszadovszky Yes, I'd like to. I think having compression in the standard doesn't mean we always need to compress. We can do it only when the original BloomFilter occupies a lot of space and compression can save lots of time.