apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

Add union API to BloomFilter interface #2469

Open asfimport opened 4 years ago

asfimport commented 4 years ago

Sometimes one may want to build a file-level bloom filter by taking the union of all row-group bloom filters, in order to save memory. Adding a union API would make this easy to do.

Reporter: Junjie Chen / @chenjunjiedada Assignee: Walid Gara / @garawalid


Note: This issue was originally created as PARQUET-1815. Please see the migration documentation for further details.

asfimport commented 4 years ago

Walid Gara / @garawalid: Thanks for the suggestion, @chenjunjiedada. If you haven't implemented it yet, I already have an implementation and can open a PR; otherwise, I'll review yours. I also thought about the intersection between bloom filters, which could be useful as well.

asfimport commented 4 years ago

Junjie Chen / @chenjunjiedada: I think you can go ahead:)

asfimport commented 4 years ago

Walid Gara / @garawalid: Thanks :)

asfimport commented 4 years ago

Gabor Szadovszky / @gszadovszky: The filters currently implemented in parquet-mr (e.g. dictionary filter, column indexes) are created for internal use. This means that users do not have to care about them: they simply set the filter and get the required values without knowing which filter implementation is dropping the unneeded ones. What is not clear to me in this jira is how the user would benefit from the union of the bloom filters.

asfimport commented 4 years ago

Walid Gara / @garawalid: In parquet-mr, we use bloom filters to filter values. Since we have already computed them and they exist in the footer, they can be exploited beyond internal use. Just by taking the union of all bloom filters in a Parquet file, we can create a single bloom filter with a higher false-positive rate. It can then be used as an index (a kind of metadata) in projects such as Apache Iceberg.
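To make the union idea concrete, here is a minimal sketch of what such an operation amounts to for a plain bitset-based bloom filter. `SimpleBloomFilter` and all of its methods are hypothetical names invented for this illustration; this is not Parquet's `BlockSplitBloomFilter` and does not reproduce its block layout or hashing. The only assumption that matters is that both filters have the same bitset size and use the same hash scheme, in which case the union is a bitwise OR of the two bitsets.

```java
// Hypothetical, simplified bloom filter used only to illustrate union.
// NOT Parquet's BlockSplitBloomFilter; the hashing here is illustrative.
final class SimpleBloomFilter {
    private final long[] words; // bitset, 64 bits per word
    private final int numBits;

    SimpleBloomFilter(int numBits) {
        if (numBits <= 0 || numBits % 64 != 0) {
            throw new IllegalArgumentException("numBits must be a positive multiple of 64");
        }
        this.numBits = numBits;
        this.words = new long[numBits / 64];
    }

    // Bit-mixing step (splitmix64-style finalizer) to spread the hash bits.
    private static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    // Derive two bit positions from one 64-bit hash value.
    private int[] positions(long hash) {
        long h1 = mix(hash);
        long h2 = mix(h1 ^ 0x9E3779B97F4A7C15L);
        return new int[] {
            (int) Long.remainderUnsigned(h1, numBits),
            (int) Long.remainderUnsigned(h2, numBits)
        };
    }

    void insert(long hash) {
        for (int p : positions(hash)) {
            words[p >>> 6] |= 1L << (p & 63);
        }
    }

    boolean mightContain(long hash) {
        for (int p : positions(hash)) {
            if ((words[p >>> 6] & (1L << (p & 63))) == 0) {
                return false;
            }
        }
        return true;
    }

    // Union in place: afterwards this filter answers "might contain" for
    // every value inserted into either filter, at the cost of more set bits
    // and therefore a higher false-positive rate.
    void union(SimpleBloomFilter other) {
        if (other.numBits != numBits) {
            throw new IllegalArgumentException("bitset sizes differ; filters cannot be merged");
        }
        for (int i = 0; i < words.length; i++) {
            words[i] |= other.words[i];
        }
    }
}
```

Merging one filter per row group into a single file-level filter would then be a loop of `union` calls over filters of identical size. Filters of different sizes cannot be combined this way, which is why the sketch rejects them.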

This is just one simple use case; you can find more, such as bloom joins, in the paper "Role of Bloom Filter in Big Data Research: A Survey".

asfimport commented 4 years ago

Gabor Szadovszky / @gszadovszky: If one would like to use bloom filters outside the scope of parquet-mr (e.g. to union the bloom filters of several files for a partition of a table), then I think providing the interface for the bloom filter is not a good idea. For example, Iceberg supports the file formats Avro, Parquet and ORC, and ORC also has its own implementation of bloom filters. If we would like to support this example scenario in Iceberg, it would be better to use a common interface for bloom filters that is not part of the Parquet API.

I am not against implementing this functionality in parquet-mr (it is not complex anyway); I am just missing a use case, and I think it is a bit early to implement such functionality without a driving one.

asfimport commented 4 years ago

Walid Gara / @garawalid: I see it differently. Performing union and intersection depends on the implementation of the bloom filter (in our case, BlockSplitBloomFilter). With an API, users don't need to understand the internal implementation; they can use it directly.
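As a rough illustration of why these operations are tied to the filter's representation, the sketch below intersects two filters given only their raw bitsets. `BitsetFilterOps` and `intersect` are hypothetical names, and the code assumes both bitsets were produced by the same implementation with the same size and hash functions, which is something only the implementation itself can guarantee. Under that assumption a bitwise AND keeps every bit that a value present in both inputs would have set, so the result has no false negatives for the common values, although it may report more false positives than a filter built directly from them.

```java
// Hypothetical helper, not part of any Parquet API.
final class BitsetFilterOps {
    private BitsetFilterOps() {}

    // Intersect two bloom filter bitsets of identical layout with bitwise AND.
    static long[] intersect(long[] a, long[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("bitset sizes differ; filters are not compatible");
        }
        long[] out = new long[a.length];
        for (int i = 0; i < a.length; i++) {
            out[i] = a[i] & b[i];
        }
        return out;
    }
}
```

The size check is the part a caller cannot get right without knowing the internals, which is the argument for exposing the operation on the filter itself.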