apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

Expose flexible apis for building bloom filters when writing parquet #5108

Open tjwilson90 opened 8 months ago

tjwilson90 commented 8 months ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I want to generate parquet files that efficiently support prefix, substring, and suffix queries on strings. E.g., given a column of strings, find all strings containing a given query as a substring as quickly as possible.

Currently, as far as I know, this can't be done in the general case without loading all the data and doing an exhaustive scan.

Describe the solution you'd like

What I'd like to be able to do is build a bloom filter for my data consisting of ngrams. E.g., if my column contained the string "Hello, World!", and I was building a bloom filter with ngrams of length 4, I'd want to insert the sequences ["Hell", "ello", "llo,", ..., "rld!"] into a bloom filter. Later, when querying for a substring, I'd check if all ngrams of the same length were in the bloom filter. E.g., to search for "World", I would check whether the bloom filter contained "Worl" and "orld".
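Here is a minimal sketch of that indexing and lookup logic, using a std HashSet as a stand-in for the bloom filter (the function names, the fixed ngram length of 4, and the example strings are just for illustration):

use std::collections::HashSet;

/// Every byte ngram of length `n` in `value` (none if `value` is shorter than `n`).
fn ngrams<'a>(value: &'a str, n: usize) -> impl Iterator<Item = &'a [u8]> + 'a {
    value.as_bytes().windows(n)
}

fn main() {
    // Stand-in for the per-column-chunk bloom filter.
    let mut filter: HashSet<Vec<u8>> = HashSet::new();

    // Indexing: insert every ngram of every value in the column.
    for value in ["Hello, World!", "foo bar"] {
        for gram in ngrams(value, 4) {
            filter.insert(gram.to_vec());
        }
    }

    // Querying: "World" can only be a substring of some value if all of its
    // ngrams ("Worl", "orld") are present. A real bloom filter can still
    // return false positives, so this only prunes; it never proves a match.
    assert!(ngrams("World", 4).all(|gram| filter.contains(gram)));
}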

I don't really want this exact feature implemented in this library, because I want some flexibility to change things (e.g., maybe I want to index additional ngrams containing sentinels for start of word and end of word, or delay sizing the bloom filter until I observe how many distinct ngrams I have), and I don't think the solution described above is appropriate as a general solution for everyone. Instead, I want the ability to customize how bloom filters are built and then implement this logic myself.

I'm not sure exactly how to best implement this, but I imagine something like having a trait

pub trait SbbfBuilder {
    /// Record a value (or bytes derived from it) in the filter being built.
    fn insert<T: AsBytes + ?Sized>(&mut self, value: &T);
    /// Finish and return the filter to be written for the column chunk.
    fn build(self) -> Sbbf;
}

allowing a Box<dyn SbbfBuilder> to be set on ColumnProperties, and changing the bloom_filter: Option<Sbbf> in ColumnValueEncoderImpl and ByteArrayEncoder to instead be an Option<Box<dyn SbbfBuilder>>, would work.
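For illustration, a self-contained sketch of how such a builder might look for the ngram case is below. The Sbbf here is a toy stand-in for the parquet type, the trait is the proposal above rather than an existing API, and since a trait with a generic insert method can't be used as a Box<dyn ...>, this variant takes raw bytes instead:

use std::collections::HashSet;

// Toy stand-in for parquet's Sbbf so the sketch is self-contained; the real
// type is a split block bloom filter.
pub struct Sbbf(HashSet<Vec<u8>>);

impl Sbbf {
    pub fn check(&self, value: &[u8]) -> bool {
        self.0.contains(value)
    }
}

// Dyn-compatible variant of the proposed trait: `insert` takes raw bytes
// (a generic method would prevent using it as `Box<dyn SbbfBuilder>`), and
// `build` takes `Box<Self>` so it can be called through the trait object.
pub trait SbbfBuilder {
    fn insert(&mut self, value: &[u8]);
    fn build(self: Box<Self>) -> Sbbf;
}

// Hypothetical builder that indexes ngrams instead of whole values and
// delays sizing until every distinct ngram has been observed.
pub struct NgramSbbfBuilder {
    ngram_len: usize,
    grams: HashSet<Vec<u8>>,
}

impl SbbfBuilder for NgramSbbfBuilder {
    fn insert(&mut self, value: &[u8]) {
        for gram in value.windows(self.ngram_len) {
            self.grams.insert(gram.to_vec());
        }
    }

    fn build(self: Box<Self>) -> Sbbf {
        // With the real Sbbf this is where the filter would be sized from
        // `self.grams.len()` distinct ngrams and each gram inserted into it.
        Sbbf(self.grams)
    }
}

fn main() {
    let mut builder: Box<dyn SbbfBuilder> =
        Box::new(NgramSbbfBuilder { ngram_len: 4, grams: HashSet::new() });
    builder.insert("Hello, World!".as_bytes());
    let filter = builder.build();
    assert!(filter.check(b"Worl") && filter.check(b"orld"));
}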

tustvold commented 8 months ago

I'm not sure such an approach is compatible with the parquet specification. Have you considered computing bloom filters separately and storing them as part of your catalog, instead of in the parquet files themselves? This would also let you use more sophisticated bloom filters than supported by parquet.

tjwilson90 commented 8 months ago

Do you know where this is specified? In my reading of https://github.com/apache/parquet-format/blob/master/BloomFilter.md I don't see any description of what the bloom filters are supposed to contain.

I certainly could write bloom filters externally; it's just less convenient since 1) it adds extra files that I need to maintain and keep consistent, and 2) generating them would require either an extra read through a parquet file after writing it, or forgoing higher level utilities like ArrowWriter that control how column chunks are sized.

tustvold commented 8 months ago

Do you know where this is specified?

The BloomFilter specification is here. As you note, it never explicitly states what the BloomFilter contains; however, it does state:

In their current format, column statistics and dictionaries can be used for predicate pushdown. Statistics include minimum and maximum value, which can be used to filter out values not in the range. Dictionaries are more specific, and readers can filter out values that are between min and max but not in the dictionary. However, when there are too many distinct values, writers sometimes choose not to add dictionaries because of the extra space they occupy. This leaves columns with large cardinalities and widely separated min and max without support for predicate pushdown.

The implication being that the bloom filter contains the same value data as is used in statistics and dictionaries.

This is also borne out by the implementation in parquet-mr - https://github.com/apache/parquet-mr/blob/452c94d20abda0a83101d00f3b697e110d744942/parquet-column/src/main/java/org/apache/parquet/column/impl/ColumnWriterBase.java#L210

That being said, as BloomFilters approximate a set, you could conceivably include ngrams in addition, but this will necessarily increase the false positive rate for a given filter size.
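To get a rough sense of the cost, the classic approximation for a bloom filter's false positive rate, p ≈ (1 - e^(-kn/m))^k for n distinct keys, m bits, and k hash functions, can be evaluated for some made-up counts (treating the split block filter as an ideal filter with k = 8 is itself only an approximation):

// Classic bloom filter false positive approximation for n distinct keys,
// m bits, and k hash functions: (1 - e^(-k*n/m))^k.
fn fpp(n: f64, m: f64, k: f64) -> f64 {
    (1.0 - (-k * n / m).exp()).powf(k)
}

fn main() {
    let m = (1u64 << 20) as f64; // fixed filter size: 2^20 bits (128 KiB)
    let k = 8.0;
    // Made-up column chunk: 50k distinct values, ~10 distinct ngrams each.
    let values = 50_000.0;
    let values_and_ngrams = values + values * 10.0;
    println!("values only : fpp ≈ {:.4}", fpp(values, m, k));
    println!("plus ngrams : fpp ≈ {:.4}", fpp(values_and_ngrams, m, k));
}

With those made-up counts the estimate jumps from roughly 1e-4 to close to 0.9, which is why a filter holding ngrams would need to be sized for the number of distinct ngrams rather than the number of distinct values.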