Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

Add bloom filters to parquet statistics #1468

Closed asfimport closed 4 years ago

asfimport commented 10 years ago

For row groups with no dictionary, we could still produce a bloom filter. This could be very useful in filtering entire row groups. Pull request: https://github.com/apache/parquet-mr/pull/215

Reporter: Alex Levenson / @isnotinvain Assignee: Junjie Chen / @chenjunjiedada

Subtasks:

Note: This issue was originally created as PARQUET-41. Please see the migration documentation for further details.

asfimport commented 9 years ago

Alex Levenson / @isnotinvain: This will require some planning and discussion around parquet-format. We need to choose how we will serialize a bloom filter into parquet statistics, how its size will be configured, etc. But ultimately it needs to be a bloom filter that can be easily deserialized from C++, and it needs to be durable (so java.io.Serializable or anything similar won't work here).

asfimport commented 9 years ago

Daniel Weeks / @danielcweeks: While it would be great to have them be compatible between java/C++/etc, that shouldn't necessarily be a requirement. A bloom filter is an optimization, so if the reader doesn't know how to interpret it, it can simply be ignored.

asfimport commented 9 years ago

Antonios Chalkiopoulos / @Antwnis: I don't know if it helps, but in Scala, using Algebird, when I want to serialize a Bloom Filter into a file and then be able to re-read it from another job, I do:

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

def serialize(dataStructure: Any): Array[Byte] = {
  val stream = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(stream)
  out.writeObject(dataStructure)
  out.close()
  stream.close()
  stream.toByteArray
}

def deserialize[T](byteArray: Array[Byte]): T = {
  val is = new ObjectInputStream(new ByteArrayInputStream(byteArray))
  is.readObject().asInstanceOf[T]
}
```

and then, from e.g. Scalding:

```scala
pipe.mapTo('bloom -> 'serialized) { bf: BF => io.scalding.approximations.Utils.serialize(bf) }.write( ... )
pipe.mapTo('bloom -> 'serialized) { bf: BF => new String(io.scalding.approximations.Utils.serialize(bf)) }.write( ... )
```

Where BF is actually com.twitter.algebird.BF

asfimport commented 9 years ago

Alex Levenson / @isnotinvain: If the data for the bloom filter is going to be stored in the parquet-format thrift statistics schema, I think it should be a well-defined format that is not specific to Java (for example, the format should not be whatever comes out of an ObjectOutputStream). I think it's fine that the two runtimes support different features, but so far we have maintained binary compatibility between the two.

Even if we didn't care about interoperability, we still need a well defined binary format for the bloom filter that is durable over time, unlike plain java.io serialization.

asfimport commented 9 years ago

Alex Levenson / @isnotinvain: Right, so while this works for temporary storage / transport of java objects, it's not a durable format with guaranteed stability over time. If the algebird bloom filter changes later, it will serialize differently, etc. So I think we need to still define a thrift / binary format for storing the bloom filter.

asfimport commented 9 years ago

Alex Levenson / @isnotinvain: To clarify, I'm not saying the C++ has to use the bloom filter at runtime right away, but we should not use a binary format that would make it impossible / difficult for the C++ impl to use it if it wanted to. And I think it would also be bad news if each impl serialized bloom filters in their own way.

asfimport commented 9 years ago

Ferdinand Xu / @winningsix: Hi @isnotinvain and @Antwnis, thank you for your comments. I'd like to give a short summary of the discussion: we should define the bloom filter with Thrift rather than rely on some unstable framework library. Regarding the size of the bloom filter, I think we should configure it from the upper level, since the user has a better understanding of the column's cardinality. Any thoughts about the size configuration?

asfimport commented 9 years ago

Julien Le Dem / @julienledem: The first step is to specify how we would store the Bloom Filter in parquet-format. As @isnotinvain mentioned, it should be defined at the binary level. A bloom filter is just a byte array (or possibly a few), plus probably the spec of the hash functions used, so it should not be that hard to define.

Alternatively there are custom properties that can be added for non standard things but I would recommend going the way defined above for bloom filters as they are a standard we want to define.

listing a few folks that should probably review the format for this (from Impala, Drill, SparkSQL): [~marcelk] @nongli [~pipfiddle] @jaltekruse @rdblue [~marmbrus]

asfimport commented 9 years ago

Ryan Blue / @rdblue:

As Alex Levenson mentioned it should be defined at the binary level . . . and probably the spec of the hash functions used so it should not be that hard to define.

+1

asfimport commented 9 years ago

Alex Levenson / @isnotinvain: Yes, I was just thinking, the hash functions for each data type need to be well defined too but Julien beat me to it :)

So I would say next steps are: 1) Define a binary format for serializing a bloom filter into a byte array, including any configuration data (like size), and also discuss whether that should be global (stored once) or repeated into each bloom filter.

2) Define the hash function used for each data type

3) Create a java implementation of the above, possibly borrowing code snippets from Algebird or Guava
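To make steps 1 and 3 a bit more concrete, here is a minimal, hypothetical Java sketch (not the format proposed in any PR): the filter is just a bit array plus a hash-function count, with the k probe positions derived from two caller-supplied base hashes by double hashing. The class name, fields, and hashing choices are assumptions for illustration; a real spec would pin down a cross-language hash algorithm.

```java
import java.util.BitSet;

// Illustrative only: a minimal fixed-layout bloom filter. The two base hashes
// (h1, h2) are supplied by the caller; a real format would also specify exactly
// how they are computed for each data type.
public class SimpleBloomFilter {
  private final BitSet bits;
  private final int numBits;
  private final int numHashes;

  public SimpleBloomFilter(int numBits, int numHashes) {
    this.numBits = numBits;
    this.numHashes = numHashes;
    this.bits = new BitSet(numBits);
  }

  // Derive the i-th probe position from two base hashes (double hashing).
  private int position(long h1, long h2, int i) {
    long combined = h1 + i * h2;
    return (int) Math.floorMod(combined, (long) numBits);
  }

  public void insert(long h1, long h2) {
    for (int i = 0; i < numHashes; i++) {
      bits.set(position(h1, h2, i));
    }
  }

  public boolean mightContain(long h1, long h2) {
    for (int i = 0; i < numHashes; i++) {
      if (!bits.get(position(h1, h2, i))) {
        return false;
      }
    }
    return true;
  }

  // A durable encoding would store numBits, numHashes, a hash-algorithm id,
  // and these raw bytes -- not Java object serialization.
  public byte[] toBytes() {
    return bits.toByteArray();
  }
}
```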

asfimport commented 9 years ago

Jacques Nadeau / @jacques-n: Agree with @isnotinvain, @julienledem, and @rdblue. We need a non-language-specific binary specification. I'd also argue that we need to see highly performant proof-of-concept code in C and Java before we can decide that a particular specification is acceptable.

asfimport commented 9 years ago

Prateek Rungta: ORC implemented bloom filters recently. Their design doc - 1 - is worth a gander.

asfimport commented 9 years ago

Ryan Blue / @rdblue: Thanks [~prateekrungta]! Looks like I need to go read "Less Hashing, Same Performance: Building a Better Bloom Filter" by Kirsch et al. now.

asfimport commented 9 years ago

Ryan Blue / @rdblue: @winningsix does the info about ORC or the paper help at all?

asfimport commented 9 years ago

Ferdinand Xu / @winningsix: Thanks @rdblue and [~prateekrungta] for the information, I will take time to look at it.

asfimport commented 9 years ago

Ferdinand Xu / @winningsix: Hi guys, the pull request for parquet-mr is located at https://github.com/apache/parquet-mr/pull/215 and the one for parquet-format is at https://github.com/apache/parquet-format/pull/28. Currently I have only added support for the integer type; the unit tests and the Hive-side tests pass. The parquet-format PR defines the data structure. Please help me review these two PRs. Thank you!

asfimport commented 9 years ago

Ryan Blue / @rdblue: Great, thanks @winningsix! Could you also tell us a bit more about how this works and the approach you're taking? At first glance, we need quite a bit more in the format to specify exactly what the structure means and how to use it. It would be good to discuss that here, too.

asfimport commented 9 years ago

Jacques Nadeau / @jacques-n: Agreed. I think a good discussion and design review around this is critical given the importance of this feature.

asfimport commented 9 years ago

Ferdinand Xu / @winningsix: Sure. For the basic background on bloom filters, please refer to the Wikipedia article (bloom filter) and the discussion above. As I already have a patch available for this feature, I'd like to explain my design based on my code.

**Background on bloom filters**

In short, a bloom filter maintains a bit set with all bits initially set to 0. When an entry is added to the filter, the hash functions select several bits and set them to 1. Using the same hash functions, we can then check whether an entry may already be present: if all of its corresponding bits are set to 1, it is possibly in the set. In this way we can use the bloom filter to skip data at the block level in Parquet. From Wikipedia, the definition of a bloom filter is as follows:

An empty Bloom filter is a bit array of m bits, all set to 0. There must also be k different hash functions defined, each of which maps or hashes some set element to one of the m array positions with a uniform random distribution.

To construct a bloom filter, two parameters are required:

  • m: Number of bits
  • k: Number of hash functions

These two parameters can be approximated from two other parameters, the expected number of entries and the desired false positive probability (the formulas are available in the Wikipedia article mentioned above).
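For reference, the standard approximations from the bloom filter literature (added here for convenience; they are not part of the original comment) express m and k in terms of the expected number of entries n and the desired false positive probability p:

```latex
m = -\frac{n \ln p}{(\ln 2)^2}, \qquad k = \frac{m}{n}\ln 2 \approx -\log_2 p
```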

**Building a bloom filter**

To build a bloom filter, users need to add four configurations:

These configurations are grouped into BloomFilterOpts. The class BloomFilterBuilder uses a BloomFilterOpts to build a BloomFilter. The bloom filter is supported only for some of the basic data types; for example, it doesn't make sense to enable a bloom filter for a boolean type. To address this, the Statistics classes that support it implement the BloomFilterStatistics interface.

**How the bloom filter works in Parquet**

The bloom filter is part of the statistics, like the min/max statistics. For an EQ operation, you can check whether an entry is included in the row group; see the code in StatisticsFilter. If the value isn't within the min/max range or isn't included in the bloom filter, the row group will be dropped.

**How to enable it on the Hive side**

You can refer to the code in HIVE-9260. Besides the bloom filter related configuration, you need to add the parameters that enable predicate pushdown as well:

```
SET hive.optimize.ppd=true;
SET hive.optimize.index.filter=true;
```

**ARs on my side for the next development plan**
- Update the design document in statistics.md
- Add more JavaDoc for some common APIs
- Fix some code style issues mentioned in the PRs
- If there are no further comments on the present patch, add support for some other data types.

Feel free to raise more questions about my patch. Thank you!
asfimport commented 9 years ago

Ferdinand Xu / @winningsix: Hi, any suggestions or comments about my current solution? I'm also thinking about using the Bloom Filter API from Guava instead of implementing it on our own: <http://docs.guava-libraries.googlecode.com/git-history/release/javadoc/com/google/common/hash/BloomFilter.html#create(com.google.common.hash.Funnel, int, double)>. As a first step we should finalize what we store in parquet-format. If we go with Guava, we would store the expected insertions and the false positive probability, which could differ from the current solution.
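As a reference point, here is a minimal sketch of the Guava API being discussed (the usage is illustrative; the expected-insertions and FPP numbers are placeholders, not values anyone proposed):

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

public class GuavaBloomFilterExample {
  public static void main(String[] args) {
    // Placeholder sizing: 100,000 expected insertions, 1% false positive probability.
    BloomFilter<Long> filter = BloomFilter.create(Funnels.longFunnel(), 100_000, 0.01);
    filter.put(42L);
    System.out.println(filter.mightContain(42L)); // true
    System.out.println(filter.mightContain(7L));  // false, barring a false positive
  }
}
```

Note that Guava's internal bit layout and hashing are implementation details of the library, which is exactly the durability concern raised earlier in the thread.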

With regard to the comments from [~dwhite], we could put the discussion of multi-strategy support here. We could also discuss how we achieve the fallback for the bloom filter, as @spena suggests.

Thank you!

asfimport commented 9 years ago

Prateek Rungta: Hey @winningsix,

I did a quick glance through the code and it looks pretty good to me. A few points:

-prateek

asfimport commented 9 years ago

Jason Altekruse / @jaltekruse: I did not get a chance to look through the code yet, but here is one possible consideration with regard to Hive. For their ACID support, it would be useful to allow the bloom filter to be modified when a record is deleted, before a compaction of the original data and the diff file. If you design the bloom filter as an array of integers instead of an array of bits, you can remove elements from it as well as add them, because each position stores the number of hashes that landed there rather than a flag saying at least one did. This would allow Hive, and other users of Parquet that may want to implement update/delete, to subtract deleted or changed elements out of the bloom filter.
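A rough sketch of the counting variant described above (illustrative only, not part of any Parquet proposal; the class name, caller-supplied hashes, and double-hashing scheme are assumptions): each position holds a small counter instead of a bit, so a deletion can decrement the counts that an insertion incremented.

```java
// Illustrative counting bloom filter: counters instead of bits allow removal.
public class CountingBloomFilter {
  private final int[] counts;
  private final int numHashes;

  public CountingBloomFilter(int numCells, int numHashes) {
    this.counts = new int[numCells];
    this.numHashes = numHashes;
  }

  private int position(long h1, long h2, int i) {
    return (int) Math.floorMod(h1 + i * h2, (long) counts.length);
  }

  public void add(long h1, long h2) {
    for (int i = 0; i < numHashes; i++) {
      counts[position(h1, h2, i)]++;
    }
  }

  // Only valid for elements that were actually added; removing anything else
  // corrupts the counters.
  public void remove(long h1, long h2) {
    for (int i = 0; i < numHashes; i++) {
      counts[position(h1, h2, i)]--;
    }
  }

  public boolean mightContain(long h1, long h2) {
    for (int i = 0; i < numHashes; i++) {
      if (counts[position(h1, h2, i)] == 0) {
        return false;
      }
    }
    return true;
  }
}
```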

asfimport commented 9 years ago

Ryan Blue / @rdblue: Sorry for the delay, @winningsix, I'll have more feedback for you today.

@jaltekruse, it isn't possible to remove a record from a bloom filter, unless you recompute the filter without it. The k positions for each element overlap, which is why you get a great size reduction. Unfortunately, that means you can't unset one without possibly changing the inclusion judgement for all the other inserted elements that overlapped that bit position.

asfimport commented 9 years ago

Jason Altekruse / @jaltekruse: If you store a bit array, that is true. There is a related data structure (I don't recall its exact name) that does allow for deletions. The trade-off is storage size, but that is tunable, just as the regular bloom filter trades size against false-positive rate.

http://www.ics.uci.edu/~jsimons/slides/seminar/IBF.pdf

asfimport commented 9 years ago

Jason Altekruse / @jaltekruse: This might have been a little confusing, I was not proposing that Hive could re-write the filter in place, but instead store a bloom filter with the appropriate additions and deletions that could be combined together with the one in the original file. With the integer array this would be as simple as a loop to add together all of the corresponding indexes, where some may be negative in the diff to indicate that several values that hashed to that position had been deleted.

asfimport commented 9 years ago

Ryan Blue / @rdblue: Interesting, I hadn't heard about the counting bloom filters. But as I think a bit more about how the Hive ACID stuff works, I don't think it would help.

The base file is rewritten periodically to incorporate changes stored in the current set of deltas. That would rewrite the bloom filter from scratch, so there is no need for it to be reversible. Then if you're applying a delta on top of the base file, you only need to apply the filters to your delta because those rows entirely replace rows in the base. In that case, you have a static bloom filter per delta file and static bloom filters in the base file, too.

asfimport commented 9 years ago

Jason Altekruse / @jaltekruse: If rewriting happens frequently enough there may not be a need for it. However I think there are some cases that can not be completely solved by two independent bit filters in the base and delta files. Additions can certainly work, but I don't think it is possible to incorporate deletes or updates with this strategy. The bloom filter for your base file will return true when you look up a deleted record.

Even if the delta file absolutely replaces rows from your base file, there is no association between a result of the bloom filter lookup and a row number. You could not know that a particular update in the delta file relates to a lookup in the base file bloom filter.

asfimport commented 9 years ago

Nezih Yigitbasi / @nezihyigitbasi: @rdblue I believe @jaltekruse is right that to support element removal (without creating the filter from scratch) one needs to keep an integer/counter array, then you get what's called a counting bloom filter (see https://en.wikipedia.org/wiki/Bloom_filter#Counting_filters)

asfimport commented 9 years ago

Nezih Yigitbasi / @nezihyigitbasi: Not sure we need a counting bloom filter to support Hive acid tx (at least for now). The base file is updated with the deltas with some frequency (called major compaction, frequency depends on the hive.compactor.delta.pct.threshold config parameter) and the bloom filter of the base file will get rewritten with major compaction. I guess it's OK for the bloom filter to return true when a delta file has a delete for that particular record as currently Hive only supports snapshot isolation (that is, if another tx deleted that row after the first tx started).

asfimport commented 9 years ago

Ferdinand Xu / @winningsix: Hi @nezihyigitbasi @jaltekruse, I don't think we need to handle the deletion case for the bloom filter when it comes to an ACID table. For ACID in Hive, we should consider three things: read, write, and compaction (merge). The read and write operations don't require deletion support; for reads, the Hive merge reader builds a user-visible view of the data from the base and delta files. For compaction there are two kinds, minor and major. Minor compaction compacts several delta files into a single delta file (delta_x_y, where x is the begin and y is the end transaction id), so it produces a new file. Major compaction compacts the delta files into the base file (base_z, where z stands for the highest transaction id compacted). So we should generate a new bloom filter instead of merging the previous ones (the compactor is in fact writing another file). Any thoughts about this @spena @rdblue?

Besides, I read again about the configuration for bloom filter support in ORC. It seems we could add one more configuration to specify which columns the bloom filter is applied to.

asfimport commented 9 years ago

Ferdinand Xu / @winningsix: Hi, I have updated the PR to support multiple bloom filter strategies. Please help me review it. Thank you! See https://github.com/winningsix/parquet-format/commit/21ecbb245b763e61d51ffdfec85882b8c924cc9d

Yours, Ferd

asfimport commented 9 years ago

Ryan Blue / @rdblue: Are you saying you'd have a bloom filter in the diff that gets subtracted from the bloom filter in the base file while reading?

If so, I don't think that works. You'd have to know what records are in a particular page, calculate the bloom filter diff, and then use it. I don't think that would be worth it.

asfimport commented 9 years ago

Ryan Blue / @rdblue: I don't think the counting bloom filter idea is worth the increased size or the work to make it happen, when the trade-off is a false-positive. The ACID support will periodically rebuild the bloom filters anyway, so we're only talking about false positives for data in the delta files, which we expect to be small.

asfimport commented 9 years ago

Ryan Blue / @rdblue: Thanks for working on this, @winningsix, it's great to be making some good progress on it. This is getting to be a pretty long comment. I don't have all that many conclusions, but I wanted to share some observations to start a discussion around how this feature should be done.

I've mostly been thinking lately about the bloom filter configuration. I like that FPP is a user setting because the query patterns really affect what value you want for it. You can get much better space savings with a high FPP if you know that typical queries will only look for a few items.

We can think of FPP as the probability that we will have to read a data page even though it doesn't actually have the item we are looking for. That is multiplied by the number of items in a query, which could be large but I think will generally be less than ~10 elements (for basing a default). That puts a general upper limit on the FPP because if it is something too high, like 10%, a fair number of queries will end up reading unnecessary data with a 50+% probability (anything checking for 7 or more unique items).
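As a worked form of that reasoning (standard probability, not taken from the spreadsheet): if a single lookup has false-positive probability p and a query checks n distinct items that are absent from a page, the chance the page is read unnecessarily is

```latex
P(\text{at least one false positive}) = 1 - (1 - p)^{n}
```

With p = 0.10 and n = 7 this is 1 - 0.9^7 ≈ 0.52, which matches the 50+% figure above.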

I think we should have a way to read the page stats without the filter, since they can be pretty big. I took a look at a real-world dataset with 8-byte timestamps that are ~75% unique, which put the expected filter size for a 2.5% false-positive rate at 9% of the block size. If I'm looking for 32 timestamps at once, I have an 80% chance of reading pages I don't need to read, and end up reading an extra 9% for every page's bloom filter alone.

I don't think we want a setting for the expected number of entries. For one thing, this varies widely across pages. I have a dataset with 20-30 values per page in one column and 131,000 values per page in another. A setting for all columns will definitely be a problem and I don't think we can trust users to set this correctly for their data on every column.

We also don't know much about how many unique values are in a column or how that column will compress with the encodings. Bloom filters are surprisingly expensive in terms of space considering some of the encoding sizes we can get in Parquet. For example, if we have a column where delta integer encoding is doing a good job, values might be ~2 bytes each. If the column is 75% unique, then even a 10% FPP will create a bloom filter that is ~22.5% of the page size, and a 1% FPP is ~44.9% of the page size. Compared with a less effective encoding, 8 bytes per value ends up being ~11.2% of the page size for a 1% FPP, which is still significant. As encoding gets better, pages have more values and the bloom filter needs to be larger.
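As a quick check of those percentages (standard sizing math, not quoted from the comment): an optimally sized filter needs roughly

```latex
\frac{m}{n} = \frac{-\ln p}{(\ln 2)^2} \approx 1.44 \log_2 \frac{1}{p} \ \text{bits per distinct value}
```

which is about 4.8 bits (0.6 bytes) at 10% FPP and 9.6 bits (1.2 bytes) at 1% FPP; multiplying by 75% uniqueness and dividing by the encoded bytes per value reproduces the ~22.5%, ~44.9%, and ~11.2% figures.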

Without knowing the percentage of unique values or the encoding size, choosing the expected number of values for a page is impossible. Because of the potential size of the filters compared to the page size, over-estimating the filter size isn't enough: we don't want something 10% of the page size or larger. That means that if we chose an estimate for the number of values, we would still end up overloading filters fairly often. I took a look at the false-positive probability for overloaded filters: if a filter is 125% loaded, then the actual false-positive probability at least doubles, and for an original 1% FPP, it triples. It gets much worse as the overloading increases: 200% loaded results in a 9% actual FPP based on a 1% original FPP. Keep in mind that the expected overloading is probably not as low as 200% given that the number of values per page can vary from tens to tens of thousands.

I think there are 2 approaches to fixing this. First, there's a paper, Scalable Bloom Filters, that has a strategy of using a series of bloom filters so you don't have to know the size in advance. It's a good paper, but we would want to change the heuristics for growing the filter because we know when we are getting close to the total number of elements in the page. Another drawback is that it uses a series of filters, so testing for an element has to be done in each filter.

I think a second approach is to keep the data in memory until we have enough to determine the properties of the bloom filter. This would only need to be done for the first few pages, while memory consumption is still small. We could keep the hashed values instead of the actual data to get the size down to a set of integers that will be approximately the number of uniques items in the page (minus collisions). I like this option better because it is all on the write side and trades a reasonable amount of memory for a more complicated filter. The read side would be as it is now.
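A minimal sketch of that write-side idea (all names are hypothetical and the hashing scheme is an assumption; it reuses the SimpleBloomFilter sketch from earlier in the thread rather than any patch code): buffer the 64-bit value hashes for a page, then size and populate the filter once the page's contents are known.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: defer bloom filter construction until the page is complete,
// sizing it from the observed number of distinct hashes.
public class DeferredPageBloomFilterWriter {
  private final List<Long> bufferedHashes = new ArrayList<>();
  private final double targetFpp;

  public DeferredPageBloomFilterWriter(double targetFpp) {
    this.targetFpp = targetFpp;
  }

  public void addValueHash(long hash64) {
    bufferedHashes.add(hash64);
  }

  public SimpleBloomFilter finishPage() {
    long distinct = Math.max(1L, bufferedHashes.stream().distinct().count());
    // Standard sizing: m = -n ln(p) / (ln 2)^2, k = (m / n) ln 2.
    int numBits = (int) Math.ceil(-distinct * Math.log(targetFpp) / (Math.log(2) * Math.log(2)));
    int numHashes = Math.max(1, (int) Math.round((double) numBits / distinct * Math.log(2)));
    SimpleBloomFilter filter = new SimpleBloomFilter(numBits, numHashes);
    for (long h : bufferedHashes) {
      // Split the 64-bit hash into two halves for double hashing (an assumption).
      filter.insert(h & 0xFFFFFFFFL, h >>> 32);
    }
    bufferedHashes.clear();
    return filter;
  }
}
```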

Okay, this is long enough. I'll clean up the spreadsheet I'm basing my numbers on and share it tomorrow.

asfimport commented 9 years ago

Ryan Blue / @rdblue: I just posted a Google spreadsheet to back up the numbers in the above comment. It has 2 pages:

  1. Filter size compared to encoded data size by FPP and expected uniqueness
  2. Effective FPP when a filter is overloaded

It would be great to have some people double-check those calculations and my conclusions above.

asfimport commented 9 years ago

Ferdinand Xu / @winningsix: Hi @rdblue, I really appreciate your detailed comments and the concrete data. To make sure I follow your points, I'd like to give a short summary first. The current solution/design has two drawbacks. The first concerns space efficiency: according to the calculations in the sheet, the bloom filter bit set will occupy much more space than expected. The second is how to obtain the setting for the expected number of entries. For the first one, I am thinking about adding a header (a kind of statistics header), as the dictionary does; we could create a map-like data structure with the data page as key and the bloom filter as value. This is just a rough idea and more investigation is needed. With regard to the setting of the expected number, I like your second idea too: we could obtain it at runtime and write it to the bloom filter when the flush happens. Any thoughts here?

Thank you! Ferd

asfimport commented 9 years ago

Ferdinand Xu / @winningsix: I checked some entries on the 1st page and got the same results as you. Do you mind adding some explanation for the second page? I don't follow that part. Thank you!

asfimport commented 9 years ago

Ryan Blue / @rdblue: No problem.

The second page calculates the effective false-positive probability given a filter of some size and an amount of overloading. The first table calculates the size of a bloom filter for a given FPP and number of expected values. The second table to the right of it shows the actual FPP for all of the filter sizes on the left if they are overloaded by the overloading factor, the green box just below.

For example, the table on the left calculates a size for storing 512 values with a 1% FPP: 614 bytes. The table on the right then multiplies the number of values by the overloading factor: 512 * 1.25 = 640. Then assuming we stored 640 values in that 614 byte filter, it calculates that the actual FPP will be 2.5% instead of the 1% FPP we wanted.

This shows that we need to base the size of a filter on the actual number of values stored. Like I said above, overloading a 1% filter with 125% of its capacity results in a 2.5% actual FPP. 200% load results in a 10% actual FPP. And the actual expectation is that the capacity we guess would be off by an order of magnitude, not just double.
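For reference, the standard formula behind that effect (general bloom filter math, not the spreadsheet's exact calculation): a filter with m bits and k hash functions holding n' elements has an effective false-positive probability of roughly

```latex
p' \approx \left(1 - e^{-k n' / m}\right)^{k}
```

Plugging in the example above (m = 614 bytes = 4912 bits, n' = 640, and assuming the optimal k = 7 for the original sizing) gives p' ≈ 2.7%, in the same range as the ~2.5% the spreadsheet reports, and p' grows quickly as the load increases further.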

asfimport commented 9 years ago

Ryan Blue / @rdblue: I should also point out there's a table on the first page that calculates the probability of at least one false-positive when querying multiple items. That's pretty useful to apply here. If we are querying for 10 items and the bloom filter says it is 1%, then there is a 9.56% chance of reading a page when it has none of the items. But if the actual FPP of that filter is 10% because of overloading, then we get a 65% probability when we were expecting that 9.56%.

asfimport commented 9 years ago

Ferdinand Xu / @winningsix: Hi @rdblue, I have some thoughts on the bloom filter's space efficiency. First, I think we should define at which level the bloom filter takes effect. The bloom filter is a complement to the dictionary. At the page level, we already have the dictionary page, which helps us filter data pages. At the upper level, we could use the bloom filter to filter a column chunk without parsing the dictionary page. To serve this purpose, we could change the current implementation. Right now the bloom filter statistics are part of the statistics stored with the data page header, which is not a good design since it uses more space than expected. So I am thinking about making the bloom filter statistics part of the ColumnChunk instead. One extra benefit is that we can postpone constructing the bloom filter: we can build it in the flush method, at which point we have a better understanding of what the data looks like (how many unique values there are). Any suggestions on this? We could have several rounds of discussion and do the POC work once that's settled.

Regards, Ferd

asfimport commented 9 years ago

Ryan Blue / @rdblue: @winningsix, I think you're right about the conclusions from my last note:

  1. The filter is larger than expected and might not be worth storing
  2. We need a way to avoid over-filling the filter

For the first, we need to plan how to determine whether or not a bloom filter should be written. For the second, I think the solution I proposed is a good enough estimate, though we will probably want rules for detecting early that the bloom filter will not be needed, so we can fail fast. That way we can avoid the memory overhead of keeping those hashes around.

I've been thinking more about where we will put this data in the format and whether the data will be per page or per row group. We definitely want to be able to skip entire row groups. But if the filter is large, it would be nice to be able to skip pages as well because one positive check would require reading the whole row group, even if the value is in a single page. I think it makes sense to use page-level filters but keep them at the column-chunk level (index page?) so they can be used for both filtering row groups and pages.

I've added another page to the bloom filter spreadsheet for some calculations around this, in the "row group filters" page. I compared 2 strategies:

  1. The idea from the Scalable Bloom Filters paper to geometrically decrease the FPP
  2. To estimate the number of pages and calculate a uniform per-page FPP that results in an overall row group FPP (page-FPP = 1 - (1 - row-group-FPP)^(1/num-pages))

The second strategy is the clear winner. The FPP for pages is the same for all pages in a row group, you get an overall FPP for the row group, and the size required is under 3x the size of one big bloom filter for the row group. This also performs much better (space and FPP) than the scalable bloom filters strategy when there are few pages. I think the drawback is that this will limit the ideal number of pages per row group.
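For completeness, the reasoning behind the second strategy's formula (a standard independence assumption, not a quote from the spreadsheet): if each of the N page filters has false-positive probability p_page, an absent value causes an unnecessary read when any page filter fires, so

```latex
P_{\text{row group}} = 1 - (1 - p_{\text{page}})^{N} \quad\Longrightarrow\quad p_{\text{page}} = 1 - (1 - P_{\text{row group}})^{1/N}
```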

So I'm proposing we go with this tentative design:

Then we would have a bunch of rules for detecting that the bloom filter is too large so that we can free up memory early.

asfimport commented 8 years ago

Ferdinand Xu / @winningsix: Hi @rdblue, I have updated the two related PRs (https://github.com/apache/parquet-format/pull/28 and https://github.com/apache/parquet-mr/pull/215). Could you help me review them? Thank you. In the latest patch, the bloom filter bitset is no longer stored at the page level, which reduces the extra space significantly.

asfimport commented 8 years ago

Ryan Blue / @rdblue: @winningsix, I think we need a design doc for this feature and some data about it before building an implementation. There are still some unknowns that I don't think we have designed enough. I don't think the current approach that mirrors ORC is appropriate because we don't know the number of unique values in pages and the filters are very sensitive to over-filling.

asfimport commented 8 years ago

Ferdinand Xu / @winningsix: Thank you for your feedback. I tested the patch on CDH 5.5. It brings a ~2.6X performance improvement at the cost of 3% extra space when executing a query on a 1.5 GB data set with the min/max statistics disabled. Since there's a big divergence between the CDH 5.5 code base and master, some further work is needed to make it work upstream. I don't think we need to calculate the number of unique values, since the same value always produces the same hash values under the same hash functions; if the expected number is higher than the real value, it should be OK. So the problem becomes how we get the total number of pages. I will think about it and work on the design document. Thank you!

asfimport commented 8 years ago

Ferdinand Xu / @winningsix: Hi @rdblue, I have a basic idea about how to estimate the expected number of entries required by the bloom filter. AFAIK we can't get the row count for a row group before all of its data is flushed to disk, so we can estimate the size in the following way. For the first row group, we don't create bloom filter statistics at the beginning; by the time the first row group is flushed, we have a general idea of the row count. For the rest of the row groups, we use this row count to size the bloom filter bit set. We can improve this slightly: since we know the target size of the whole row group, we can calculate the expected entry count from the average size of the first 100 or 1000 rows. Because of how bloom filters work, we need to keep these items in a temporary buffer; once the bloom filter bit set is created, we flush them into the bit set and then drop them. One thing I want to highlight is that we don't need to know the exact row count; an estimate is enough. Any thoughts about this idea?

asfimport commented 7 years ago

Constantin Muraru / @costimuraru: Any news on this one? This would be great.

asfimport commented 7 years ago

Ryan Blue / @rdblue: @costimuraru, dictionary-based filters were added that satisfy much of the need for bloom filters. We've not seen a use case yet where bloom filters would help beyond what dictionary filters can do. What is your use case that you think bloom filters would work for?

asfimport commented 7 years ago

Ferdinand Xu / @winningsix: It's very useful when trying to filter on a non-partitioning column. With this patch, we obtained the following performance acceleration in a customer's environment: for the bloom filter in Impala, initial test results show queries are about 2X faster for an existing item and about 15X faster for a nonexistent item.

asfimport commented 7 years ago

Ryan Blue / @rdblue: @winningsix, I agree that this type of filtering is a good thing, but I haven't seen cases where bloom filters are necessary when we could have used dictionary-based filtering to do the same thing.

Since we're already producing and storing dictionaries, which have no false positives, using those is a much better strategy. This is only applicable for columns that aren't dictionary-encoded, or in other words those columns with a large number of distinct values. For those cases, I think you're better off using a larger dictionary or you're unlikely to get a benefit from a bloom filter because of the size needed.

asfimport commented 7 years ago

Ferdinand Xu / @winningsix:

This is only applicable for columns that aren't dictionary-encoded, or in other words those columns with a large number of distinct values.

Agree.

For those cases, I think you're better off using a larger dictionary or you're unlikely to get a benefit from a bloom filter because of the size needed.

If a column is not dictionary-encoded, converting it to dictionary encoding will increase its size depending on the number of distinct values. In some of our tests and customer environments, enabling the bloom filter costs only about 3% extra space and achieves a 3X to 7X performance improvement.