apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

audit and create a document for bloom filter configurations #3138

Closed: jimexist closed this 2 years ago

jimexist commented 2 years ago
Thank you @Jimexist -- this is very cool. I went through the code fairly thoroughly. I had some minor suggestions / comments for documentation and code structure but nothing that would block merging.

I think the biggest thing I would like to discuss is "what parameters to expose for the writer API". For example, will users of this feature be able to set "fpp" (false-positive probability) and "ndv" (number of distinct values) reasonably? I suppose knowing the number of distinct values before writing a parquet file is reasonable, but maybe not the expected number of distinct values for each row group.

I did some research on other implementations. Here are the Spark settings: https://spark.apache.org/docs/latest/configuration.html

| Property | Default | Meaning | Since Version |
|---|---|---|---|
| spark.sql.optimizer.runtime.bloomFilter.creationSideThreshold | 10MB | Size threshold of the bloom filter creation side plan. Estimated size needs to be under this value to try to inject bloom filter. | 3.3.0 |
| spark.sql.optimizer.runtime.bloomFilter.enabled | false | When true and if one side of a shuffle join has a selective predicate, we attempt to insert a bloom filter in the other side to reduce the amount of shuffle data. | 3.3.0 |
| spark.sql.optimizer.runtime.bloomFilter.expectedNumItems | 1000000 | The default number of expected items for the runtime bloom filter. | 3.3.0 |
| spark.sql.optimizer.runtime.bloomFilter.maxNumBits | 67108864 | The max number of bits to use for the runtime bloom filter. | 3.3.0 |
| spark.sql.optimizer.runtime.bloomFilter.maxNumItems | 4000000 | The max allowed number of expected items for the runtime bloom filter. | 3.3.0 |
| spark.sql.optimizer.runtime.bloomFilter.numBits | 8388608 | The default number of bits to use for the runtime bloom filter. | 3.3.0 |

the Arrow C++ writer seems to allow for the fpp setting (the linked option is from the ORC adapter):

https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N5arrow8adapters3orc12WriteOptions16bloom_filter_fppE

double bloom_filter_fpp = 0.05
The upper limit of the false-positive rate of the bloom filter, default 0.05.

Databricks seems to expose the fpp, max_fpp, and num distinct values: https://docs.databricks.com/sql/language-manual/delta-create-bloomfilter-index.html

Originally posted by @alamb in https://github.com/apache/arrow-rs/pull/3119#pullrequestreview-1186585988

jimexist commented 2 years ago

this is considered a follow-up of:

jimexist commented 2 years ago

FYI, in Spark there's also a document regarding options that can be set for the Parquet bloom filter: https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html

alamb commented 2 years ago

Do you have any suggestions? After a few more days of thought I don't have anything better than ndv and fpp.

The only other possibility I have is to keep this crate simpler and simply expose `set_bloom_filter_size`, having users explicitly specify the size. It isn't ideal, but perhaps it would be ok if we added a pointer to the canonical ndv/fpp calculations?
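For reference, the canonical calculation is small enough to sketch here. This is a minimal sketch under the classic (non-split-block) Bloom filter model; the helper names are hypothetical, not crate API:

```rust
/// Optimal bit count for a classic Bloom filter holding `ndv` distinct
/// values at false-positive probability `fpp`: m = -n * ln(p) / (ln 2)^2
fn optimal_num_bits(ndv: u64, fpp: f64) -> u64 {
    assert!(fpp > 0.0 && fpp < 1.0);
    (-(ndv as f64) * fpp.ln() / std::f64::consts::LN_2.powi(2)).ceil() as u64
}

/// Optimal number of hash functions: k = (m / n) * ln 2
fn optimal_num_hashes(num_bits: u64, ndv: u64) -> u32 {
    ((num_bits as f64 / ndv as f64) * std::f64::consts::LN_2)
        .round()
        .max(1.0) as u32
}

fn main() {
    let bits = optimal_num_bits(1_000_000, 0.05); // ~6.24M bits, ~780 KiB
    let k = optimal_num_hashes(bits, 1_000_000);  // ~4 hash functions
    println!("{bits} bits, {k} hashes");
}
```

Note that parquet's split-block Bloom filters fix the block layout, so an implementation would adapt this math rather than use it verbatim.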

jimexist commented 2 years ago

@alamb I believe we should start simple and support only 2 params:

  1. whether the bloom filter is enabled, as a master switch
  2. fpp in (0, 1.0), with which we'd assume all items are unique and use the row count per row group to calculate a bitset size, but cap that at 128MiB for unreasonably small fpp (e.g. 0.0000001); for very large fpp (e.g. 0.9999) the minimum is 32 bytes

Controlling disk size directly does not quite make sense and is counterintuitive, because users would then need to both estimate the number of unique items per row group and know how to derive fpp from that. In most cases, having a maximum fpp is good enough; see the sketch below.
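A minimal sketch of that sizing rule, assuming the classic Bloom filter bit-count formula plus power-of-two rounding (the function name and rounding choice are illustrative, not a settled API):

```rust
const MIN_BYTES: u64 = 32;                // floor for very large fpp, e.g. 0.9999
const MAX_BYTES: u64 = 128 * 1024 * 1024; // 128MiB cap for very small fpp

/// Derive a bitset size from fpp alone, assuming every row in the
/// row group is a distinct value.
fn bitset_num_bytes(rows_in_row_group: u64, fpp: f64) -> u64 {
    assert!(fpp > 0.0 && fpp < 1.0);
    // classic Bloom filter bit count: m = -n * ln(p) / (ln 2)^2
    let bits =
        -(rows_in_row_group as f64) * fpp.ln() / std::f64::consts::LN_2.powi(2);
    let bytes = (bits / 8.0).ceil() as u64;
    // clamp to [32 B, 128 MiB]; both bounds are powers of two, so rounding
    // up to a power of two afterwards keeps the result in range
    bytes.clamp(MIN_BYTES, MAX_BYTES).next_power_of_two()
}
```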

cc @tustvold

alamb commented 2 years ago

I like the idea of specifying fpp (and it follows the Arrow C++ model)

> with which we'd assume all unique items

I think that makes sense as the main use case for bloom filters is high cardinality / close to unique columns.

Perhaps we can document the use case clearly (i.e. "bloom filters will likely only help for almost-unique data like ids and uuids; for other types, sorting/clustering and min/max statistics will work as well if not better").

jimexist commented 2 years ago

Turns out I have to allow users to specify ndv and have it default to, say, 1 million. The current code architecture requires streaming encoding, which means there's no good way to know in advance how many rows will be written.
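For context, this is roughly the shape the writer API later took in the parquet crate's WriterProperties builder (method names as in recent parquet crate releases; the values shown are illustrative, and exact signatures should be checked against the crate docs):

```rust
use parquet::file::properties::WriterProperties;

fn main() {
    let props = WriterProperties::builder()
        .set_bloom_filter_enabled(true)   // master switch
        .set_bloom_filter_fpp(0.05)       // target false-positive probability
        .set_bloom_filter_ndv(1_000_000)  // expected distinct values per row group
        .build();
    let _ = props;
}
```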

alamb commented 2 years ago

label_issue.py automatically added labels {'parquet'} from #3165