Closed by jimexist 2 years ago
This is considered a follow-up of:
FYI, in Spark there is also a document regarding the options that can be set for the Parquet bloom filter: https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html
Do you have any suggestions? After a few more days of thought I don't have anything better than ndv and fpp.
The only other possibility I have is to keep this crate simpler and simply expose `set_bloom_filter_size`,
having users explicitly specify the size. It isn't ideal, but perhaps it would be OK if we added a pointer to the canonical ndv/fpp calculations?
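For reference, the parquet-format bloom filter documentation sizes the split-block filter's bitset from ndv and fpp. A minimal Rust sketch of that canonical calculation (the function name is illustrative, not this crate's API):

```rust
/// Illustrative sketch of the canonical split-block bloom filter sizing,
/// roughly: bits = -8 * ndv / ln(1 - fpp^(1/8)).
fn optimal_num_of_bits(ndv: u64, fpp: f64) -> usize {
    assert!(fpp > 0.0 && fpp < 1.0, "fpp must be within (0, 1)");
    let bits = -8.0 * ndv as f64 / (1.0 - fpp.powf(1.0 / 8.0)).ln();
    bits.ceil() as usize
}
```

For example, ndv = 1,000,000 and fpp = 0.01 works out to roughly 9.7 million bits, i.e. about 1.2 MB of bitset per column per row group.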
@alamb I believe we should start simple, supporting only 2 params: an fpp within the range (0, 1.0), with which we'd assume all items are unique and use the row count per row group to calculate a bitset size, but cap that at 128MiB for an unreasonably small fpp, e.g. 0.0000001; for a very large fpp, e.g. 0.9999, the minimum is 32. Controlling the disk size directly does not quite make sense, or is counterintuitive, because users would then need to both estimate the number of unique items per row group and know how to derive fpp from that - in most cases, having a maximum fpp is good enough.
cc @tustvold
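To make the scheme above concrete, here is a minimal sketch, assuming every value in a row group is distinct: derive the bitset size from the row count and the target fpp, then clamp it to the [32 bytes, 128 MiB] bounds mentioned above (constant and function names are illustrative):

```rust
/// Illustrative constants for the bounds discussed above.
const MIN_BITSET_BYTES: usize = 32;
const MAX_BITSET_BYTES: usize = 128 * 1024 * 1024; // 128 MiB cap

/// Size a split-block bloom filter bitset from a row group's row count and a
/// target fpp, treating every row as a distinct value.
fn bitset_bytes_for_row_group(num_rows: u64, fpp: f64) -> usize {
    // canonical sizing formula, with ndv assumed equal to num_rows
    let bits = -8.0 * num_rows as f64 / (1.0 - fpp.powf(1.0 / 8.0)).ln();
    let bytes = (bits / 8.0).ceil() as usize;
    // round up to a power of two (as split-block filters typically do) and clamp
    bytes
        .next_power_of_two()
        .clamp(MIN_BITSET_BYTES, MAX_BITSET_BYTES)
}
```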
I like the idea of specifying fpp (and it follows the Arrow C++ model)
"with which we'd assume all unique items"
I think that makes sense as the main use case for bloom filters is high cardinality / close to unique columns.
Perhaps we can document the use case clearly (aka "bloom filters will likely only help for almost-unique data like ids and uuids; for other types, sorting/clustering and min/max statistics will work as well, if not better")
Turns out I have to allow users to specify ndv, with a default of, say, 1 million. The current code architecture requires encoding values as they flow in, which means there's no good way to know in advance how many rows will be written.
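A possible shape for the resulting writer API, with an fpp knob and an ndv that defaults to something like 1 million unless overridden, could look like the following; the builder method names are assumptions for illustration, not necessarily the final API:

```rust
use parquet::file::properties::WriterProperties;

fn main() {
    // Hypothetical bloom filter configuration along the lines discussed:
    // enable the filter, pick a target fpp, and optionally override the
    // assumed ndv default (e.g. 1,000,000).
    let props = WriterProperties::builder()
        .set_bloom_filter_enabled(true)
        .set_bloom_filter_fpp(0.05)
        .set_bloom_filter_ndv(1_000_000)
        .build();

    // `props` would then be handed to an ArrowWriter / SerializedFileWriter.
    let _ = props;
}
```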
label_issue.py automatically added labels {'parquet'} from #3165
I think the biggest thing I would like to discuss is "what parameters to expose for the writer API". I was thinking, for example, will users of this feature be able to set "fpp" and "ndv" reasonably? I suppose having the number of distinct values before writing a parquet file is reasonable, but maybe not the expected number of distinct values for each row group.
I did some research on other implementations. Here are the Spark settings: https://spark.apache.org/docs/latest/configuration.html
The Arrow Parquet C++ writer seems to allow for an fpp setting:
https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N5arrow8adapters3orc12WriteOptions16bloom_filter_fppE
Databricks seems to expose fpp, max_fpp, and the number of distinct values: https://docs.databricks.com/sql/language-manual/delta-create-bloomfilter-index.html
Originally posted by @alamb in https://github.com/apache/arrow-rs/pull/3119#pullrequestreview-1186585988