Dream-hu opened this issue 1 year ago
Currently, Arrow's Parquet C++ implementation does not support reading or writing bloom filters (BF).
Impala and parquet-mr support it, maybe you can take a look there.
I'm a little bit lost here as well. I see that the 12.0.0 release notes contain your new bloom filter contributions, but that doesn't mean pyarrow.parquet.read_table() would actually use them when filtering on value equality. Also, pyarrow doesn't write them either, right?
Are there any issues / draft PRs open for tracking bloom filter support? Searching GitHub issues didn't turn up good results.
I'm not a fan of creating many issues ahead of time, but this is a feature with a specification (even if it's optional), so it's not likely that the requirements or the ecosystem will change a lot in the future.
Support for reading bloom filters from parquet files into memory was added in 12.0.0. There is an open issue for using this feature to do pushdown filtering here: https://github.com/apache/arrow/issues/27277
The datasets feature was already doing some pushdown using the parquet file statistics. That issue asks to also use the bloom filter for pushdown filtering for datasets.
The parquet reader itself hasn't done pushdown in the past, but I'd be generally in favor of moving the pushdown filtering out of the datasets layer and into the file reader layer itself if someone was motivated to do the work. That would be more complex than just adding bloom filter filtering support to the datasets layer though because you'd have to figure out how to formulate filter expressions (you could add a dependency on arrow expressions but I'm not sure if that makes sense in the parquet layer).
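For context, the statistics-based pushdown mentioned above can be illustrated with a toy sketch (pure Python with invented names, not Arrow's actual implementation): a row group is skipped when its min/max statistics prove the predicate cannot match any row in it.

```python
# Toy illustration of statistics-based row-group pushdown (not Arrow's API).
# A row group whose [min, max] range excludes the predicate value is skipped
# without reading any of its data pages.

def can_skip_row_group(stats, column, op, value):
    """Return True if min/max statistics prove no row can match."""
    lo, hi = stats[column]
    if op == "=":
        return value < lo or value > hi
    if op == "<":
        return lo >= value      # every row is >= value, none can be < value
    if op == ">":
        return hi <= value      # every row is <= value, none can be > value
    return False                # be conservative for other operators

row_groups = [
    {"x": (0, 9)},     # row group 0: x in [0, 9]
    {"x": (10, 19)},   # row group 1: x in [10, 19]
]

# The predicate x = 12 can only match row group 1.
survivors = [i for i, rg in enumerate(row_groups)
             if not can_skip_row_group(rg, "x", "=", 12)]
print(survivors)  # [1]
```

A bloom filter extends this idea to equality predicates that the min/max range alone cannot refute.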
I'll finish the dataset scanner part this month if no one else is interested in it @westonpace
Regarding the expressions: maybe it's overkill, but would using the filter subset of Substrait work?
It all depends on the data distribution and the user's query. It could make a query faster; in the worst case it may make the query slower.
I'll finish the dataset scanner part this month if no one else is interested in it @westonpace
That would be great, thanks.
Maybe it's overkill, but would using the filter subset of Substrait work?
That probably is overkill, though it would work if someone wanted it. I believe bloom filters are only useful for equality / inequality, while the statistics support comparison, so you probably just need =, !=, <, >, <=, >=. The simplest thing might be to do what we used to do for the old Python datasets and accept disjunctive normal form:
Predicates are expressed using an Expression or using the disjunctive normal form (DNF), like [[('x', '=', 0), ...], ...]. DNF allows arbitrary boolean logical combinations of single column predicates. The innermost tuples each describe a single column predicate. The list of inner predicates is interpreted as a conjunction (AND), forming a more selective and multiple column predicate. Finally, the most outer list combines these filters as a disjunction (OR).
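As a concrete illustration of the DNF semantics described above, here is a toy row-level evaluator (pure Python, not pyarrow's implementation):

```python
# Toy evaluator for DNF filters like [[('x', '=', 0), ...], ...]:
# inner lists are ANDed together, the outer list is ORed.
import operator

OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt,
       ">": operator.gt, "<=": operator.le, ">=": operator.ge}

def matches(row, dnf):
    # OR over conjunctions; AND over single-column predicates.
    return any(all(OPS[op](row[col], val) for col, op, val in conj)
               for conj in dnf)

# (x = 0 AND y > 5) OR (x = 1)
dnf = [[("x", "=", 0), ("y", ">", 5)], [("x", "=", 1)]]
print(matches({"x": 0, "y": 7}, dnf))  # True
print(matches({"x": 0, "y": 3}, dnf))  # False
print(matches({"x": 1, "y": 0}, dnf))  # True
```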
Sorry for the late reply, I've been a bit busy these days. I found that using the bloom filter is not trivial: it might improve performance, or it might not. Should I add a use_bloom_filter option in ParquetFragmentScanOptions?
```cpp
/// \brief Per-scan options for Parquet fragments
class ARROW_DS_EXPORT ParquetFragmentScanOptions : public FragmentScanOptions {
 public:
  ParquetFragmentScanOptions();
  std::string type_name() const override { return kParquetTypeName; }

  /// Reader properties. Not all properties are respected: memory_pool comes
  /// from ScanOptions.
  std::shared_ptr<parquet::ReaderProperties> reader_properties;
  /// Arrow reader properties. Not all properties are respected: batch_size
  /// comes from ScanOptions. Additionally, dictionary columns come from
  /// ParquetFileFormat::ReaderOptions::dict_columns.
  std::shared_ptr<parquet::ArrowReaderProperties> arrow_reader_properties;
};
```
@westonpace
Sorry for the late reply, I've been a bit busy these days. I found that using the bloom filter is not trivial: it might improve performance, or it might not. Should I add a use_bloom_filter option in ParquetFragmentScanOptions?
Yes, that would be a good place for it. We would want a comment that provides users with enough information to help make the correct choice. For example "This feature allows parquet bloom filters to be used to reduce the amount of data that needs to be read from the disk. However, applying these filters can be expensive and, if the filter is not very selective, may cost more CPU time than they save." (I don't know if that is the actual reason, feel free to modify as appropriate based on your testing)
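To make that trade-off concrete, here is a minimal bloom filter sketch in pure Python (illustrative only; the Parquet format actually specifies a split-block bloom filter using xxHash). A probe can return a false positive, so an unselective or undersized filter burns CPU on probes while skipping little data.

```python
# Minimal bloom filter sketch (NOT Parquet's split-block filter):
# k hash probes set/test bits in an m-bit array. There are no false
# negatives, but false positives appear as the filter fills up.
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _probes(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._probes(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        # False => definitely absent; True => only *possibly* present.
        return all(self.bits >> p & 1 for p in self._probes(item))

bf = BloomFilter()
for v in range(100):
    bf.add(v)
assert bf.might_contain(42)         # inserted values are never missed
print(bf.might_contain(-1))         # usually False, but can be a false positive
```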
I have two problems with this:
First, I found it a bit hard to implement with the current framework. Our input is an expression tree, and the bloom filter can transform ("eq", field_ref, literal) or ("is_in", field_ref, literals...) into a literal "false" in the row-group expression. Can you tell me which expression-handling method I can use for this?
Second, ParquetFileFragment::Subset should ensure the metadata is already loaded, so maybe I should add an option and load all bloom filters into memory. I guess that would be time-consuming. Should I load them all up front, or load them on demand?
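On the eager-vs-lazy question, one common pattern is to load each bloom filter lazily and cache it the first time a predicate actually probes that (row group, column) pair, rather than paying the I/O for every filter up front. A hypothetical sketch (all names invented, not Arrow's API):

```python
# Hypothetical lazy-loading cache for per-(row group, column) bloom filters.
# Filters are fetched only when a predicate first probes that column,
# avoiding the up-front cost of reading every filter in the file.

class LazyBloomFilters:
    def __init__(self, read_filter):
        self._read_filter = read_filter   # callable that does the actual I/O
        self._cache = {}
        self.reads = 0                    # instrumentation for this example

    def get(self, row_group, column):
        key = (row_group, column)
        if key not in self._cache:
            self.reads += 1
            self._cache[key] = self._read_filter(row_group, column)
        return self._cache[key]

# Fake I/O: the "filter" is just a set of values for the demo.
store = {(0, "x"): {1, 2, 3}, (1, "x"): {4, 5, 6}}
filters = LazyBloomFilters(lambda rg, col: store[(rg, col)])

filters.get(0, "x")
filters.get(0, "x")     # cached, no second read
print(filters.reads)    # 1
```

The trade-off: lazy loading avoids reading filters for columns no predicate touches, at the cost of extra small reads scattered through the scan.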
@westonpace I've fixed (1) with ModifyExpression, but (2) remains a problem.
Sorry for the slow response. For (2) I think you will want to do something similar to SimplifyWithGuarantee: https://github.com/apache/arrow/blob/9736dde84bb2e6996d1d12f6a044c33398e3c3a3/cpp/src/arrow/compute/expression.cc#L1332
I don't think you can use SimplifyWithGuarantee directly because the bloom filter is a weird sort of guarantee, so I think you will need a new method. Specifically for (2) I think you want to:
- replace the refuted is_in/eq nodes in the AST with literal(false)
- Canonicalize
- FoldConstants
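The replace-then-fold steps can be sketched with a toy expression tree (pure Python; Arrow's real Canonicalize/FoldConstants operate on compute::Expression, this only shows the shape of the transform):

```python
# Toy version of the transform: replace bloom-filter-refuted eq leaves with
# literal False, then fold constants so whole subtrees collapse.
# Expression nodes: ("and", a, b), ("or", a, b), ("eq", col, val), ("lit", bool).

def replace_refuted(expr, refuted):
    """Replace eq leaves refuted by a bloom filter with ('lit', False)."""
    op = expr[0]
    if op in ("and", "or"):
        return (op, replace_refuted(expr[1], refuted),
                    replace_refuted(expr[2], refuted))
    if op == "eq" and (expr[1], expr[2]) in refuted:
        return ("lit", False)
    return expr

def fold_constants(expr):
    """Constant-fold and/or nodes that have a literal child."""
    op = expr[0]
    if op not in ("and", "or"):
        return expr
    a, b = fold_constants(expr[1]), fold_constants(expr[2])
    for x, y in ((a, b), (b, a)):
        if x[0] == "lit":
            if op == "and":
                return ("lit", False) if not x[1] else y
            else:  # "or"
                return ("lit", True) if x[1] else y
    return (op, a, b)

# (x = 1 AND y = 2): the bloom filter proves x = 1 matches nothing here,
# so the whole row-group predicate folds to False and the group is skipped.
expr = ("and", ("eq", "x", 1), ("eq", "y", 2))
simplified = fold_constants(replace_refuted(expr, {("x", 1)}))
print(simplified)  # ('lit', False)
```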
Got it, I'll give it a try.
I see this is still open and I have the same original question: can bloom filters be utilized through Arrow Read API?
@mapleFU
Generally, it's not implemented yet. I'll pick that back up after https://github.com/apache/arrow/pull/37400/files is merged.
Describe the usage question you have. Please include as many useful details as possible.
Parquet C++ has implemented BloomFilter, however the Arrow FileReader (or anything else) never calls it during reading. I am confused and want to figure out: (1) Does arrow::FileReader have a plan to support filter pushdown, and when? (2) When and how will the BloomFilter be used by arrow::FileReader?
Looking forward to a reply, thanks a lot!
Component(s)
Parquet