apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++][Parquet] Will Arrow::FileReader support filter evaluation and BloomFilter-based optimization? #33683

Open Dream-hu opened 1 year ago

Dream-hu commented 1 year ago

Describe the usage question you have. Please include as many useful details as possible.

Parquet C++ has implemented BloomFilter, but Arrow::FileReader (or anything else) never calls it during reading. I am confused and want to figure out: (1) does Arrow::FileReader plan to support filter push-down, and when? (2) when and how will the BloomFilter be used by Arrow::FileReader?

Looking forward to your reply, thanks a lot!

Component(s)

Parquet

mapleFU commented 1 year ago

Currently, Arrow Parquet for C++ does not support reading or writing bloom filters.

Impala and parquet-mr support it; maybe you can take a look there.

alippai commented 1 year ago

I'm a little bit lost here as well. I see that the Parquet 12.0.0 release notes contain your new BF contributions, but that doesn't mean that, when checking for value equality, pyarrow.parquet.read_table() would actually use it. Also, pyarrow doesn't write it either, right?
Are there any issues / draft PRs open for tracking BF support? Searching GitHub issues didn't come up with good results.

I'm not a fan of creating many issues ahead of time, but this is a feature with a specification (even if it's optional), so it's not likely the requirements or the ecosystem will change a lot in the future.

westonpace commented 1 year ago

Support for reading bloom filters from parquet files into memory was added in 12.0.0. There is an open issue for using this feature to do pushdown filtering here: https://github.com/apache/arrow/issues/27277
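
For reference, here is a minimal sketch of probing those bloom filters through the C++ API added in 12.0.0 (GetBloomFilterReader and friends); the file name, column index, and probe value are placeholders:

```cpp
#include <iostream>
#include <memory>

#include "arrow/io/file.h"
#include "parquet/bloom_filter.h"
#include "parquet/bloom_filter_reader.h"
#include "parquet/file_reader.h"

int main() {
  // Open a parquet file (the path is a placeholder).
  std::shared_ptr<arrow::io::ReadableFile> infile =
      arrow::io::ReadableFile::Open("data.parquet").ValueOrDie();
  std::unique_ptr<parquet::ParquetFileReader> reader =
      parquet::ParquetFileReader::Open(infile);

  // Probe column 0's bloom filter in each row group for the value 42.
  parquet::BloomFilterReader& bf_reader = reader->GetBloomFilterReader();
  const int64_t needle = 42;
  for (int rg = 0; rg < reader->metadata()->num_row_groups(); ++rg) {
    auto rg_filters = bf_reader.RowGroup(rg);
    std::unique_ptr<parquet::BloomFilter> bf = rg_filters->GetColumnBloomFilter(0);
    if (bf == nullptr) continue;  // no bloom filter stored for this column chunk
    if (!bf->FindHash(bf->Hash(needle))) {
      // A negative answer is definitive: this row group cannot contain 42.
      std::cout << "row group " << rg << " can be skipped" << std::endl;
    }
  }
  return 0;
}
```

Note that this only loads and probes each filter; nothing applies it to a scan automatically.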

The datasets feature was already doing some pushdown using the parquet file statistics. That issue asks to also use the bloom filter for pushdown filtering for datasets.

The parquet reader itself hasn't done pushdown in the past, but I'd generally be in favor of moving the pushdown filtering out of the datasets layer and into the file reader layer itself if someone were motivated to do the work. That would be more complex than just adding bloom filter support to the datasets layer, though, because you'd have to figure out how to formulate filter expressions (you could add a dependency on Arrow expressions, but I'm not sure that makes sense in the parquet layer).

mapleFU commented 1 year ago

I'll finish the dataset scanner part this month if no one else is interested in it @westonpace

alippai commented 1 year ago

Regarding the expressions: maybe it's overkill, but would using the filter subset of Substrait work?

mapleFU commented 1 year ago

It all depends on the data distribution and the user's query. It could make a query faster, but in the worst case it may make it slower.

westonpace commented 1 year ago

> I'll finish the dataset scanner part this month if no one else is interested in it @westonpace

That would be great, thanks.

westonpace commented 1 year ago

> Maybe it's overkill, but would using the filter subset of Substrait work?

That probably is overkill, though it would work if someone had the desire. I believe bloom filters are only useful for equality / inequality. The statistics support comparison, so you probably just need =, !=, <, >, <=, >=. The simplest thing might be to do what we used to do for the old Python datasets and accept disjunctive normal form:

> Predicates are expressed using an Expression or using the disjunctive normal form (DNF), like [[('x', '=', 0), ...], ...]. DNF allows arbitrary boolean logical combinations of single column predicates. The innermost tuples each describe a single column predicate. The list of inner predicates is interpreted as a conjunction (AND), forming a more selective and multiple column predicate. Finally, the most outer list combines these filters as a disjunction (OR).
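
For illustration, one hypothetical lowering of a DNF filter like [[('x', '=', 0), ('y', '>', 1)], [('z', '=', 7)]] onto Arrow's C++ compute expressions (assuming the factories in arrow/compute/expression.h):

```cpp
#include "arrow/compute/expression.h"

namespace cp = arrow::compute;

// DNF: (x == 0 AND y > 1) OR (z == 7).
// Each inner tuple becomes a conjunct; the outer list becomes a disjunction.
cp::Expression DnfExample() {
  return cp::or_({cp::and_({cp::equal(cp::field_ref("x"), cp::literal(0)),
                            cp::greater(cp::field_ref("y"), cp::literal(1))}),
                  cp::equal(cp::field_ref("z"), cp::literal(7))});
}
```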

mapleFU commented 1 year ago

Sorry for the late reply; I've been a bit busy these days. I found a problem: applying the bloom filter is not trivial. It might improve performance, or it might not. Should I add a use_bloom_filter option in ParquetFragmentScanOptions?

```cpp
/// \brief Per-scan options for Parquet fragments
class ARROW_DS_EXPORT ParquetFragmentScanOptions : public FragmentScanOptions {
 public:
  ParquetFragmentScanOptions();
  std::string type_name() const override { return kParquetTypeName; }

  /// Reader properties. Not all properties are respected: memory_pool comes from
  /// ScanOptions.
  std::shared_ptr<parquet::ReaderProperties> reader_properties;
  /// Arrow reader properties. Not all properties are respected: batch_size comes from
  /// ScanOptions. Additionally, dictionary columns come from
  /// ParquetFileFormat::ReaderOptions::dict_columns.
  std::shared_ptr<parquet::ArrowReaderProperties> arrow_reader_properties;
};
```

@westonpace

westonpace commented 1 year ago

> Sorry for the late reply; I've been a bit busy these days. I found a problem: applying the bloom filter is not trivial. It might improve performance, or it might not. Should I add a use_bloom_filter option in ParquetFragmentScanOptions?

Yes, that would be a good place for it. We would want a comment that gives users enough information to make the correct choice. For example: "This feature allows parquet bloom filters to be used to reduce the amount of data that needs to be read from disk. However, applying these filters can be expensive and, if a filter is not very selective, may cost more CPU time than it saves." (I don't know if that is the actual reason; feel free to modify as appropriate based on your testing.)
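
A hypothetical sketch of the resulting option, combining mapleFU's proposal with the suggested doc comment (not a merged API):

```cpp
/// \brief Per-scan options for Parquet fragments
class ARROW_DS_EXPORT ParquetFragmentScanOptions : public FragmentScanOptions {
 public:
  // ... existing members shown above ...

  /// \brief EXPERIMENTAL: use parquet bloom filters, when present, for
  /// pushdown filtering.
  ///
  /// Bloom filters can reduce the amount of data that needs to be read
  /// from disk, but applying them can be expensive and, if a filter is
  /// not very selective, may cost more CPU time than it saves.
  bool use_bloom_filter = false;
};
```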

mapleFU commented 1 year ago

I have two problems with this:

First, I found it a bit hard to implement with the current framework. Our input is an expression tree, and the bloom filter can transform some of its predicates to "false" in the row-group expressions. So I need to:

  1. match these expressions
  2. transform them to false if the bloom filter shows they cannot match

Can you point me to the expression-handling methods I can use?

Second, ParquetFileFragment::Subset should ensure the metadata is already loaded, so maybe I should add an option and load all bloom filters into memory. I guess that would be time-consuming. Should I load them all up front, or load them on demand?

@westonpace

mapleFU commented 1 year ago

I've fixed (1) with ModifyExpression, but (2) remains a problem.
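
For context, ModifyExpression is an internal visitor helper in arrow/compute/expression_internal.h. A rough sketch of the rewrite it enables, written here against the public expression API; PruneWithBloomFilter and the rules_out callback are hypothetical names:

```cpp
#include <functional>
#include <utility>
#include <vector>

#include "arrow/compute/expression.h"

namespace cp = arrow::compute;

// Rewrite `col == literal` to `false` wherever the row group's bloom filter
// proves the literal cannot be present. `rules_out` is a hypothetical
// callback wrapping parquet::BloomFilter::FindHash for one row group.
cp::Expression PruneWithBloomFilter(
    const cp::Expression& expr,
    const std::function<bool(const arrow::FieldRef&, const arrow::Datum&)>&
        rules_out) {
  const auto* call = expr.call();
  if (call == nullptr) return expr;  // literals and field refs pass through

  if (call->function_name == "equal" && call->arguments.size() == 2) {
    const auto* field = call->arguments[0].field_ref();
    const auto* value = call->arguments[1].literal();
    if (field != nullptr && value != nullptr && rules_out(*field, *value)) {
      return cp::literal(false);  // this row group cannot match
    }
  }

  // Recurse so pruned conjuncts can later simplify the whole tree.
  std::vector<cp::Expression> args;
  for (const auto& arg : call->arguments) {
    args.push_back(PruneWithBloomFilter(arg, rules_out));
  }
  return cp::call(call->function_name, std::move(args), call->options);
}
```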

westonpace commented 1 year ago

Sorry for the slow response. For (2) I think you will want to do something similar to SimplifyWithGuarantee: https://github.com/apache/arrow/blob/9736dde84bb2e6996d1d12f6a044c33398e3c3a3/cpp/src/arrow/compute/expression.cc#L1332
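
For context, a small self-contained illustration of SimplifyWithGuarantee folding a filter against a per-row-group guarantee (my own example; the field name and values are placeholders):

```cpp
#include <iostream>

#include "arrow/api.h"
#include "arrow/compute/expression.h"

namespace cp = arrow::compute;

arrow::Status Demo() {
  auto schema = arrow::schema({arrow::field("x", arrow::int32())});

  // User filter; the guarantee would come from row-group metadata
  // (e.g. statistics say every x in this row group is > 10).
  cp::Expression filter = cp::equal(cp::field_ref("x"), cp::literal(5));
  cp::Expression guarantee = cp::greater(cp::field_ref("x"), cp::literal(10));

  ARROW_ASSIGN_OR_RAISE(filter, filter.Bind(*schema));
  ARROW_ASSIGN_OR_RAISE(cp::Expression simplified,
                        cp::SimplifyWithGuarantee(filter, guarantee));

  // `simplified` should now be the literal false, i.e. the row group
  // can be skipped without reading its data pages.
  std::cout << simplified.ToString() << std::endl;
  return arrow::Status::OK();
}

int main() { return Demo().ok() ? 0 : 1; }
```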

I don't think you can use SimplifyWithGuarantee directly because the bloom filter is a weird sort of guarantee. So I think you will need a new method. Specifically for (2) I think you want to:

mapleFU commented 1 year ago

Got it, I'll give it a try.

arthurpassos commented 5 months ago

I see this is still open, and I have the same question as the original poster: can bloom filters be utilized through the Arrow read API?

@mapleFU

mapleFU commented 5 months ago

Generally, it's not implemented yet. I'll pick it up again after https://github.com/apache/arrow/pull/37400/files is merged.