apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.49k stars 1.02k forks source link

Consolidate interval analysies from `Interval` and `PruningPredicate` #7887

Open alamb opened 8 months ago

alamb commented 8 months ago

Is your feature request related to a problem or challenge?

We now have two ways to do range / interval analysis in DataFusion.

Having two representations is challenging because we have to implement the same logic in two places. For example,

The rewrite used by the PruningPrediate logic is tricky to understand and only handles very specific predicate forms (see https://github.com/apache/arrow-datafusion/pull/9184 for an essay on the topic and https://github.com/apache/arrow-datafusion/issues/9230 for an example of getting it wrong). Thus it is hard to extend the number of functions / types of predicates that are supported.

The existing range analysis are:

Interval based analysis

The ExprIntervalGraph library is used for cardinality estimation and range analysis for the symmetric hash join, and it:

  1. Can handle arbitrary expressions (such as a < b, not just constants a < 5)
  2. Has a story for how it would support user defined functions (each function would implement a range analysis)

Pruning Predicate Pruning Predicate is used to prune row groups based on min/max values which:

  1. Is vectorized (is efficient to evaluate over 1000s of containers)
  2. Handles the common cases of col <op> constant (such as a = 5, or a < 100), and conjunctions of them
  3. It is not clear (to me at least) how the rewrite that is used can handle arbitrary expressions (e.g. a < b)

Describe the solution you'd like

I would like to rewrite PruningPredicate to use ExprIntervalGraph, measuring and possibly improving the performance of ExprIntervalGraph

The benefits would be:

  1. We would have a single code path to extend and maintain
  2. We would have a clear path for handling arbitrary predicates to prune row groups

Describe alternatives you've considered

Doing so would likely require extending the interval analysis to support more operators (like IN lists) to reach feature parity with the current PruningPredicate rewrite

Additional context

There was a lot of discussion of this topic on the PR that originally introduced Intervals: https://github.com/apache/arrow-datafusion/pull/5322#discussion_r1114324054 between @ozankabak @metegenez and myself

alamb commented 8 months ago

@ozankabak and @tustvold I swear we have talked about this topic before but I could not find an existing ticket or discussion. Do you have any other pointers to past discussions?

berkaysynnada commented 8 months ago

I am not sure this is what you searched for but there was an issue https://github.com/apache/arrow-datafusion/issues/5535.

Actually, I have tried to apply cp_solver strategy to prune row groups. But we observed a performance degradation since this method sacrifices vectorized computing power, meaning that the process needs to be run for each set of statistics. As the number of sets increases, the efficiency decreases.

I will again think about how to insert Interval library there without sacrificing performance.

alamb commented 8 months ago

I am not sure this is what you searched for but there was an issue https://github.com/apache/arrow-datafusion/issues/5535.

Thank you -- this is exactly what I was looking for.

I will again think about how to insert Interval library there without sacrificing performance.

I may be able to help with this too. One way is use Datum (from arrow) (rather than ScalarValue)in theInterval` representation -- that way each can store a single value or multiple.

alamb commented 4 months ago

I just updated this ticket's description with a more coherent story and examples that @appletreeisyellow and I have hit recently while working on in https://github.com/apache/arrow-datafusion/issues/9171

We were talking today and I think @appletreeisyellow may try to prototype what this solution could look like, if she has time, to move the conversation forward.

appletreeisyellow commented 4 months ago

Thank you @alamb for the updating the description