apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Reorder boolean expressions (including filter predicates) according to evaluation cost / selectivity #11262

Open Dandandan opened 3 months ago

Dandandan commented 3 months ago

Is your feature request related to a problem or challenge?

After https://github.com/apache/datafusion/pull/11247 is merged, we can look at ordering boolean expressions according to a measure of evaluation cost.

Describe the solution you'd like

We can reorder expressions:

For example, take an expression like the following: URL LIKE '%google%' AND code = 404.

It would likely be better reordered to code = 404 AND URL LIKE '%google%' to benefit most from short-circuiting, since code = 404 is less expensive to evaluate. One could also combine this with a selectivity estimate to further optimize the order (with low selectivity, batches are more likely to be all false; with high selectivity, batches are more likely to be all true).
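To make the cost/selectivity trade-off concrete, here is a minimal sketch of one possible ordering rule. It is not DataFusion code: the `Conjunct` struct, its `cost` and `selectivity` fields, and the numbers used with it are all hypothetical. It assumes a simple per-row model with independent predicates, under which sorting conjuncts by `cost / (1 - selectivity)` in ascending order minimizes the expected evaluation cost. A batch-level model may call for a different rule.

```rust
/// Hypothetical per-predicate estimates; not part of any DataFusion API.
#[derive(Debug, Clone, PartialEq)]
struct Conjunct {
    sql: &'static str,
    /// Assumed relative cost of evaluating the predicate on one row.
    cost: f64,
    /// Assumed fraction of rows for which the predicate is true (0.0..=1.0).
    selectivity: f64,
}

/// Expected per-row cost of evaluating the conjuncts left to right with
/// short-circuiting: each predicate only runs if all earlier ones passed.
fn expected_cost(conjuncts: &[Conjunct]) -> f64 {
    let mut pass_prob = 1.0;
    let mut total = 0.0;
    for c in conjuncts {
        total += pass_prob * c.cost;
        pass_prob *= c.selectivity;
    }
    total
}

/// Cost paid per unit of rejection probability; lower is better.
/// A predicate that never rejects a row is ranked last.
fn rank(c: &Conjunct) -> f64 {
    let reject_prob = 1.0 - c.selectivity;
    if reject_prob <= 0.0 {
        f64::INFINITY
    } else {
        c.cost / reject_prob
    }
}

/// Stable sort by ascending rank, so ties keep the user's original order.
fn reorder(mut conjuncts: Vec<Conjunct>) -> Vec<Conjunct> {
    conjuncts.sort_by(|a, b| rank(a).total_cmp(&rank(b)));
    conjuncts
}
```

With made-up estimates such as cost 20 and selectivity 0.1 for the LIKE predicate and cost 1 and selectivity 0.05 for the equality check, `reorder` puts code = 404 first, and `expected_cost` is lower for the reordered list than for the original one.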

Describe alternatives you've considered

No response

Additional context

No response

suibianwanwank commented 3 months ago

I've seen discussions about predicate reordering in the Calcite community before. One of the big problems is that reordering by the engine invalidates the user-designed order of predicates: a user who understands our short-circuit optimisation may deliberately write the SQL predicates in a better order, and engine reordering would undo that effort.

Dandandan commented 3 months ago

Good call. If we do it, it needs to be configurable so users/engines can disable the optimization.

alamb commented 3 months ago

We could potentially do some simple heuristics that would catch the common case -- like "treat regexp as very slow and do them after other predicates"
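As a rough illustration of such a heuristic, the sketch below buckets predicates into coarse cost classes and moves the expensive ones to the end. Everything in it is an assumption made for illustration: the toy `Pred` enum stands in for a real expression type, the class numbers are arbitrary, and `ReorderOptions` is a hypothetical switch reflecting the point raised in this thread that the optimization should be possible to disable. It is not an existing DataFusion setting.

```rust
/// Toy predicate kinds for illustration only; not DataFusion's expression type.
#[derive(Debug, Clone, PartialEq)]
enum Pred {
    /// column = integer literal
    Eq(&'static str, i64),
    /// column LIKE pattern
    Like(&'static str, &'static str),
    /// column matches a regular expression
    RegexpMatch(&'static str, &'static str),
}

/// Hypothetical option that lets users keep their hand-written order.
struct ReorderOptions {
    enabled: bool,
}

/// Coarse, assumed cost classes: cheap comparisons first, regexps last.
fn cost_class(p: &Pred) -> u8 {
    match p {
        Pred::Eq(..) => 0,
        Pred::Like(..) => 1,
        Pred::RegexpMatch(..) => 2,
    }
}

/// Reorder the conjuncts of an AND chain by cost class when enabled.
fn reorder_conjuncts(mut preds: Vec<Pred>, opts: &ReorderOptions) -> Vec<Pred> {
    if opts.enabled {
        // Stable sort: predicates in the same class keep the user's order.
        preds.sort_by_key(cost_class);
    }
    preds
}
```

Because the sort is stable, a user's ordering is still respected among predicates of the same class, and setting `enabled` to false returns the conjuncts exactly as written.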