apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.21k stars 957 forks source link

Use `min_value` and `max_value` on statistics to avoid `ExecutionPlan.execute` #10400

Open samuelcolvin opened 1 week ago

samuelcolvin commented 1 week ago

Describe the bug

Maybe related to #5535, but I couldn't find anything identical, so created a fresh issue.

If this is a known bug and you think the fix might be moderate in scope, I'm happy to have a go at fixing it?

To Reproduce

I have a custom TableProvider and ExecutionPlan, where calling execute is somewhat expensive and I want to avoid calling it if no data will match.

The execution plan can return helpful statistics from .statistics(), including for example, for one column:

...
ColumnStatistics {
    null_count: Precision::Exact(0),
    max_value: Precision::Exact(ScalarValue::Int64(Some(4))),
    min_value: Precision::Exact(ScalarValue::Int64(Some(4))),
    distinct_count: Precision::Exact(1),
},

E.g. "in this column all values are equal to 4". This is successfully used by Datafusion if I query value is null, the execute() function is never alled.

But if I query value > 5 or value < 0, the statistic is ignored and execute() is still called.

Expected behavior

min_value and max_value of ColumnStatistics should be used for pruning and the query plan should not require the "slow" execute method to be called.

Additional context

I can give a fairly minimal example if required, but I thought best to report the issue and check if it was well known before going to that effort?

I've tried this on both main (as of today) and 37.1.0

samuelcolvin commented 1 week ago

For anyone else looking for this, the solution is to implement supports_filters_pushdown on the table, then use that to apply filters within scan.

Feel free to close this if you like.

alamb commented 1 week ago

I think using the reported statistics to prune using datafusion::physical_optimizer::pruning::PruningPredicate would be a nice improvement (so each table provider didn't have to apply the basic min/max filters themselves)

alamb commented 1 week ago

Relabed from bug to feature -- thanks @samuelcolvin

samuelcolvin commented 1 week ago

@alamb using PruningPredicate makes sense, but please can you point me at where I need to make changes to add this functionality?

alamb commented 5 days ago

@alamb using PruningPredicate makes sense, but please can you point me at where I need to make changes to add this functionality?

I recommend updating the existing though poorly named AggregateStatistics pass

And there you could potentially call ExecutionPlan::statistics() and use PruningPredicate to prove that some predicates can't be true and replace them with PlaceholderRowExec

alamb commented 5 days ago

(Thank you for working on this, bTW)

edmondop commented 4 days ago

@samuelcolvin have you already started to write code?

samuelcolvin commented 4 days ago

No code yet, if you'd like to work on this, feel free.

samuelcolvin commented 4 days ago

Actually I'm on a flight today, so might have some time to work on this.

samuelcolvin commented 3 days ago

Progress update, I've got min & max stats pruning "working" in our code, however I immediately ran into #10536, I'll let you know how I get on.

alamb commented 3 days ago

You know it occurs to me that @dmitrybugakov / @jayzhan211 may be working on a similar feature with a different approach on https://github.com/apache/datafusion/issues/10456 🤔 I