Open samuelcolvin opened 1 week ago
For anyone else looking for this, the solution is to implement supports_filters_pushdown
on the table, then use that to apply filters within scan
.
Feel free to close this if you like.
I think using the reported statistics to prune using datafusion::physical_optimizer::pruning::PruningPredicate would be a nice improvement (so each table provider didn't have to apply the basic min/max filters themselves)
Relabed from bug to feature -- thanks @samuelcolvin
@alamb using PruningPredicate
makes sense, but please can you point me at where I need to make changes to add this functionality?
@alamb using
PruningPredicate
makes sense, but please can you point me at where I need to make changes to add this functionality?
I recommend updating the existing though poorly named AggregateStatistics pass
And there you could potentially call ExecutionPlan::statistics()
and use PruningPredicate to prove that some predicates can't be true and replace them with PlaceholderRowExec
(Thank you for working on this, bTW)
@samuelcolvin have you already started to write code?
No code yet, if you'd like to work on this, feel free.
Actually I'm on a flight today, so might have some time to work on this.
Progress update, I've got min & max stats pruning "working" in our code, however I immediately ran into #10536, I'll let you know how I get on.
You know it occurs to me that @dmitrybugakov / @jayzhan211 may be working on a similar feature with a different approach on https://github.com/apache/datafusion/issues/10456 🤔 I
Describe the bug
Maybe related to #5535, but I couldn't find anything identical, so created a fresh issue.
If this is a known bug and you think the fix might be moderate in scope, I'm happy to have a go at fixing it?
To Reproduce
I have a custom
TableProvider
andExecutionPlan
, where callingexecute
is somewhat expensive and I want to avoid calling it if no data will match.The execution plan can return helpful statistics from
.statistics()
, including for example, for one column:E.g. "in this column all values are equal to 4". This is successfully used by Datafusion if I query
value is null
, theexecute()
function is never alled.But if I query
value > 5
orvalue < 0
, the statistic is ignored andexecute()
is still called.Expected behavior
min_value
andmax_value
ofColumnStatistics
should be used for pruning and the query plan should not require the "slow" execute method to be called.Additional context
I can give a fairly minimal example if required, but I thought best to report the issue and check if it was well known before going to that effort?
I've tried this on both
main
(as of today) and37.1.0