Open jayzhan211 opened 1 month ago
Update: #10463 is not what this issue expects; I will see what to do next.
I can help with it. :)
An experimental PR is #10463.
On a second glance, I feel it's difficult.
When simplifying a LogicalPlan, it seems impossible to get the underlying data that could be used to make guarantees.
I think this may be another example of what @samuelcolvin was suggesting on https://github.com/apache/datafusion/issues/10400
I think we could use ExecutionPlan::statistics to get the guarantee information
@jayzhan211 Do I understand correctly that the best option is to incorporate the guarantee logic into the simplifier based on statistics and remove the old version of the guarantee?
I think so.
What does "old version of the guarantee?" refer to?
As I understand it, the use case for GuaranteeRewriter, when @wjones127 (maybe?) added it, was providing external information (outside of information that came from SQL). I don't think we should just remove the ability to do so.
I would personally recommend adding code that translates Statistics into Guarantees to pass to GuaranteeRewriter.
We could discuss reworking how GuaranteeRewriter works as a follow-on PR.
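To make the translation idea concrete, here is a minimal sketch of turning column statistics into guarantees. Every name in it (ColumnStats, Guarantee, stats_to_guarantees) is a hypothetical stand-in written for illustration, not DataFusion's actual Statistics or GuaranteeRewriter API. It assumes integer columns, and it assumes that only exact min/max values are safe to use as guarantees.

```rust
// Hypothetical, simplified stand-ins; these are NOT DataFusion's real types.

/// Per-column statistics as a source might report them (None = unknown).
#[derive(Debug, Clone, PartialEq)]
pub struct ColumnStats {
    pub name: String,
    pub min: Option<i64>,
    pub max: Option<i64>,
    pub null_count: Option<usize>,
    /// Assumption: estimated (inexact) bounds must not become guarantees.
    pub exact: bool,
}

/// A guarantee about a column: its non-null values lie in [lower, upper].
#[derive(Debug, Clone, PartialEq)]
pub struct Guarantee {
    pub column: String,
    pub lower: i64,
    pub upper: i64,
    /// True only when the statistics prove there are no NULLs.
    pub not_null: bool,
}

/// Translate statistics into guarantees, skipping any column whose
/// bounds are unknown, inexact, or inconsistent.
pub fn stats_to_guarantees(stats: &[ColumnStats]) -> Vec<Guarantee> {
    stats
        .iter()
        .filter_map(|s| match (s.min, s.max) {
            (Some(lower), Some(upper)) if s.exact && lower <= upper => Some(Guarantee {
                column: s.name.clone(),
                lower,
                upper,
                not_null: s.null_count == Some(0),
            }),
            _ => None,
        })
        .collect()
}
```

Dropping a column when in doubt is the conservative choice: a missing guarantee only loses an optimization, while a wrong one changes query results.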
@dmitrybugakov are you working on #10510?
@alamb Is it reasonable to evaluate columns in ConstEvaluator and collect statistics for the guarantee rewriter, or should we avoid evaluation in the logical optimization step and compute it in the physical planner?
I'm thinking of passing the schema and a batch to ConstEvaluator to evaluate columns, updating the statistics on each pass for the guarantee rewriter.
I don't quite follow what you are proposing here.
As I understand the idea on this ticket, it is to add a pass that knows how to use Statistics to simplify expressions by creating a Simplifier and passing in the min and max values via with_guarantees.
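As a rough illustration of what such a pass would do with min and max values, the sketch below folds comparisons against known bounds. The Expr enum, Bounds struct, and simplify function are hypothetical toy types, not DataFusion's Expr or ExprSimplifier. It also assumes the guaranteed columns are non-null, since a comparison with NULL evaluates to NULL rather than true or false.

```rust
// Toy expression tree for illustration only; NOT DataFusion's Expr.
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Column(String),
    Literal(i64),
    Bool(bool),
    Gt(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
}

/// Assumed guarantee: every value of `column` is non-null and in [min, max].
pub struct Bounds {
    pub column: String,
    pub min: i64,
    pub max: i64,
}

/// Fold `column > literal` to a boolean when the bounds decide it,
/// then collapse AND nodes whose children became constants.
pub fn simplify(expr: Expr, guarantees: &[Bounds]) -> Expr {
    match expr {
        Expr::Gt(l, r) => {
            let (l, r) = (simplify(*l, guarantees), simplify(*r, guarantees));
            if let (Expr::Column(name), Expr::Literal(v)) = (&l, &r) {
                if let Some(b) = guarantees.iter().find(|b| &b.column == name) {
                    if b.min > *v {
                        return Expr::Bool(true); // every value exceeds v
                    }
                    if b.max <= *v {
                        return Expr::Bool(false); // no value exceeds v
                    }
                }
            }
            Expr::Gt(Box::new(l), Box::new(r))
        }
        Expr::And(l, r) => match (simplify(*l, guarantees), simplify(*r, guarantees)) {
            (Expr::Bool(false), _) | (_, Expr::Bool(false)) => Expr::Bool(false),
            (Expr::Bool(true), other) | (other, Expr::Bool(true)) => other,
            (l, r) => Expr::And(Box::new(l), Box::new(r)),
        },
        other => other,
    }
}
```

With x known to lie in [1, 5], `x > 10` folds to false and `x > 0` folds to true, while `x > 3` is left untouched because the bounds do not decide it.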
The challenges I see are:
- Statistics are not available in LogicalPlan but only in ExecutionPlan, via ExecutionPlan::with_statistics
- The ExprSimplifier::simplify API is in terms of Exprs (not PhysicalExprs)
One potential thing you could do is use PruningPredicate for FilterExecs and try to prove inputs can never be true. However, that seems like it may not be particularly effective (as the number of queries where a filter will always be false is likely to be limited in importance).
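The "prove the filter can never be true" idea can be sketched as follows. This is a toy model, not PruningPredicate or FilterExec: Predicate, Plan, MinMax, can_never_be_true, and prune_filters are all hypothetical names, and the statistics are assumed to be exact min/max values keyed by column name. Because a filter keeps only rows where the predicate is true, NULL values need no special handling here.

```rust
use std::collections::HashMap;

// Toy predicate and plan types for illustration; NOT DataFusion's types.
#[derive(Debug, Clone, PartialEq)]
pub enum Predicate {
    Gt(String, i64), // column > value
    Lt(String, i64), // column < value
    Eq(String, i64), // column = value
}

#[derive(Debug, Clone, PartialEq)]
pub enum Plan {
    Scan { table: String },
    Filter { predicate: Predicate, input: Box<Plan> },
    Empty,
}

/// Assumed exact bounds of a column's non-null values.
pub struct MinMax {
    pub min: i64,
    pub max: i64,
}

/// Returns true only when the bounds prove that no row can satisfy `pred`.
/// Columns without statistics are never pruned.
pub fn can_never_be_true(pred: &Predicate, stats: &HashMap<String, MinMax>) -> bool {
    match pred {
        Predicate::Gt(c, v) => stats.get(c).map_or(false, |s| s.max <= *v),
        Predicate::Lt(c, v) => stats.get(c).map_or(false, |s| s.min >= *v),
        Predicate::Eq(c, v) => stats.get(c).map_or(false, |s| *v < s.min || *v > s.max),
    }
}

/// Replace any filter that can never pass a row with an empty plan.
pub fn prune_filters(plan: Plan, stats: &HashMap<String, MinMax>) -> Plan {
    match plan {
        Plan::Filter { predicate, input } => {
            if can_never_be_true(&predicate, stats) {
                Plan::Empty
            } else {
                Plan::Filter { predicate, input: Box::new(prune_filters(*input, stats)) }
            }
        }
        other => other,
    }
}
```

This only helps queries whose filter is provably always false, which matches the concern that the technique may have limited impact.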
It seems quite similar to the comments in #10400.
Maybe I should work on another issue.
Maybe -- what are you interested in working on? Are you blocked on review of anything? I find it hard to keep up with what you are doing these days.
I think #8708 is about 80% complete. I'm exploring the next interesting topic.
Let me know if you would like help breaking down the work and filing some more follow on tickets (to organize getting some additional community help).
Depending on the kind of project you are interested in, here are some ideas (unsolicited) that I would love to help review:
Catalog APIs
Rock on!
Improving grouping performance seems interesting!
I think it would be awesome -- thank you. How would you like to proceed? I personally think either https://github.com/apache/datafusion/issues/9403 or https://github.com/apache/datafusion/issues/6937 are super valuable
For either, I think the key will be to do some sort of POC to make sure we can make performance improve before polishing too much.
Looking forward to working with you more
Is your feature request related to a problem or challenge?
While deprecating Expr::GetIndexedField, I found there are many test cases that are not covered in sqllogictest, for example, test_inequalities_non_null_bounded. Since we hope to replace the field API with get_field, we could either move the test to datafusion/core/tests or to sqllogictest. I prefer the latter, but then I found that the guarantee rewrite is not applied to the SQL workflow.
I expect that FilterExec should be removed or converted to something like False, since the condition here is always false.
Describe the solution you'd like
Apply guarantee_rewriter to the SQL workflow. If the simplification logic can be included in Simplifier, that is a plus.
Describe alternatives you've considered
No response
Additional context
PR that introduced the guarantee rewrite: https://github.com/apache/datafusion/pull/7467