apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.49k stars 1.02k forks source link

Apply guarantee rewriter to sql workflow #10456

Open jayzhan211 opened 1 month ago

jayzhan211 commented 1 month ago

Is your feature request related to a problem or challenge?

While deprecatingExpr::GetIndexedField, I found there are many test cases that are not covered in sqllogictest, for example, test_inequalities_non_null_bounded. Since we hope to replace the field API with get_field. We could either move the test to datafusion/core/tests or sqllogictest. I prefer the latter, then, I found that guarantee rewrite is not applied to SQL workflow.

statement ok
create table t (c int) as values (1), (3), (5);

query TT
explain select struct(c) from t where c between 3 and 1;
----
logical_plan
01)Projection: struct(t.c)
02)--Filter: t.c >= Int32(3) AND t.c <= Int32(1)
03)----TableScan: t projection=[c]
physical_plan
01)ProjectionExec: expr=[struct(c@0) as struct(t.c)]
02)--RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
03)----CoalesceBatchesExec: target_batch_size=8192
04)------FilterExec: c@0 >= 3 AND c@0 <= 1
05)--------MemoryExec: partitions=1, partition_sizes=[1]

statement ok
drop table t;

I expect that FilterExec should be removed or converted to something like False, since the condition here is always false.

Describe the solution you'd like

Apply guarantee_rewriter to sql workflow. If the simplification logic can be included in Simplifier is a plus.

Describe alternatives you've considered

No response

Additional context

PR that introduce guarantee rewrite https://github.com/apache/datafusion/pull/7467

No response

yyy1000 commented 1 month ago

Update: #10463 is not what this issue expects, I will see what to do next. I can help it. :) An experimental PR is #10463

yyy1000 commented 1 month ago

On a second glance, I feel it's difficult. šŸ˜„ When simplifying a logicalplan, it seems impossible to get the underlying data which could making guarantees.

alamb commented 1 month ago

I think this may be another example of what @samuelcolvin was suggesting on https://github.com/apache/datafusion/issues/10400

I think we could use ExecutionPlan::statistics to get the guarantee information

dmitrybugakov commented 1 month ago

@jayzhan211 Do I understand correctly that the best option is to incorporate the guarantee logic into the simplifier based on statistics and remove the old version of the guarantee?

jayzhan211 commented 1 month ago

@jayzhan211 Do I understand correctly that the best option is to incorporate the guarantee logic into the simplifier based on statistics and remove the old version of the guarantee?

I think so.

alamb commented 1 month ago

old version of the guarantee?

What does "old version of the guarantee?" refer to?

dmitrybugakov commented 1 month ago

What does "old version of the guarantee?" refer to?

https://github.com/apache/datafusion/blob/1d4b808c6f46bead241166fb1ed4646b3d17b01d/datafusion/optimizer/src/simplify_expressions/guarantees.rs#L60

alamb commented 1 month ago

As I understand it, the usecase for GuaranteeRewriter when @wjones127 (maybe?) added it was for providing external information (outside of information that came from SQL). I don't think we should just remove the ability to do so

I would personally recommend add code that translates Statistics into Guarantees to pass to GuaranteeRewriter

We could discuss reworking how GuaranteeRewriter works as a follow on PR

jayzhan211 commented 3 weeks ago

@dmitrybugakov are you working on #10510?

jayzhan211 commented 3 weeks ago

@alamb Is it reasonable to evaluate column in ConstEvaluator and collect statistics for guarantee rewriter or should we avoid evaluation in logical optimization step and compute it in physical planner?

I'm thinking of passing schema and batch to ConstEvaluator to evaluate columns and updating statistics each passes for guarantee rewriter.

alamb commented 3 weeks ago

@alamb Is it reasonable to evaluate column in ConstEvaluator and collect statistics for guarantee rewriter or should we avoid evaluation in logical optimization step and compute it in physical planner?

I don't quite follow what you are proposing here.

As I I understand the idea on this ticket, the idea is to add a pass that knows how to use Statistics to simplify expressions by creating a Simplifier, and pass in the min and max values via with_guarantee

The challenges I see are:

  1. statistics not available in the LogicalPlan but only in in ExecutionPlan via ExecutionPlan::with_statistics
  2. The ExprSimplifier::simplify API is in terms of Exprs (not PhysicalExprs)

One potential thing you could do is use PruningPredicate for FilterExecs and try to prove inputs can never be true. However, that seems like it may not be particularly effective (as the number of queries where a filter will always be false is likely to be limited in importance)

jayzhan211 commented 3 weeks ago

One potential thing you could do is use PruningPredicate for FilterExecs and try to prove inputs can never be true.

It seems quite similar to the comments in #10400.

However, that seems like it may not be particularly effective (as the number of queries where a filter will always be false is likely to be limited in importance)

Maybe I should works on other issue šŸ¤”

alamb commented 3 weeks ago

Maybe I should works on other issue šŸ¤”

Maybe -- what are you interested in working on? Are you blocked on review of anything? I find it hard to keep up with what you are doing these days šŸƒ

jayzhan211 commented 3 weeks ago

Maybe I should works on other issue šŸ¤”

Maybe -- what are you interested in working on? Are you blocked on review of anything? I find it hard to keep up with what you are doing these days šŸƒ

I think #8708 is about 80% complete. I'm exploring the next interesting topic.

alamb commented 3 weeks ago

I think https://github.com/apache/datafusion/issues/8708 is about 80% complete. I'm exploring the next interesting topic.

Let me know if you would like help breaking down the work and filing some more follow on tickets (to organize getting some additional community help).

Depending on the kind of project you are interested in, here are some ideas (unsolicited) that I would love to help review:

  1. API design / extraction: https://github.com/apache/datafusion/issues/10782 with Catalog APIs
  2. Grouping performance -- either https://github.com/apache/datafusion/issues/9403 or https://github.com/apache/datafusion/issues/6937 (both are pretty tricky low level optimizations)

Rock on!

jayzhan211 commented 3 weeks ago

Improving grouping performance seems interesting!

alamb commented 2 weeks ago

Improving grouping performance seems interesting!

I think it would be awesome -- thank you. How would you like to proceed? I personally think either https://github.com/apache/datafusion/issues/9403 or https://github.com/apache/datafusion/issues/6937 are super valuable

For either, I think the key will be to do some sort of POC to make sure we can make performance improve before polishing too much.

Looking forward to working with you more