Open alamb opened 9 months ago
This also came up as potentially part of a solution for https://github.com/apache/arrow-datafusion/issues/8582
This is an interesting topic / improvement!
Two suggestions from my side:
PostgreSQL supports [NOT] MATERIALIZED to force / disable CTE materialization. Often it's much faster to disable materialization, as otherwise pushdown / CBO optimizations will not apply (as you suggested). It might be interesting to consider supporting this in DataFusion as well, as it is hard from an optimizer standpoint to decide what to do.
https://www.postgresql.org/docs/current/queries-with.html#QUERIES-WITH-CTE-MATERIALIZATION
"where the same stream can be consumed at different rates potentially needing to buffer the entire intermediate result or else the plan will deadlock"
I think this is very similar to our repartitioning code and the trade-offs and problems we see there. The reason is that a repartition is basically also a single input with multiple consumers. Just think of it like the same data but with a column "bool: belongs to this output" added.
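The analogy can be sketched in a few lines of Python. This is a toy model, not DataFusion's repartitioning code: the function name `fan_out` and the `belongs` callback are assumptions made up for illustration. Each row carries one "belongs to this output" flag per consumer. Repartitioning sets exactly one flag per row, while a reused CTE result sets all of them.

```python
from typing import Callable, List, Sequence, TypeVar

Row = TypeVar("Row")


def fan_out(
    rows: Sequence[Row],
    n_outputs: int,
    belongs: Callable[[Row, int], bool],
) -> List[List[Row]]:
    """Route each row to every output whose "belongs to this output" flag is true."""
    outputs: List[List[Row]] = [[] for _ in range(n_outputs)]
    for row in rows:
        for i in range(n_outputs):
            if belongs(row, i):
                outputs[i].append(row)
    return outputs


rows = [1, 2, 3, 4, 5]

# Repartitioning (modulo stands in for a hash): exactly one flag per row is true.
repartitioned = fan_out(rows, 2, lambda row, i: row % 2 == i)

# Reusing a CTE result: every flag is true, so each consumer sees all rows.
shared = fan_out(rows, 2, lambda row, i: True)

print(repartitioned)  # [[2, 4], [1, 3, 5]]
print(shared)         # [[1, 2, 3, 4, 5], [1, 2, 3, 4, 5]]
```

In both cases there is one input and several consumers, so the same back-pressure and buffering questions arise.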
Is your feature request related to a problem or challenge?
The core use case is:
DataFusion will effectively run the subquery x three times (it will basically copy the LogicalPlan for x wherever it is used).
This design has certain benefits:
If the UNION ALL arms have different predicates, they could potentially be pushed down in one branch but not the others.
Describe the solution you'd like
However, in many cases it would likely be better to do the expensive join only once and reuse the results like this:
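As a rough Python analogy (not DataFusion code, and not SQL), the difference between copying the plan into each arm and reusing one result can be shown by counting how often the expensive step runs. The function `expensive_join`, the row shape, and the three predicates are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List

Row = Dict[str, int]
calls = {"expensive_join": 0}


def expensive_join() -> List[Row]:
    """Hypothetical stand-in for the costly subquery x."""
    calls["expensive_join"] += 1
    return [{"id": i, "v": i * 10} for i in range(6)]


# One predicate per UNION ALL arm (made up for illustration).
predicates: List[Callable[[Row], bool]] = [
    lambda r: r["v"] < 20,
    lambda r: r["v"] == 30,
    lambda r: r["v"] > 40,
]

# Inlined: each UNION ALL arm re-runs x.
calls["expensive_join"] = 0
inlined = [r for p in predicates for r in expensive_join() if p(r)]
inlined_calls = calls["expensive_join"]

# Reused: x is computed once and every arm filters the shared result.
calls["expensive_join"] = 0
x = expensive_join()
reused = [r for p in predicates for r in x if p(r)]
reused_calls = calls["expensive_join"]

print(inlined_calls, reused_calls)  # 3 1
print(inlined == reused)            # True
```

The toy model hides the other side of the trade-off: in the inlined form each arm's predicate could be pushed into its own copy of x, which the shared form cannot do.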
Describe alternatives you've considered
I think there are several considerations for this design. The biggest is that it is a 'diamond' plan where the same stream can be consumed at different rates potentially needing to buffer the entire intermediate result or else the plan will deadlock.
For example
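A small simulation shows how the buffering requirement depends on the order in which consumers pull. This is a sketch under stated assumptions, not DataFusion code: `peak_buffered` and the `pulls` schedule are hypothetical names, and the shared input is modeled as a sequence of numbered batches that is produced only on demand and handed to every consumer.

```python
from collections import deque
from typing import Deque, List, Sequence


def peak_buffered(n_batches: int, n_consumers: int, pulls: Sequence[int]) -> int:
    """Return the largest number of distinct batches held in buffers at once.

    `pulls` lists which consumer asks for its next batch at each step.
    """
    queues: List[Deque[int]] = [deque() for _ in range(n_consumers)]
    produced = 0
    peak = 0
    for consumer in pulls:
        if not queues[consumer]:
            if produced == n_batches:
                continue  # this consumer already saw the whole stream
            # Run the shared input once and hand the batch to every consumer.
            for q in queues:
                q.append(produced)
            produced += 1
        queues[consumer].popleft()
        # The slowest consumer's queue holds every batch still retained.
        peak = max(peak, max(len(q) for q in queues))
    return peak


# Both consumers advance in lock step: one batch is buffered at a time.
print(peak_buffered(4, 2, [0, 1] * 4))        # 1

# Consumer 0 drains the stream before consumer 1 starts reading:
# the entire intermediate result has to be buffered.
print(peak_buffered(4, 2, [0] * 4 + [1] * 4))  # 4
```

If the buffer were bounded below that peak, the producer in the second schedule would block waiting for consumer 1, while consumer 1 is not reading until consumer 0 finishes, which is the deadlock described above.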
Additional context
This came from a Discord thread from @sergiimk