Open nseekhao opened 1 year ago
Is your argument/concern that these two plans would produce different results?
[Original plan]
Projection: d1.b
  LeftSemi Join: d1.a = __correlated_sq_1.a Filter: __correlated_sq_1.e != d1.e
    SubqueryAlias: d1
      TableScan: data projection=[a, b, e]
    SubqueryAlias: __correlated_sq_1
      SubqueryAlias: d2
        TableScan: data projection=[a, e]
[Unoptimized plan from consumer]
Projection: data.b
  LeftSemi Join: data.a = data.a Filter: data.e != data.e
    TableScan: data projection=[a, b, e]
    TableScan: data projection=[a, e]
The aliases won't actually change the results. They appear identical to me.
Or is your concern that the aliases are lost because your application is depending on the aliases for some reason unrelated to the results?
Or is the concern that the lack of aliases is somehow causing the optimizer to generate an incorrect optimization?
Sorry to triple post my stream of consciousness. For context, I am asking because this came up in the Substrait community meeting today, and the consensus is that this seems to be more of a DataFusion issue (if DataFusion's optimizer is giving different results with and without aliases) than a Substrait issue. That being said, I think there are things we can do to support aliases in Substrait. I'll post a comment on your issue there as well.
Doing this properly with aliases depends on https://github.com/substrait-io/substrait/issues/571 / https://github.com/substrait-io/substrait/pull/649.
However, the problem lies more within DataFusion. The Substrait plans are valid, since Substrait only cares about column indices, but DF handles columns by (qualified) name and thus cannot handle duplicate columns. I think https://github.com/apache/datafusion/pull/11049 fixes most of these cases in practice for DF, but not all.
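To make the index-versus-name distinction concrete, here is a minimal sketch in Python. It is only a toy model of the comment above: `join_schema` and `resolve_by_name` are hypothetical helpers invented for illustration, not DataFusion or Substrait APIs, and the column lists are taken from the two TableScan projections in the consumer plan.

```python
# Toy model only: these helpers are hypothetical, not DataFusion or Substrait APIs.

def join_schema(left, right):
    """Concatenate the qualified column names of both join inputs."""
    return left + right

def resolve_by_name(schema, qualified_name):
    """Return the index of every column whose qualified name matches."""
    return [i for i, col in enumerate(schema) if col == qualified_name]

# Both inputs scan the same table, so without aliases the qualifiers collide.
left = ["data.a", "data.b", "data.e"]
right = ["data.a", "data.e"]
schema = join_schema(left, right)

# Index-based references stay unambiguous: 2 is the left e, 4 is the right e.
left_e, right_e = 2, 4
same_name = schema[left_e] == schema[right_e]

# Name-based lookup finds two candidates for "data.e" and cannot pick one.
matches = resolve_by_name(schema, "data.e")
print(same_name, matches)
```

In this model the two `e` columns are distinct positions in the joined schema, yet a lookup by qualified name cannot tell them apart, which is the situation described for the consumer output.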
Is your feature request related to a problem or challenge?
If there is a SubqueryAlias relation, datafusion-substrait will bypass it. This works for the producer: the generated Substrait plans are correct. However, the DF plan generated by the consumer will be incorrect, since it has no way to distinguish between the different relations that read from the same table. This can be demonstrated in these examples:
The original DF plan is:
Once this plan is fed through the producer, we get the correct Substrait plan:
However, if we want to get back a DF plan and use the consumer, we'll get:
Notice that because there is no way for DF to distinguish between the left data table and the right data table, DF thinks they are from the same TableScan relation. Thus, the output DF plan is incorrect.
Describe the solution you'd like
Preserve aliases in Substrait.
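As a rough illustration of why preserved aliases would help, the Python sketch below compares the qualified column names of the two join inputs with and without aliases. `qualify` and `has_duplicates` are hypothetical helpers made up for this example, not part of DataFusion or Substrait, and the aliases d1 and d2 are the ones from the original plan.

```python
# Toy model only: qualify and has_duplicates are hypothetical helpers.

def qualify(qualifier, columns):
    """Prefix each column name with a relation qualifier."""
    return [f"{qualifier}.{c}" for c in columns]

def has_duplicates(schema):
    """True if any qualified column name appears more than once."""
    return len(set(schema)) != len(schema)

# Without aliases, both join inputs are qualified by the table name.
without_alias = qualify("data", ["a", "b", "e"]) + qualify("data", ["a", "e"])

# With the aliases from the original plan, every qualified name is unique.
with_alias = qualify("d1", ["a", "b", "e"]) + qualify("d2", ["a", "e"])

print(has_duplicates(without_alias), has_duplicates(with_alias))
```

This only models name uniqueness. How the aliases would actually be carried through a Substrait plan is left to the discussion in the linked Substrait issue.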
Describe alternatives you've considered
N/A
Additional context
Additional example: