Open jonmmease opened 1 year ago
Looks like there's something going wrong with column naming. I added some print statements to this function to log the left, right, and on arguments.
Without the DISTINCT qualifier, it looks like this:
left: {Column { name: "colA", index: 0 }, Column { name: "colB", index: 1 }}
right: {Column { name: "colB", index: 0 }, Column { name: "colC", index: 1 }}
on: [(Column { name: "colB", index: 1 }, Column { name: "colB", index: 0 })]
With the DISTINCT qualifier, it looks like this:
left: {Column { name: "colA", index: 0 }, Column { name: "colB", index: 1 }}
right: {Column { name: "tbl.colB", index: 0 }, Column { name: "colC", index: 1 }}
on: [(Column { name: "colB", index: 1 }, Column { name: "colB", index: 0 })]
So I think the direct cause of this error is that the colB column gets named tbl.colB at some point during physical planning.
Commenting out all of the physical optimizers does not fix the issue.
A question, if anyone has made it this far: is it valid for the name of physical Column instances to have the form {table_name}.{column_name}?
I'm wondering if it's an error for Column { name: "tbl.colB", index: 0 } to exist at all, or if that's fine and the error is that the join isn't equating tbl.colB with colB.
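The mismatch in the printed output can be modelled with a small standalone sketch. It does not use DataFusion: the Column struct and the missing_join_columns helper below are hypothetical stand-ins, and the sketch assumes that join keys are looked up by exact (name, index) equality, which is what the printed values suggest but is not confirmed here.

```rust
use std::collections::HashSet;

// Hypothetical stand-in for a physical column, mirroring the two fields
// that appear in the debug output (name and index).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct Column {
    name: String,
    index: usize,
}

impl Column {
    fn new(name: &str, index: usize) -> Self {
        Column { name: name.to_string(), index }
    }
}

// Hypothetical check (an assumption, not DataFusion's actual code): each join
// key must be present in the column set of its own side, compared by exact
// (name, index) equality. Returns the keys that could not be found.
fn missing_join_columns(
    left: &HashSet<Column>,
    right: &HashSet<Column>,
    on: &[(Column, Column)],
) -> Vec<Column> {
    let mut missing = Vec::new();
    for (l, r) in on {
        if !left.contains(l) {
            missing.push(l.clone());
        }
        if !right.contains(r) {
            missing.push(r.clone());
        }
    }
    missing
}

fn main() {
    let left = HashSet::from([Column::new("colA", 0), Column::new("colB", 1)]);
    let on = vec![(Column::new("colB", 1), Column::new("colB", 0))];

    // Without DISTINCT: the right side exposes a column named "colB".
    let right_plain = HashSet::from([Column::new("colB", 0), Column::new("colC", 1)]);
    assert!(missing_join_columns(&left, &right_plain, &on).is_empty());

    // With DISTINCT: the right side exposes "tbl.colB" at the same index,
    // so the join key "colB" is no longer found by name.
    let right_distinct = HashSet::from([Column::new("tbl.colB", 0), Column::new("colC", 1)]);
    let missing = missing_join_columns(&left, &right_distinct, &on);
    assert_eq!(missing, vec![Column::new("colB", 0)]);
}
```

Under this assumption the index still lines up (0 in both cases) and only the name differs, so the question above comes down to whether the name should have been rewritten or whether the lookup should tolerate it.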
The issue looks very similar to https://github.com/apache/arrow-datafusion/issues/4794#issuecomment-1369323927 and the cause is probably the same as described in https://github.com/apache/arrow-datafusion/issues/4794#issuecomment-1382626825
Thanks for making those connections, @askoa.
Also, after playing with it some more I found that commenting out the SingleDistinctToGroupBy logical plan optimizer rule "fixes" the issue I'm seeing in this query.
I think this is very related to https://github.com/apache/arrow-datafusion/pull/4050 by @andygrove
Here is the optimized logical plan that's generated (with SingleDistinctToGroupBy in place) for this issue's query:
Projection: tbl.colA, q1.colB, q1.colC
  Inner Join: Using tbl.colB = q1.colB
    TableScan: tbl projection=[colA, colB]
    SubqueryAlias: q1
      Projection: tbl.colB, COUNT(DISTINCT tbl.colA) AS colC
        Projection: group_alias_0 AS tbl.colB, COUNT(alias1) AS COUNT(DISTINCT tbl.colA)
          Aggregate: groupBy=[[group_alias_0]], aggr=[[COUNT(alias1)]]
            Aggregate: groupBy=[[tbl.colB AS group_alias_0, tbl.colA AS alias1]], aggr=[[]]
              TableScan: tbl projection=[colA, colB]
The group_alias_0 AS tbl.colB fragment (which is introduced by the SingleDistinctToGroupBy optimizer rule) creates a new unqualified column named "tbl.colB", which isn't the same thing as the original qualified column "tbl"."colB". The join on tbl.colB = q1.colB then fails to match the "tbl.colB" column during physical planning.
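The difference between the two columns is easy to miss because both print as tbl.colB. A minimal sketch of the distinction, using a hypothetical LogicalColumn type (an assumption for illustration, modelled as an optional relation qualifier plus a name, and not DataFusion's own type):

```rust
// Hypothetical model of a logical column reference: an optional relation
// (table) qualifier plus a column name.
#[derive(Debug, Clone, PartialEq, Eq)]
struct LogicalColumn {
    relation: Option<String>,
    name: String,
}

impl LogicalColumn {
    // A column qualified by a relation, e.g. "tbl"."colB".
    fn qualified(relation: &str, name: &str) -> Self {
        LogicalColumn {
            relation: Some(relation.to_string()),
            name: name.to_string(),
        }
    }

    // A column with no qualifier whose name may itself contain a dot.
    fn unqualified(name: &str) -> Self {
        LogicalColumn {
            relation: None,
            name: name.to_string(),
        }
    }

    // Flattened display form: "relation.name" when qualified, else "name".
    fn flat_name(&self) -> String {
        match &self.relation {
            Some(r) => format!("{}.{}", r, self.name),
            None => self.name.clone(),
        }
    }
}

fn main() {
    // The original scan column: qualifier "tbl", name "colB".
    let original = LogicalColumn::qualified("tbl", "colB");
    // The column produced by an alias like `group_alias_0 AS tbl.colB`:
    // no qualifier, and the whole string "tbl.colB" is the name.
    let aliased = LogicalColumn::unqualified("tbl.colB");

    // Both render identically in a printed plan...
    assert_eq!(original.flat_name(), aliased.flat_name());
    // ...but they are different columns, and their bare names differ.
    assert_ne!(original, aliased);
    assert_eq!(original.name, "colB");
    assert_eq!(aliased.name, "tbl.colB");
}
```

In this model, anything that later keeps only the bare name would see "colB" for one column and "tbl.colB" for the other, which is consistent with the physical column names printed earlier in the thread.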
Describe the bug
I'm seeing an error during physical planning for the following query
where tbl is a table with columns colA and colB (both of type UInt64).
Interestingly, planning and query evaluation work properly when the DISTINCT qualifier is removed from the count aggregation.
Context
The purpose of this query is to add a new column (colC) to the input table that contains the number of unique values of colA that correspond to each value of colB. This is a simplified reproduction of an issue that we're seeing in VegaFusion's implementation of the Vega pivot transform.
To Reproduce
Here is a Rust test that reproduces the error:
Expected behavior
Physical planning should complete without error.