Open Jefffrey opened 1 year ago
So I took an initial stab at this: https://github.com/Jefffrey/arrow-datafusion/commit/b5548e047d45ec8f286d2f84773feb87a21a3939
I found one issue which originated from how the SQL planner was generating Exprs from the SQL AST, specifically how it searches the schema for a matching column: it calls field_with_name(...), which eventually flows to index_of_column_by_name(...):
Here you can see it finds only the first match (if one exists), ignoring the case where there are multiple matches. Multiple matches are possible here, as in the original issue:
e.g. given the original issue, where the same schema can have fields s1.t.b and s2.t.b, where b is the column name and s1.t and s2.t are the qualifiers, when searching for t.b (qualifier t and column name b), it'll match both of them.
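To make the difference concrete, here is a minimal sketch of a first-match lookup next to an exhaustive one. It is not DataFusion's actual code: `Field`, `matches`, `first_match`, and `all_matches` are hypothetical names for illustration, and the rule that a reference's qualifier matches as a suffix of the field's qualifier is an assumption based on the `t.b` example above.

```rust
/// Hypothetical, simplified stand-in for a qualified schema field.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    /// Qualifier parts, e.g. ["s1", "t"] for `s1.t.b`.
    qualifier: Vec<String>,
    /// Column name, e.g. "b".
    name: String,
}

impl Field {
    fn new(qualifier: &[&str], name: &str) -> Self {
        Field {
            qualifier: qualifier.iter().map(|s| s.to_string()).collect(),
            name: name.to_string(),
        }
    }

    /// Assumed matching rule: names are equal and the reference's qualifier
    /// is a suffix of the field's qualifier (so `t` matches `s1.t`).
    fn matches(&self, qualifier: &[&str], name: &str) -> bool {
        self.name == name
            && self.qualifier.len() >= qualifier.len()
            && self
                .qualifier
                .iter()
                .rev()
                .zip(qualifier.iter().rev())
                .all(|(have, want)| have.as_str() == *want)
    }
}

/// First-match lookup: returns the index of the first matching field and
/// never notices that later fields match as well.
fn first_match(fields: &[Field], qualifier: &[&str], name: &str) -> Option<usize> {
    fields.iter().position(|f| f.matches(qualifier, name))
}

/// Exhaustive lookup: returns the index of every matching field, so the
/// caller can tell whether the reference is ambiguous.
fn all_matches(fields: &[Field], qualifier: &[&str], name: &str) -> Vec<usize> {
    fields
        .iter()
        .enumerate()
        .filter(|(_, f)| f.matches(qualifier, name))
        .map(|(i, _)| i)
        .collect()
}
```

With fields s1.t.b and s2.t.b, `first_match` for t.b silently picks the s1 field, while `all_matches` returns both indices and leaves the decision to the caller.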
I tried to preserve the original behavior of allowing multiple matches, and introduced a scoring system to try to pick the best match, but it still didn't solve the issue. I believe that somewhere further down the line in planning, the column is still resolved without a proper ambiguity check.
I wonder if it's better to try to centralize the ambiguity checks somewhere, instead of hunting down the different places that can resolve columns and implementing the checks there. For example, a new analysis rule to resolve column references, though that would require changes to the planner (large impact?).
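As a rough sketch of what a single, centralized resolution point could look like (purely illustrative, not an existing DataFusion API: `Field`, `ResolveError`, and `resolve_column` are hypothetical names, and suffix matching on the qualifier is an assumption):

```rust
/// Hypothetical, simplified schema field: qualifier parts plus column name.
#[derive(Debug, Clone, PartialEq)]
struct Field<'a> {
    qualifier: Vec<&'a str>,
    name: &'a str,
}

impl<'a> Field<'a> {
    /// Renders the field as a dotted name, e.g. `s1.t.b`.
    fn qualified_name(&self) -> String {
        let mut parts = self.qualifier.clone();
        parts.push(self.name);
        parts.join(".")
    }
}

/// Hypothetical error type for column resolution.
#[derive(Debug, PartialEq)]
enum ResolveError {
    /// No field matched the reference.
    NotFound,
    /// More than one field matched; carries the candidates' dotted names.
    Ambiguous(Vec<String>),
}

/// The one place where a column reference is resolved. It collects every
/// match and succeeds only when exactly one field matches.
fn resolve_column<'a>(
    fields: &[Field<'a>],
    qualifier: &[&'a str],
    name: &str,
) -> Result<usize, ResolveError> {
    let matches: Vec<usize> = fields
        .iter()
        .enumerate()
        // Assumption: the reference's qualifier must be a suffix of the
        // field's qualifier, so `t` matches both `s1.t` and `s2.t`.
        .filter(|(_, f)| f.name == name && f.qualifier.ends_with(qualifier))
        .map(|(i, _)| i)
        .collect();

    match matches.as_slice() {
        [] => Err(ResolveError::NotFound),
        [index] => Ok(*index),
        _ => Err(ResolveError::Ambiguous(
            matches.iter().map(|&i| fields[i].qualified_name()).collect(),
        )),
    }
}
```

If every code path that resolves columns went through one function like this, t.b against s1.t.b and s2.t.b would fail with an ambiguity error, while s1.t.b would resolve cleanly, and the rule could be documented and unit tested in one place.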
Thoughts @alamb ?
> I wonder if it's better to try to centralize the ambiguity checks somewhere, instead of hunting down the different places that can resolve columns and implementing the checks there. For example, a new analysis rule to resolve column references, though that would require changes to the planner (large impact?).
Yes, I think consolidating the ambiguity checks and the resolution of column references (where they can be more easily documented and unit tested) would be very valuable and would make DataFusion easier to work with and understand.
Describe the bug
When joining two identical tables from different schemas and selecting a column using a table qualifier as part of the identifier, the planner should do an ambiguity check and fail if the reference is to an ambiguous column.
To Reproduce
Via datafusion-cli:
Can see there is an identical table t in both schemas s1 and s2, and selecting the t.b column (see it's qualified with the table) should do an ambiguity check, as it could be in either table.
To note, if the column is not qualified at all and left as b, then the ambiguity check will occur and return an error.
Similarly, if schema & table are identical but in separate catalogs, the issue also occurs:
Expected behavior
Should return error about ambiguous column
Additional context
The ambiguity check was fixed in https://github.com/apache/arrow-datafusion/pull/5509, but it seems this only accounted for unqualified columns, not qualified ones.