Open Jefffrey opened 5 months ago
I wonder what "unique" means ? Like every newly created Expr
gets some sort of id?
Some examples:
// would expr1 and expr2 have the same id?
let expr1 = col("foo");
let expr2 = col("foo");
// would expr1 and expr2 have the same id?
let expr1 = col("foo");
let expr2 = expr1.clone()
Actually I think I was off the mark on what ExprId
is intended to do, it seems it would be more useful if there were a new LogicalExpr enum such as AttributeReference
, which would refer to an expr from the parent plan by ExprId
Like given a logical plan:
Projection: a.int_col, b.double_col, CAST(a.date_string_col AS Utf8)
Inner Join: a.int_col = b.int_col
SubqueryAlias: a
Projection: alltypes_plain.int_col, alltypes_plain.date_string_col
Filter: alltypes_plain.id > Int32(1)
TableScan: alltypes_plain projection=[id, int_col, date_string_col], partial_filters=[alltypes_plain.id > Int32(1)]
SubqueryAlias: b
Projection: alltypes_plain.int_col, alltypes_plain.double_col
Filter: CAST(alltypes_plain.tinyint_col AS Float64) < alltypes_plain.double_col
TableScan: alltypes_plain projection=[tinyint_col, int_col, double_col], partial_filters=[CAST(alltypes_plain.tinyint_col AS Float64) < alltypes_plain.double_col]
That top level projection has a.int_col
as a Column
for example, which when turned into physical plan needs to search the parent schema by name
Whereas with exprid's, it could be possible for a.int_col
to be an AttributeReference which references the parent expr list to point to which expr it references by id.
And I think each new expr would have a new ID.
Honestly I could be way off the mark here on the usages/benefits of exprid 😅
It's just something I was thinking about, especially in relation to how verbose it can be to check if columns are the same when taking into account table, schema and catalog parts of the identifier for a column
So instead of having to find the original column of a projected column in a logical plan via name during logical optimization/physical planning, could have that done once off in an analyzer rule pass then afterwards use exprids
What you describe seems similar to the usize
index
in physical_plan
Column references. And as I recall @houqp added exactly to handle this case of ambiguous schema 🤔
I believe Postgres has a similar way of addressing fields (it does so by index rather than column name) in its logical exprs.
The downside I remember from working wit postgres was that there were many potential hard to track down issues when the indexes got messed up.
It might be interesting to see what this would look like in DataFusion 🤔
Related to this, SQLite and Postgres allow duplicate column names in results. From SQLite tests:
SELECT + COUNT( * ) AS col1, - 92 - + - 47 AS col1
Datafusion errors out with
query failed: Error during planning: Projections require unique expression names but the expression "COUNT(*) AS col1" at position 0 and "Int64(-92) - Int64(-47) AS col1" at position 1 have the same name. Consider aliasing ("AS") one of them.
Maybe after/as part of implementing this, Datafusion should relax that restriction?
Maybe after/as part of implementing this, Datafusion should relax that restriction?
One challenge is that most of the arrow-rs functionality in RecordBatch, for example, assumes unique column names (find_by_name
for example). I don't think the Arrow spec itself restricts names to be unique, but I think the assumption may run pretty deep in the implementations
I assume you mean column_by_name
, https://docs.rs/datafusion/latest/datafusion/common/arrow/record_batch/struct.RecordBatch.html#method.column_by_name?
Ecosystem survey:
sqlite> select col1 from (SELECT + COUNT( * ) AS col1, - 92 - + - 47 AS col1);
1
Postgres errors at column access time if there are duplicates:
postgres=# select col1 from (SELECT + COUNT( * ) AS col1, - 92 - + - 47 AS col1) as foo;
ERROR: column reference "col1" is ambiguous
LINE 1: select col1 from (SELECT + COUNT( * ) AS col1, - 92 - + - 47...
^
This might be a duplicate of #6543?
@tv42 I agree it is certainly related
Is your feature request related to a problem or challenge?
In Spark, they have a concept of
ExprId
which is used to uniquely identify named expressions:https://github.com/apache/spark/blob/9bb358b51e30b5041c0cd20e27cf995aca5ed4c7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala#L41-L57
Is it worth as attempting to introduce something similar in DataFusion?
There are issues being caused by rules in the optimizer comparing directly on column name leading to bugs when duplicate names appear, such as https://github.com/apache/arrow-datafusion/issues/8374
If during the analysis of a plan we can assign unique numeric IDs for columns, we could check for column equality based on these IDs and not need to compare string names.
The obvious downside would be this seems like a large effort in refactoring, not to mention breaking changes.
Describe the solution you'd like
Consider introduction of unique ID for columns/expressions to potentially simplify optimization/planning code
Describe alternatives you've considered
Don't do this (large refactoring effort? breaking changes?)
Additional context
Just a thought I had bouncing in my head, would appreciate to hear more thoughts on this (even if this seems unfeasible), or if there was already some prior discussion on a similar topic