Consider introducing unique expression IDs in Logical/Physical plan

Jefffrey commented 5 months ago

Is your feature request related to a problem or challenge?

In Spark, they have a concept of ExprId which is used to uniquely identify named expressions:

https://github.com/apache/spark/blob/9bb358b51e30b5041c0cd20e27cf995aca5ed4c7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala#L41-L57

/**
 * A globally unique id for a given named expression.
 * Used to identify which attribute output by a relation is being
 * referenced in a subsequent computation.
 *
 * The `id` field is unique within a given JVM, while the `uuid` is used to uniquely identify JVMs.
 */
case class ExprId(id: Long, jvmId: UUID) {

  override def equals(other: Any): Boolean = other match {
    case ExprId(id, jvmId) => this.id == id && this.jvmId == jvmId
    case _ => false
  }

  override def hashCode(): Int = id.hashCode()

}

Is it worth as attempting to introduce something similar in DataFusion?

There are issues being caused by rules in the optimizer comparing directly on column name leading to bugs when duplicate names appear, such as https://github.com/apache/arrow-datafusion/issues/8374

If during the analysis of a plan we can assign unique numeric IDs for columns, we could check for column equality based on these IDs and not need to compare string names.

The obvious downside would be this seems like a large effort in refactoring, not to mention breaking changes.

Describe the solution you'd like

Consider introduction of unique ID for columns/expressions to potentially simplify optimization/planning code

Describe alternatives you've considered

Don't do this (large refactoring effort? breaking changes?)

Additional context

Just a thought I had bouncing in my head, would appreciate to hear more thoughts on this (even if this seems unfeasible), or if there was already some prior discussion on a similar topic

alamb commented 5 months ago

I wonder what "unique" means ? Like every newly created Expr gets some sort of id?

Some examples:

// would expr1 and expr2 have the same id?
let expr1 = col("foo");
let expr2 = col("foo");

// would expr1 and expr2 have the same id?
let expr1 = col("foo");
let expr2 = expr1.clone()

Jefffrey commented 5 months ago

Actually I think I was off the mark on what ExprId is intended to do, it seems it would be more useful if there were a new LogicalExpr enum such as AttributeReference, which would refer to an expr from the parent plan by ExprId

Like given a logical plan:

Projection: a.int_col, b.double_col, CAST(a.date_string_col AS Utf8)
  Inner Join: a.int_col = b.int_col
    SubqueryAlias: a
      Projection: alltypes_plain.int_col, alltypes_plain.date_string_col
        Filter: alltypes_plain.id > Int32(1)
          TableScan: alltypes_plain projection=[id, int_col, date_string_col], partial_filters=[alltypes_plain.id > Int32(1)]
    SubqueryAlias: b
      Projection: alltypes_plain.int_col, alltypes_plain.double_col
        Filter: CAST(alltypes_plain.tinyint_col AS Float64) < alltypes_plain.double_col
          TableScan: alltypes_plain projection=[tinyint_col, int_col, double_col], partial_filters=[CAST(alltypes_plain.tinyint_col AS Float64) < alltypes_plain.double_col]

That top level projection has a.int_col as a Column for example, which when turned into physical plan needs to search the parent schema by name

https://github.com/apache/arrow-datafusion/blob/a6e6d3fab083839239ef81cf3a3546dd8929a541/datafusion/core/src/physical_planner.rs#L879-L891

Whereas with exprid's, it could be possible for a.int_col to be an AttributeReference which references the parent expr list to point to which expr it references by id.

And I think each new expr would have a new ID.

Honestly I could be way off the mark here on the usages/benefits of exprid 😅

It's just something I was thinking about, especially in relation to how verbose it can be to check if columns are the same when taking into account table, schema and catalog parts of the identifier for a column

See troubles with ambiguity check here https://github.com/apache/arrow-datafusion/issues/6012

So instead of having to find the original column of a projected column in a logical plan via name during logical optimization/physical planning, could have that done once off in an analyzer rule pass then afterwards use exprids

alamb commented 5 months ago

What you describe seems similar to the usize index in physical_plan Column references. And as I recall @houqp added exactly to handle this case of ambiguous schema 🤔

I believe Postgres has a similar way of addressing fields (it does so by index rather than column name) in its logical exprs.

The downside I remember from working wit postgres was that there were many potential hard to track down issues when the indexes got messed up.

It might be interesting to see what this would look like in DataFusion 🤔

tv42 commented 5 months ago

Related to this, SQLite and Postgres allow duplicate column names in results. From SQLite tests:

SELECT + COUNT( * ) AS col1, - 92 - + - 47 AS col1

Datafusion errors out with

query failed: Error during planning: Projections require unique expression names but the expression "COUNT(*) AS col1" at position 0 and "Int64(-92) - Int64(-47) AS col1" at position 1 have the same name. Consider aliasing ("AS") one of them.

Maybe after/as part of implementing this, Datafusion should relax that restriction?

alamb commented 4 months ago

Maybe after/as part of implementing this, Datafusion should relax that restriction?

One challenge is that most of the arrow-rs functionality in RecordBatch, for example, assumes unique column names (find_by_name for example). I don't think the Arrow spec itself restricts names to be unique, but I think the assumption may run pretty deep in the implementations

tv42 commented 4 months ago

I assume you mean column_by_name, https://docs.rs/datafusion/latest/datafusion/common/arrow/record_batch/struct.RecordBatch.html#method.column_by_name?

Ecosystem survey:

SQLite picks the first matching column:

sqlite> select col1 from (SELECT + COUNT( * ) AS col1, - 92 - + - 47 AS col1);
1

Postgres errors at column access time if there are duplicates:

postgres=# select col1 from (SELECT + COUNT( * ) AS col1, - 92 - + - 47 AS col1) as foo;
ERROR:  column reference "col1" is ambiguous
LINE 1: select col1 from (SELECT + COUNT( * ) AS col1, - 92 - + - 47...
             ^

tv42 commented 2 weeks ago

This might be a duplicate of #6543?

alamb commented 2 weeks ago

@tv42 I agree it is certainly related

apache / datafusion