Support duplicate column aliases in queries

apache / datafusion

Apache DataFusion SQL Query Engine

https://datafusion.apache.org/

Apache License 2.0

6.33k stars 1.2k forks source link

Support duplicate column aliases in queries #13489

Open findepi opened 2 days ago

findepi commented 2 days ago

Which issue does this PR close?

Rationale for this change

In SQL, selecting single column multiple times is legal and most modern databases support this. This commit adds such support to DataFusion too.

What changes are included in this PR?

allow creation of schemas for duplicated names. DFSchema is used in logical plan to describe output of a relational operator, and In SQL, this is totally valid to have duplicated names
- a better long term fix would be to limit Schema use for field resolution during initial query plan building and use unambiguous "symbols" or "variables" in the logical plan. This would fall under https://github.com/apache/datafusion/issues/12723
add checks for CREATE TABLE and CREATE VIEW to disallow creation of tables with duplicate field names
- previously this was taken care of by schema construction checks, but only partially, as witnessed by https://github.com/apache/datafusion/issues/13487

Are these changes tested?

yes

Are there any user-facing changes?

yes, more valid queries are supported

jonahgao commented 1 day ago

allow creation of schemas for duplicated names

I think it's a good idea, as it allows the schema of the top plan to include duplicate names, thereby resolving #6543.

We can delay the name ambiguity check until a real column reference occurs. But currently, it seems that this check is not sufficient. For example

DataFusion CLI v43.0.0
> select t.a from (select 1 as a, 2 as a) t;
+---+
| a |
+---+
| 1 |
+---+
1 row(s) fetched.

This query did not return an error like it does in PostgreSQL and before. Perhaps we should improve ambiguity check when searching for field names from schemas after removing check_names.

findepi commented 1 day ago

We can delay the name ambiguity check until a real column reference occurs. But currently, it seems that this check is not sufficient. For example
DataFusion CLI v43.0.0
> select t.a from (select 1 as a, 2 as a) t;

Good catch. This is easy to solve.

The less easy part is that

select * from (select 1 as a, 2 as a) t;

should work. However, the * gets expanded to Expr expressions and these expressions have no way to differentiate between the two columns from a. This is because schema is used for both initial query analysis as well as in logical plans. Relates to https://github.com/apache/datafusion/issues/1468.

jonahgao commented 21 hours ago

The less easy part is that
select * from (select 1 as a, 2 as a) t;
should work. However, the * gets expanded to Expr expressions and these expressions have no way to differentiate between the two columns from a. This is because schema is used for both initial query analysis as well as in logical plans. Relates to #1468.

We might need to introduce column index to differentiate them. Since this case was not previously supported either, maybe we can handle it later.