apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.33k stars 1.2k forks source link

Support duplicate column aliases in queries #13489

Open findepi opened 2 days ago

findepi commented 2 days ago

Which issue does this PR close?

Rationale for this change

In SQL, selecting single column multiple times is legal and most modern databases support this. This commit adds such support to DataFusion too.

What changes are included in this PR?

Are these changes tested?

yes

Are there any user-facing changes?

yes, more valid queries are supported

jonahgao commented 1 day ago

allow creation of schemas for duplicated names

I think it's a good idea, as it allows the schema of the top plan to include duplicate names, thereby resolving #6543.

We can delay the name ambiguity check until a real column reference occurs. But currently, it seems that this check is not sufficient. For example

DataFusion CLI v43.0.0
> select t.a from (select 1 as a, 2 as a) t;
+---+
| a |
+---+
| 1 |
+---+
1 row(s) fetched.

This query did not return an error like it does in PostgreSQL and before. Perhaps we should improve ambiguity check when searching for field names from schemas after removing check_names.

findepi commented 1 day ago

We can delay the name ambiguity check until a real column reference occurs. But currently, it seems that this check is not sufficient. For example

DataFusion CLI v43.0.0
> select t.a from (select 1 as a, 2 as a) t;

Good catch. This is easy to solve.

The less easy part is that

select * from (select 1 as a, 2 as a) t;

should work. However, the * gets expanded to Expr expressions and these expressions have no way to differentiate between the two columns from a. This is because schema is used for both initial query analysis as well as in logical plans. Relates to https://github.com/apache/datafusion/issues/1468.

jonahgao commented 21 hours ago

The less easy part is that

select * from (select 1 as a, 2 as a) t;

should work. However, the * gets expanded to Expr expressions and these expressions have no way to differentiate between the two columns from a. This is because schema is used for both initial query analysis as well as in logical plans. Relates to #1468.

We might need to introduce column index to differentiate them. Since this case was not previously supported either, maybe we can handle it later.