duckdb / duckdb

DuckDB is an analytical in-process SQL database management system
http://www.duckdb.org
MIT License

Scoping of lambda functions in `list_apply` and `list_filter` seems artificially restrictive #9981

Closed: cpcloud closed this issue 10 months ago

cpcloud commented 11 months ago

What happens?

Using a parameter name in a lambda function that matches the name of the input column results in a strange error message about needing an unqualified parameter name.

To Reproduce

D create or replace table t as select [1, 2, 3] as x;
D select list_apply(t.x, x -> x + 1) from t;
Error: Binder Error: Invalid parameter name 't.x': must be unqualified

Since the two `x`s are unrelated to each other here, I would expect this query to execute without issue.
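As an analogy (Python, not DuckDB), this is how lexical scoping usually behaves: a lambda parameter simply shadows an outer name of the same spelling, so no conflict arises:

```python
# Analogy in Python (not DuckDB): a lambda parameter shadows an outer
# variable of the same name because the binding is lexical.
x = [1, 2, 3]  # plays the role of column t.x

# The parameter x below shadows the outer list x inside the lambda body.
result = list(map(lambda x: x + 1, x))
print(result)  # [2, 3, 4]
```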

OS:

Linux (NixOS)

DuckDB Version:

v0.9.2

DuckDB Client:

CLI

Full Name:

Phillip Cloud

Affiliation:

Voltron Data

Have you tried this on the latest main branch?

I have tested with a main build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

eaubin commented 11 months ago

I noticed the same problem in the now-stale closed issue #5739.

taniabogatsch commented 11 months ago

Hi @cpcloud. I am unsure about this. We throw this error because when binding the RHS of the lambda function (x + 1 in your case), it is possible to reference other columns of the table, which may have the same name as the lambda parameter. The binder error is unrelated to the input list. For example, the following query also triggers it.

create or replace table t as select [1, 2, 3] as x;
select list_apply(['hello'], x -> x) from t;
Error: Binder Error: Invalid parameter name 't.x': must be unqualified

If we allow column names as lambda parameters, the following query would be ambiguous.

create or replace table t as select [[1], [2], [3]] as x;
select list_transform([[1], [2], [3]], x -> x[1]) from t;
Error: Binder Error: Invalid parameter name 't.x': must be unqualified
# possible result: x is a lambda parameter
[1, 2, 3]
# possible result: x is a column name in t
[[1], [1], [1]]
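The two readings can be sketched in Python (note that Python lists are 0-indexed while SQL lists are 1-indexed), assuming a single-row table whose column x holds [[1], [2], [3]]:

```python
# Sketch of the two possible readings of x[1] (SQL x[1] == Python x[0]).
input_list = [[1], [2], [3]]   # the literal passed to list_transform
column_x = [[1], [2], [3]]     # value of column x in the single row of t

# Reading 1: x is the lambda parameter, so x[1] means "first element
# of the current list element".
as_parameter = [x[0] for x in input_list]

# Reading 2: x is the column, so x[1] is the same value for every element.
as_column = [column_x[0] for _ in input_list]

print(as_parameter)  # [1, 2, 3]
print(as_column)     # [[1], [1], [1]]
```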

Using the column as the input list is possible, i.e., the following query works.

D create or replace table t as select [1, 2, 3] as x;
D select list_apply(t.x, y -> y + 1) from t;
┌─────────────────────────────────┐
│ list_apply(t.x, (y -> (y + 1))) │
│             int32[]             │
├─────────────────────────────────┤
│ [2, 3, 4]                       │
└─────────────────────────────────┘

I can look into making the error message clearer in this scenario, but I don't see a way to resolve the possible ambiguity between lambda parameter names and column names in the RHS of the lambda expression.

cpcloud commented 11 months ago

@taniabogatsch Thanks for looking into it.

I'm not sure I understand the ambiguity here, as long as there are well-defined scoping rules for lambdas.

create or replace table t as select [[1], [2], [3]] as x;
select list_transform([[1], [2], [3]], x -> x[1]) from t;
Error: Binder Error: Invalid parameter name 't.x': must be unqualified

Why would x be ambiguous here? It seems to me that this is unambiguously referring to the x argument of the lambda function and couldn't possibly ever be t.x without the table identifier qualifying it. I would expect lexical scoping here:

  1. x is first looked up as a local variable
  2. failing 1, x is looked up as a column
  3. failing 2, an error message is raised about not being able to resolve x
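The lookup order above can be sketched as a tiny resolver (hypothetical names, not DuckDB internals):

```python
# Hypothetical sketch of the proposed lookup order (not DuckDB code):
# 1. lambda parameters, 2. table columns, 3. unresolved -> error.
def resolve(name, lambda_params, columns):
    if name in lambda_params:
        return ("parameter", name)
    if name in columns:
        return ("column", name)
    raise NameError(f"could not resolve {name!r}")

# In `select list_apply(t.x, x -> x + 1) from t`, the body's `x` would
# resolve to the lambda parameter even though t also has a column x.
print(resolve("x", {"x"}, {"x"}))  # ('parameter', 'x')
```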

Other systems that implement equivalent functionality, such as ClickHouse and Trino, do not have this problem.

ClickHouse:

❯ clickhouse local
ClickHouse local version 23.11.1.1.

localhost :) create view t as select [1, 2, 3] x;

CREATE VIEW t AS
SELECT [1, 2, 3] AS x

Query id: 8c03f649-7cc9-4619-b141-765b9ab17ef7

Ok.

0 rows in set. Elapsed: 0.000 sec.

localhost :) select arrayMap(x -> x + 1, x) from t

SELECT arrayMap(x -> (x + 1), x)
FROM t

Query id: ca1b3c99-f996-480b-b445-98777383a5ef

┌─arrayMap(lambda(tuple(x), plus(x, 1)), x)─┐
│ [2,3,4]                                   │
└───────────────────────────────────────────┘

1 row in set. Elapsed: 0.001 sec.

Trino:

❯ trino --catalog memory --schema default
trino:default> create view t as select array[1, 2, 3] as x;
CREATE VIEW
trino:default> select transform(x, x -> x + 1) from t;
   _col0
-----------
 [2, 3, 4]
(1 row)

Query 20231218_115821_00236_x3bsy, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
0.10 [0 rows, 0B] [0 rows/s, 0B/s]
taniabogatsch commented 11 months ago

We bind lambda functions with our SQL scalar function binder, which has no concept of local variables that are not columns; i.e., the binder treats lambda parameters the same as columns. That's also why we recently pushed this PR...

> and couldn't possibly ever be t.x without the table identifier qualifying it

SQL scalar functions can reference column names without table identifiers, which should not change for lambda functions.

But after looking into this more, I agree that adding scoping for lambda functions, which are already a special case, makes sense and will make them more intuitive to use. I should be able to change the lambda binding to prioritize lambda parameter names over column names, and inner lambda parameter names over outer ones: inner lambda > outer lambda > column.
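The priority inner lambda > outer lambda > column amounts to walking a scope chain from the innermost lambda outward, falling back to columns last. A sketch under that assumption (hypothetical, not the actual binder):

```python
# Hypothetical scope-chain sketch of the priority described above:
# innermost lambda parameters first, then enclosing lambdas, then columns.
def resolve(name, scopes, columns):
    """scopes: list of parameter-name sets, innermost first."""
    for depth, params in enumerate(scopes):
        if name in params:
            return ("lambda", depth)
    if name in columns:
        return ("column", name)
    raise NameError(name)

# For a nested lambda like x -> list_transform(x, x -> x + 1),
# the inner body's x resolves to the innermost parameter (depth 0).
print(resolve("x", [{"x"}, {"x"}], {"x"}))  # ('lambda', 0)
```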

taniabogatsch commented 10 months ago

PR got merged.

cpcloud commented 9 months ago

Much appreciated!