PRQL / prql

PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement
https://prql-lang.org
Apache License 2.0
9.92k stars 217 forks source link

Column reference error for `loop` inside functions #4717

Open snth opened 3 months ago

snth commented 3 months ago

What happened?

There's a regression in PRQL version:0.12.2 in that some loop examples no longer compile correctly.

The problems are:

The expected output runs correctly in shell.duckdb.org.

PRQL input

let calc_ewma = func span rel -> (
  from rel
  filter t==1
  select {t, x, ewma=x}
  loop (
    join rel (that.t==this.t+1)
    select {t, x, ewma = 2/(span+1)*x + (1-2/(span+1))*ewma}
  )
)

let data = [
  {ts=1, value=1},
  {ts=2, value=12},
  {ts=3, value=13},
  {ts=4, value=5},
  {ts=5, value=1},
  ]

from data
select {t=ts, x=value}
calc_ewma 3
select {ts=t, value=x, value_ewma3=ewma}

SQL output

WITH RECURSIVE table_0 AS (
  SELECT
    1 AS ts,
    1 AS value
  UNION
  ALL
  SELECT
    2 AS ts,
    12 AS value
  UNION
  ALL
  SELECT
    3 AS ts,
    13 AS value
  UNION
  ALL
  SELECT
    4 AS ts,
    5 AS value
  UNION
  ALL
  SELECT
    5 AS ts,
    1 AS value
),
data AS (
  SELECT
    ts,
    value
  FROM
    table_0
),
table_3 AS (
  SELECT
    ts AS t,
    value AS x
  FROM
    data
),
table_2 AS (
  SELECT
    _expr_0,
    _expr_1,
    _expr_1 AS _expr_2
  FROM
    table_3
  WHERE
    _expr_0 = 1
  UNION
  ALL
  SELECT
    table_5.t AS _expr_3,
    table_5.x AS _expr_4,
    2 / (3 + 1) * table_5.x + (1 - 2 / (3 + 1)) * table_2._expr_2
  FROM
    table_2
    JOIN (
      SELECT
        ts AS t,
        value AS x
      FROM
        data
    ) AS table_5 ON table_5.t = table_5.t + 1
)
SELECT
  t AS ts,
  x AS value,
  _expr_2 AS value_ewma3
FROM
  table_2 AS table_4

-- Generated by PRQL compiler version:0.12.2 (https://prql-lang.org)

Expected SQL output

WITH RECURSIVE table_0 AS (
  SELECT
    1 AS ts,
    1 AS value
  UNION
  ALL
  SELECT
    2 AS ts,
    12 AS value
  UNION
  ALL
  SELECT
    3 AS ts,
    13 AS value
  UNION
  ALL
  SELECT
    4 AS ts,
    5 AS value
  UNION
  ALL
  SELECT
    5 AS ts,
    1 AS value
),
data AS (
  SELECT
    ts,
    value
  FROM
    table_0
),
table_3 AS (
  SELECT
    ts AS t,
    value AS x
  FROM
    data
),
table_2 AS (
  SELECT
    t,
    x,
    x AS ewma
  FROM
    table_3
  WHERE
    t = 1
  UNION
  ALL
  SELECT
    table_5.t,
    table_5.x,
    2 / (3 + 1) * table_5.x + (1 - 2 / (3 + 1)) * table_2.ewma as ewma
  FROM
    table_2
    JOIN (
      SELECT
        ts AS t,
        value AS x
      FROM
        data
    ) AS table_5 ON table_5.t = table_2.t + 1
)
SELECT
  t AS ts,
  x AS value,
  ewma AS value_ewma3
FROM
  table_2 AS table_4

MVCE confirmation

Anything else?

This might have been introduced in version 0.10 or 0.11 already as I noticed my examples failing to compile then. I've just been pinning to 0.9.5 since then.

kgutwin commented 3 months ago

I wanted to try to bisect this, but I can't reproduce a successful compile of your PRQL query going back to version 0.9.0.

SQL output from 0.9.0:

WITH RECURSIVE table_1 AS (
  SELECT
    1 AS ts,
    1 AS value
  UNION
  ALL
  SELECT
    2 AS ts,
    12 AS value
  UNION
  ALL
  SELECT
    3 AS ts,
    13 AS value
  UNION
  ALL
  SELECT
    4 AS ts,
    5 AS value
  UNION
  ALL
  SELECT
    5 AS ts,
    1 AS value
),
data AS (
  SELECT
    ts,
    value
  FROM
    table_1
),
table_3 AS (
  SELECT
    ts AS t,
    value AS x
  FROM
    data
),
table_0 AS (
  SELECT
    _expr_0,
    _expr_1,
    _expr_1 AS _expr_2
  FROM
    table_3
  WHERE
    _expr_0 = 1
  UNION
  ALL
  SELECT
    table_4.t AS _expr_3,
    table_4.x AS _expr_4,
    (2 / (3 + 1)) * table_4.x + (1 - (2 / (3 + 1))) * table_0._expr_2
  FROM
    table_0
    JOIN (
      SELECT
        ts AS t,
        value AS x
      FROM
        data
    ) AS table_4 ON table_4.t = table_4.t + 1
)
SELECT
  t AS ts,
  x AS value,
  _expr_2 AS value_ewma3
FROM
  table_0

-- Generated by PRQL compiler version:0.9.0 (https://prql-lang.org)

I didn't go back further than that, as the syntax changes introduced in 0.9.0 would require rewriting your query.

Can you confirm a version of PRQL that does successfully compile?

snth commented 3 months ago

Thanks @kgutwin and sorry about the broken repro. I tried to create a self-contained repro from the stock price ewma example from this Google Colab notebook I use in presentations: https://colab.research.google.com/drive/146ohCI_WlnOKpQQ6sPMWu4uKGqZvwaIZ?usp=sharing

Looks like the column aliasing in select {t=ts, x=value} breaks it even in 0.9.5. It works if you remove that part, in 0.9.5 and 0.12.2 🤦‍♂️. So it looks like that's not the regression bug part but something else is because my actual example did stop working at some point which is why I've got it pinned to 0.9.5. Let me see if I can isolate that.

Here's the working example in the meantime (same calc_ewma function as above but different data):

let data = [
  {t=1, x=1},
  {t=2, x=12},
  {t=3, x=13},
  {t=4, x=5},
  {t=5, x=1},
  ]

from data
calc_ewma 3
snth commented 3 months ago

The following also works (in both versions):

let data = [
  {t=1, x=1},
  {t=2, x=12},
  {t=3, x=13},
  {t=4, x=5},
  {t=5, x=1},
  ]

from data
calc_ewma 3
derive x_ewma3=ewma
select !{ewma}

but this doesn't (in both):

let data = [
  {ts=1, y=1},
  {ts=2, y=12},
  {ts=3, y=13},
  {ts=4, y=5},
  {ts=5, y=1},
]

from data
derive {t=ts, x=y}
calc_ewma 3
derive x_ewma3=ewma
select !{t, x, ewma}

Btw, this last version is really the example we want (using derive {t=...} ... select !{...}) because that does the necessary aliasing to use the function and then leaves the original relation with the original columns. (Actually, that function drops other columns in the source relation so perhaps the result should be joined onto the original but that's a separate issue.)

Still trying to track down the original regression bug.

snth commented 3 months ago

Sorry, looks like it's actually one of the other examples that breaks

let list_subaccounts = root_key accounts -> (
  from parent=accounts
  filter parent_account_key==root_key
  select {parent.account_key, parent.account_name, account_path=parent.account_name, level=0}
  loop (
    join child=accounts (child.parent_account_key==parent.account_key)
    select {child.account_key, child.account_name, account_path=f"{account_path} / {child.account_name}", level=level+1}
    )
  select {level, account_name, account_path}
)

from balance_sheet
list_subaccounts 0          # Balance Sheet
# list_subaccounts 1272     # Assets

In 0.9.5 this produces:

WITH RECURSIVE table_0 AS (
  SELECT
    account_key,
    account_name,
    account_name AS account_path,
    0 AS level
  FROM
    balance_sheet AS parent
  WHERE
    parent_account_key = 0
  UNION
  ALL
  SELECT
    child.account_key AS _expr_0,
    child.account_name AS _expr_1,
    CONCAT(table_0.account_path, ' / ', child.account_name),
    table_0.level + 1
  FROM
    table_0
    JOIN balance_sheet AS child ON child.parent_account_key = table_0.account_key
)
SELECT
  level,
  account_name,
  account_path
FROM
  table_0

-- Generated by PRQL compiler version:0.9.5 (https://prql-lang.org/)

From 0.10 onwards it produces the following error:

Traceback (most recent call last):

  File "/usr/local/lib/python3.10/dist-packages/IPython/core/interactiveshell.py", line 3553, in run_code
    async_ :  Bool (Experimental)

  File ["<ipython-input-2-acfd2b139886>"](https://localhost:8080/#), line 2, in <cell line: 2>
    print(prql.compile("""

  File "<string>", line unknown
SyntaxError: Error:
   ╭─[:8:13]
   │
 8 │     select {child.account_key, child.account_name, account_path=f"{account_path} / {child.account_name}", level=level+1}
   │             ────────┬────────
   │                     ╰────────── Unknown name `child.account_key`
───╯

It took me a long time to track this down because when you actually include some sample data (as you would for a repro) then it works and the error doesn't show up, e.g.

let balance_sheet = from_text format:csv """
account_key,parent_account_key,account_name
1221,0000,Balance Sheet
1272,1221,Assets
1300,1272,Current Assets
1301,1300,Cash & Cash Equivalent
1302,1301,Bank Accounts
1307,1300,Inventories
"""

from balance_sheet
list_subaccounts 0      # Balance Sheet

works.

snth commented 3 months ago

Interestingly, without the function definition it compiles on 0.12.2:

from parent=accounts
filter parent_account_key==root_key
select {parent.account_key, parent.account_name, account_path=parent.account_name, level=0}
loop (
  join child=accounts (child.parent_account_key==parent.account_key)
  select {child.account_key, child.account_name, account_path=f"{account_path} / {child.account_name}", level=level+1}
  )
select {level, account_name, account_path}

produces

WITH RECURSIVE table_0 AS (
  SELECT
    account_key,
    account_name,
    account_name AS account_path,
    0 AS level
  FROM
    accounts AS parent
  WHERE
    parent_account_key = root_key
  UNION
  ALL
  SELECT
    child.account_key AS _expr_0,
    child.account_name AS _expr_1,
    CONCAT(table_0.account_path, ' / ', child.account_name),
    table_0.level + 1
  FROM
    table_0
    JOIN accounts AS child ON child.parent_account_key = table_0.account_key
)
SELECT
  level,
  account_name,
  account_path
FROM
  table_0 AS table_1

-- Generated by PRQL compiler version:0.12.2 (https://prql-lang.org)

I'd argue that the column aliases in the recursive part of the query are wrong but DuckDB doesn't seem to mind and it produces the correct result set if you run the whole example:

let accounts = from_text format:csv """
account_key,parent_account_key,account_name
1221,0000,Balance Sheet
1272,1221,Assets
1300,1272,Current Assets
1301,1300,Cash & Cash Equivalent
1302,1301,Bank Accounts
1307,1300,Inventories
"""

let root_key=0

from parent=accounts
filter parent_account_key==root_key
select {parent.account_key, parent.account_name, account_path=parent.account_name, level=0}
loop (
  join child=accounts (child.parent_account_key==parent.account_key)
  select {child.account_key, child.account_name, account_path=f"{account_path} / {child.account_name}", level=level+1}
  )
select {level, account_name, account_path}

For the record, the correct SQL for the recursive section should probably be something like the following:

...
  UNION
  ALL
  SELECT
    child.account_key,
    child.account_name,
    CONCAT(table_0.account_path, ' / ', child.account_name) AS account_path,
    table_0.level + 1 AS level
  FROM
    table_0
    JOIN accounts AS child ON child.parent_account_key = table_0.account_key
...
snth commented 3 months ago

I've changed the name of the issue.

Should I delete the first few EWMA examples since they actually work fine and are a lot to read which doesn't add to the actual problem description?