PRQL / prql

PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement
https://prql-lang.org
Apache License 2.0
9.96k stars 218 forks source link

issue when using a DuckDb Macro #4150

Open maelp opened 9 months ago

maelp commented 9 months ago

What happened?

I have a macro defined in DuckDb to access Parquet files on s3

CREATE MACRO cse(year, weekA, weekB) AS TABLE SELECT * FROM read_parquet(list_transform(
            generate_series(weekA, weekB),
            week -> format('s3://${BUCKET_PATH}/year={0}/week={1}/0.parquet', year, week)
        ));

in SQL I can do

SELECT count(*) FROM cse(2024, 1, 5);

but if I try to compile the following PRQL

from cse(2024,1,5)
aggregate {
   count this
}

it gives an error

[{"kind":"Error","code":null,"reason":"unexpected , while parsing function call","hints":[],"span":"1:13-14","display":"Error: \n   ╭─[:1:14]\n   │\n 1 │ from cse(2024, 1, 5)\n   │              ┬  \n   │              ╰── unexpected , while parsing function call\n───╯\n","location":{"start":[0,13],"end":[0,14]}}]}

PRQL input

see above

SQL output

See above

Expected SQL output

No response

MVCE confirmation

Anything else?

No response

maelp commented 9 months ago

Right now I seem to be "forced" to use an s-string which is not very neat, eg

from s"SELECT * FROM cse(2024, 1, 4)"
aggregate {
   count this
}

or something, is there a better way?

maelp commented 9 months ago

similarly if I have a field with a "reserved name" like "timestamp" I cannot directly use it, I need to use s"timestamp", for instance

aggregate {
   min_time = min s"timestamp" -- using "min timestamp" would generate an error because "timestamp" is a type
}

isn't there a way to see that what follows is necessarily a value, and should be coerced to a value?

max-sixty commented 9 months ago

Hi @maelp , good questions, thanks for opening an issue.

Check out https://prql-lang.org/book/reference/syntax/keywords.html — these are solvable with backticks and this. respectively:

from `cse(2024,1,5)`
aggregate {
   sum this.time
}
SELECT
  COALESCE(SUM(time), 0)
FROM
  "cse(2024,1,5)"

-- Generated by PRQL compiler version:0.11.1 (https://prql-lang.org)

Does that make sense?

maelp commented 9 months ago

Thanks! However if I use this DuckDB MACRO

CREATE MACRO cse(year, weekA, weekB) AS TABLE SELECT * FROM read_parquet(list_transform(
            generate_series(weekA, weekB),
            week -> format('s3://${BUCKET_PATH}/year={0}/week={1}/0.parquet', year, week)
        ));

I think it doesn't create a "table" per-se, but should be used as SELECT * FROM cse(...)

do you think I should instead create a table? not sure if I can do this as a macro / function?

When I use your code I get the error:

Catalog Error: Table with name recent_cse(4) does not exist!
maelp commented 9 months ago

The generated SQL is this

WITH table_0 AS (
  SELECT
    bms_id,
    GREATEST(value.temperature_mosfet, value.temperature_ic) AS _expr_0,
    GREATEST(value.temperature_front, value.temperature_back) AS _expr_1,
    LEAST(value.temperature_front, value.temperature_back) AS _expr_2
  FROM
    "recent_cse(4)"
)
SELECT
  bms_id,
  MIN(_expr_2) AS min_temp_cells,
  MAX(_expr_1) AS max_temp_cells,
  MAX(_expr_0) AS max_temp_bms,
  COUNT(*) AS count
FROM
  table_0
WHERE
  _expr_2 < -10
  OR _expr_1 > 40
  OR _expr_0 > 70
GROUP BY
  bms_id

I guess it should not use the \" character because I want the macro to be run in duckdb no?

maelp commented 9 months ago

I verified that if I run the same query without the quotes, eg not "cse" but directly cse it works

eg

WITH table_0 AS (
  SELECT
    bms_id,
    GREATEST(value.temperature_mosfet, value.temperature_ic) AS _expr_0,
    GREATEST(value.temperature_front, value.temperature_back) AS _expr_1,
    LEAST(value.temperature_front, value.temperature_back) AS _expr_2
  FROM
    recent_cse(4)
)
SELECT
  bms_id,
  MIN(_expr_2) AS min_temp_cells,
  MAX(_expr_1) AS max_temp_cells,
  MAX(_expr_0) AS max_temp_bms,
  COUNT(*) AS count
FROM
  table_0
WHERE
  _expr_2 < -10
  OR _expr_1 > 40
  OR _expr_0 > 70
GROUP BY
  bms_id

is there a way to generate this simply from PRQL?

maelp commented 9 months ago

And if I try

from s"recent_cse(4)"
derive {
  min_temp_cells = s"LEAST(value.temperature_front, value.temperature_back)",
  max_temp_cells = s"GREATEST(value.temperature_front, value.temperature_back)",
  max_temp_bms = s"GREATEST(value.temperature_mosfet, value.temperature_ic)",
}
filter min_temp_cells < -10 || max_temp_cells > 40 || max_temp_bms > 70
group bms_id (
  aggregate {
    min_temp_cells = min min_temp_cells,
    max_temp_cells = max max_temp_cells,
    max_temp_bms = max max_temp_bms,
    count = count this,
  }
)```

it says

s-strings representing a table must start with SELECT

max-sixty commented 9 months ago

Ah, so we want literally FROM cse(2024,1,5) and not the quoted FROM "cse(2024,1,5)".

In that case, I think we do need to use the from s"SELECT * FROM cse(2024, 1, 4)", since currently PRQL expects an identifier.

We do have prior art for something more with the read_parquet-like functions; so it's not out of the question to do something similar.


This probably isn't going to make this perfect immediately, but it would be possible to make a function so the inline call were elegant:

prql target:sql.duckdb

let cse = x y z -> s"select * from cse({x},{y},{z})"

from (cse 1 2 3)
derive x = 5
WITH table_0 AS (
  SELECT
    *
  from
    cse(1, 2, 3)
)
SELECT
  *,
  5 AS x
FROM
  table_0

-- Generated by PRQL compiler version:0.11.1 (https://prql-lang.org)
maelp commented 9 months ago

would be interesting if we could have something like this! Or allow macros and duckdb functions in PRQL or something

maelp commented 9 months ago

BTW is it possible to do a multi-line PRQL function (this does not seem to be shown in the documentation)

I'm trying to do a helper function, then call it like this

let checkTemps = numWeeks minCellTemp maxCellTemp maxBMSTemp -> (from (recent_cse numWeeks)
    derive {
      min_temp_cells = s"LEAST(value.temperature_front, value.temperature_back)",
      max_temp_cells = s"GREATEST(value.temperature_front, value.temperature_back)",
      max_temp_bms = s"GREATEST(value.temperature_mosfet, value.temperature_ic)",
    }
    filter min_temp_cells < minCellTemp || max_temp_cells > maxCellTemp || max_temp_bms > maxBMSTemp
    group bms_id (
      aggregate {
        min_temp_cells = min min_temp_cells,
        max_temp_cells = max max_temp_cells,
        max_temp_bms = max max_temp_bms,
        count = count this,
      }
    )
)

checkTemps 4 -10 40 70

but it doesn't work...

max-sixty commented 9 months ago

The -10 needs to be parenthesized...

let checkTemps = numWeeks minCellTemp maxCellTemp maxBMSTemp -> (
from (recent_cse)
    derive {
      min_temp_cells = s"LEAST(value.temperature_front, value.temperature_back)",
      max_temp_cells = s"GREATEST(value.temperature_front, value.temperature_back)",
      max_temp_bms = s"GREATEST(value.temperature_mosfet, value.temperature_ic)",
    }
    filter min_temp_cells < minCellTemp || max_temp_cells > maxCellTemp || max_temp_bms > maxBMSTemp
    group bms_id (
      aggregate {
        min_temp_cells = min min_temp_cells,
        max_temp_cells = max max_temp_cells,
        max_temp_bms = max max_temp_bms,
        count = count this,
      }
    )
)

checkTemps 4 (-10) 40 70

...works!

(also I changed (recent_cse numWeeks) as that's not a function in the example)

Parentheses can be a bit confusing in some limited circumstances — explained here: https://prql-lang.org/book/reference/syntax/operators.html#parentheses. The main thing we can do is to make the error messages much much better.

max-sixty commented 9 months ago

would be interesting if we could have something like this! Or allow macros and duckdb functions in PRQL or something

We could have something very very simple like duckdb_macro cse 1 2 3. It's not elegant.

How do you / others think about macros vs functions? To the extent they're overlapping, I would deprioritize work here — having a boilerplate function for each macro doesn't seem too bad. To the extent they'll be coexisting lots, we could think about improving the interface.

maelp commented 9 months ago

Would be helpful to have a better help message indeed! I thought the issue was something else

yes the boilerplate can work I agree, so not a big priority. It's just that I want to be able to do queries both in SQL and PRQL so I'd like to have macros available in both to avoid duplicating code

would it be possible to rewrite that macro in pure PRQL (eg do you have access to duckdb generate_series and list_transform in PRQL? possibly in order to have those available, having something like a ::duckdb:generate_series and ::duckdb:list_transform as tokens (or a kind of PRQL "module" that we could import like "import duckdb" and then "duckdb.generate_series", "duckdb.list_transform" and "duckdb.call_macro" could be useful?)

eitsupi commented 9 months ago

Maybe you want to see this. https://eitsupi.github.io/querying-with-prql/method_chaining/

maelp commented 9 months ago

Sorry to use this issue to ask questions but I find that the documentation doesn't show all the power of PRQL yet, for instance: is there a way to do a ternary operation or an if / then / else in PRQL?

I'd like to do something like

select {
  result = someTemp < -10 ? "below" : "above"
}

is that possible?

eitsupi commented 9 months ago

You can use case https://prql-lang.org/book/reference/syntax/case.html

maelp commented 9 months ago

Thanks!

Last question (and small "bug report")

when I use this it works fine

let check_temperatures = numWeeks minCellTemp maxCellTemp maxBMSTemp -> (
  from (recent_cse numWeeks)
  derive {
    min_temp_cells = s"LEAST(value.temperature_front, value.temperature_back)",
    max_temp_cells = s"GREATEST(value.temperature_front, value.temperature_back)",
    max_temp_bms = s"GREATEST(value.temperature_mosfet, value.temperature_ic)",
  }
  filter min_temp_cells < minCellTemp || max_temp_cells > maxCellTemp || max_temp_bms > maxBMSTemp
  group bms_id (
    aggregate {
      min_temp_cells = min min_temp_cells,
      max_temp_cells = max max_temp_cells,
      max_temp_bms = max max_temp_bms,
      count = count this,
    }
  )
)

But if I'd like to "rephrase" the min_temp, max_temp, etc at the end to only display if they are above / under their respective threshold it fails, and the error message is weird (it complains about the "->" in the function definition)

let check_temperatures = numWeeks minCellTemp maxCellTemp maxBMSTemp -> (
  from (recent_cse numWeeks)
  derive {
    min_temp_cells = s"LEAST(value.temperature_front, value.temperature_back)",
    max_temp_cells = s"GREATEST(value.temperature_front, value.temperature_back)",
    max_temp_bms = s"GREATEST(value.temperature_mosfet, value.temperature_ic)",
  }
  filter min_temp_cells < minCellTemp || max_temp_cells > maxCellTemp || max_temp_bms > maxBMSTemp
  group bms_id (
    aggregate {
      min_temp_cells = min min_temp_cells,
      max_temp_cells = max max_temp_cells,
      max_temp_bms = max max_temp_bms,
      count = count this,
    }
    derive {
      min_temp_cells = case [
        min_temp_cells <= minCellTemp => min_temp_cells
        true => ""
      ],
      max_temp_cells = case [
        max_temp_cells >= maxCellTemp => max_temp_cells
        true => ""
      ],
      max_temp_cells = case [
        max_temp_bms >= maxBMSTemp => max_temp_bms
        true => ""
      ],
    }
  )
)

(another small nitpick: why not using the same "->" for case and function, and introducing another symbol like "=>" ?)

eitsupi commented 9 months ago

I think you forgot the comma in arrays.

case [
  foo # bad line
  bar
]
maelp commented 9 months ago

ah indeed you're right, but the error message is confusing, @max-sixty would there be a way to show the "correct" error in those cases?

max-sixty commented 9 months ago

@max-sixty would there be a way to show the "correct" error in those cases?

It's definitely possible. I'd say it's harder than I thought it would be when I originally proposed terser syntax...

For example, in that case we need to realize we're in a list, possibly attempt to execute a function bar with an argument foo & realize foo isn't a function — then add those up to say "maybe foo should have a trailing comma, and is part of a list". And we don't have column names atm.

I would really like to focus on this, and think this will be the focus of our GSOC application.

max-sixty commented 9 months ago

(another small nitpick: why not using the same "->" for case and function, and introducing another symbol like "=>" ?)

FYI there was some discussion on this in a past issue... I don't think anyone had a really strong view, we ended up favoring consistency with other langs