PRQL / prql

PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement
https://prql-lang.org
Apache License 2.0
9.65k stars 207 forks source link

Call user-defined functions #4542

Open vanillajonathan opened 1 month ago

vanillajonathan commented 1 month ago

What's up?

I would like to call a user-defined function (UDF) named get_magic_name and pass it the name column as parameter.

from animals
derive {
  magic_name = get_magic_name name
}

Error:

Error: 
   ╭─[:3:16]
   │
 3 │   magic_name = get_magic_name name
   │                ───────┬──────  
   │                       ╰──────── expected a function, but found `this.animals.get_magic_name`
───╯

Perhaps user-defined functions could be reached from a pseudo namespace called udf like:

from animals
derive {
  magic_name = udf.get_magic_name name
}
max-sixty commented 1 month ago

Can we adjust the example so get_magic_name is defined?

vanillajonathan commented 1 month ago

But get_magic_name is not defined in the PRQL script because it is a user-defined function (UDF) that has been created on the database server using the CREATE FUNCTION SQL statement so PRQL doesn't know about it.

m-span commented 1 month ago

Perhaps get_magic_name can be implemented using s-strings?

max-sixty commented 1 month ago

But get_magic_name is not defined in the PRQL script because it is a user-defined function (UDF) that has been created on the database server using the CREATE FUNCTION SQL statement so PRQL doesn't know about it.

Ah I see. We could have a call_udf function, which I think could be an s-string...

Can you confirm what the SQL output would look like?

vanillajonathan commented 1 month ago

I suppose it would look like this:

SELECT
  *,
  get_magic_name(name) AS magic_name
FROM
  animals
max-sixty commented 1 month ago

OK great.

I think this can be handled by an s-string at the moment.

If there were substantial demand we could add a call_udf function, but it might be so trivial that it wouldn't be worthwhile.

What do you think about adding an example to the docs?

snth commented 1 month ago

Hi @vanillajonathan ,

I once answered a question on HackerNews which I think speaks to what you are looking for: https://news.ycombinator.com/item?id=37569946

Unfortunately the JSON DuckDB extension isn't loaded in the playground so the example doesn't produce Query Results but hopefully the SQL produced explains enough:

> prqlc compile <<EOF
let get = func path obj -> s"""{obj} -> {path}"""
let getstr = func path obj -> s"""{obj} ->> {path}"""
let extract = func obj path -> s"""json_extract({obj}, {path})"""

from [{data='{"duck": [1, 2, 3]}'}]
select { data | get '$.duck[0]', data | getstr '$.duck[1]', extract data '$.duck[2]' }
EOF
WITH table_0 AS (
  SELECT
    '{"duck": [1, 2, 3]}' AS data
)
SELECT
  data -> '$.duck[0]',
  data ->> '$.duck[1]',
  json_extract(data, '$.duck[2]')
FROM
  table_0

-- Generated by PRQL compiler version:0.9.4 (https://prql-lang.org)

I've been meaning to add this to the docs somewhere as I think JSON functionality is something that would speak to a lot of people. Will try and get to it.

vanillajonathan commented 1 month ago

I think this is tricky to use.

You cannot just declare an line s-string like:

from animals
derive {
  magic_name = s"get_magic_name" name
}

because you get the error:

Error: 
   ╭─[:3:16]
   │
 3 │   magic_name = s"get_magic_name" name
   │                ────────┬────────  
   │                        ╰────────── expected a function, but found `s"get_magic_name"`
───╯

So you have to do declare the function with a let statement before you can use it.

let get_magic_name = param_name -> s"""get_magic_name({param_name})"""

from animals
derive {
  magic_name = name | get_magic_name
}

which is a bit cumbersome. It would be nice to not have to do that. Would be easier:

from animals
derive {
  magic_name = name | udf.get_magic_name
}
snth commented 1 month ago

Oh I see. That's an interesting idea.

So in that case it would have to infer how many parameters it is being passed and then generate AST node equivalent to the corresponding s-string. Not my area of expertise how feasible that is at the moment but I could see it working in principle.

We had a discussion elsewhere about a possible db namespace for database objects. That would also make it easier for the resolver whether it should find a name internally or just assume that it exists in the database. In that case I would suggest something like the following syntax:

from animals
derive {
  magic_name = name | db.fn.get_magic_name
}

That should indicate to the compiler that get_magic_name is a function that exists in the database and then it can simply treat it as such. One problem with that approach is that the return type becomes opaque to the compiler. That's not such a problem at the moment but as we try to formalise the type system that might be a problem. Forcing the let statement function definition allows you to provide the type annotations, I guess like a foreign function interface (FFI) in other languages. Maybe that could be a requirement for a "strict" mode which is relaxed otherwise.

vanillajonathan commented 1 month ago

Yeah!

max-sixty commented 1 month ago

I definitely see the case here.

I'm not sure the proposal quite fits — it uses a namespace to indicate a type. TBC, it's not crazy — maybe we store all the UDFs in a namespace. But in this case it works for any string, which isn't quite consistent — it's like a magic namespace where we can add anything — sounds like a type!

We could have a function like udf:

let udf = function_name param_name -> s"""{function_name}({param_name})"""

from animals
derive {
  magic_name = name | (udf get_magic_name)
}

...though we would need to implement something to allow passing a tuple of params; this only supports a single param...

vanillajonathan commented 1 month ago

Yeah, a udf function to call user-defined functions would be okay too, it would be better than what we have right now, because right now it is rather awkward having declare functions using s-strings prior to using them.

With risk of sounding whiny, PRQL is meant to make our lives easier but I feel that:

from animals
derive {
  magic_name = name | (udf get_magic_name)
}

looks harder to read, write and understand than SQL:

SELECT
  *,
  get_magic_name(name) AS magic_name
FROM
  animals
kgutwin commented 1 month ago

A couple more thoughts to add to this interesting discussion!

The query above can be written fully equivalently as:

from animals
derive {
  magic_name = udf get_magic_name name
}

which is really close to the original request (just with udf inserted in the right spot). It also might be able to be treated by the compiler as sugar for udf.get_magic_name or db.fn.get_magic_name, which are related to the proposals above. If we're going down the sugar route, we could also consider funky syntax like &get_magic_name.

Also, if PRQL doesn't have awareness of the database schema, then it isn't going to actually know if get_magic_name is a UDF or a database-native function. That's either a good thing or a bad thing, depending how you look at it, because it allows writing:

from animals
derive {
  full_name = udf concat first_name " " last_name
}

which would compile to:

SELECT
  *,
  concat(first_name, ' ', last_name) AS full_name
FROM
  animals

This would end up being another "escape hatch" to regular SQL, just as S-strings are right now.


I'm not sure if I have a preference in any of this. I think my tendency would be to declare all of the UDFs that I care about in a module, which would allow for type checking and make things a bit more explicit - but I'm not sure this is the most user-friendly.

module fns {
  let get_magic_name = func name <text> -> <text> s"get_magic_name({_param.name})"
}

from animals
derive {
  magic_name = fns.get_magic_name name
}

The above snippet works in the playground right now.