PRQL / prql

PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement
https://prql-lang.org
Apache License 2.0
9.92k stars 217 forks source link

Make prql a PostgreSQL stored Procedure Language? #725

Open michelp opened 2 years ago

michelp commented 2 years ago

https://github.com/tcdi/pgx allows making PostgreSQL extensions with Rust, it seems like a logical connection would be to use pgx to make prql a stored procedure language in the database.

CREATE FUNCTION foo() RETURNS bar LANGUAGE plprql AS $$
... prql query here
$$;

Thoughts?

max-sixty commented 2 years ago

That looks very exciting!

So IIUC this would intercept the query, written in PRQL, compile it to SQL, and then send than on to postgres?

IIUC, extensions can do much more than we'd need — e.g. be involved during the query — but even if we don't use that, this would let people send PRQL directly to postgres. Is that correct?

Thanks for raising this @michelp .

michelp commented 2 years ago

So IIUC this would intercept the query, written in PRQL, compile it to SQL, and then send than on to postgres?

IIUC, extensions can do much more than we'd need — e.g. be involved during the query — but even if we don't use that, this would let people send PRQL directly to postgres. Is that correct?

Yes to both, and you can cache the execution plan itself and only pay for the compilation once. In addition prql could call other routines, be called by other routines, accept arguments, return simple and composite types, set returning functions: it would make prql a first class stored procedure language just like sql, plpgsql, plv8, plpythonu, etc.

I'm not a Rust programmer so unfortunately that's as specific as I can get, an extension can register a new language callback and be instantiated into a database with CREATE LANGUAGE

https://www.postgresql.org/docs/current/sql-createlanguage.html

Here's plpython3's callback as a starting point:

https://doxygen.postgresql.org/plpy__main_8c.html#afe79221c59a91b3b176c75a28a28709b

Thanks for raising this @michelp .

michelp commented 2 years ago

Looks like there's also a "sample" pl language for tests https://doxygen.postgresql.org/plsample_8c.html

max-sixty commented 2 years ago

Great, thanks!

If anyone would like to explore this more, please do! I'm guessing it's a big project, but it could be very cool indeed.

snth commented 2 years ago

This would be very cool indeed!

I just want to clarify one point as I think there might be a slight misunderstanding (possibly on my part). This only applies to Stored Procedures and Functions. This doesn't actually allow Postgres to understand general queries written in PRQL, unless those are wrapped in a Stored Procedure.

I've been using Postgres for a number of years and in my data warehouse workloads with dbt I haven't had to use Stored Procedures for anything so far.

Still very cool though. Just wanted to clarify that.

snth commented 2 years ago

Weird, I received an email that there was a comment on this thread but it's not showing it. The url from the thread was: https://github.com/prql/prql/issues/725#issuecomment-1171654531

Anyway, I'm going to reply to the comment which was:

@snth it should work with DO which would effectively allow running general queries, no?

I had never of DO blocks before but they look interesting:

DO executes an anonymous code block, or in other words a transient anonymous function in a procedural language.

The code block is treated as though it were the body of a function with no parameters, returning void. It is parsed and executed a single time.

My only concern is that, if they return void then can a DO block return any data? All the examples I have found so far just show granting some roles on a schema.

Similarly the documentation for Python Anonymous Code Blocks states that:

An anonymous code block receives no arguments, and whatever value it might return is discarded. Otherwise it behaves just like a function.

This also makes me believe that a DO block can't return any data.

So while you could execute a PRQL query inside the DO block, since DML is not part of the scope of PRQL and you can't return the result from the DO block, there isn't really any way to do anything useful with the query result.

michelp commented 2 years ago

Weird, I received an email that there was a comment on this thread but it's not showing it. The url from the thread was: #725 (comment)

Yeah same here, weird.

@snth it should work with DO which would effectively allow running general queries, no?

I had never of DO blocks before but they look interesting:

DO executes an anonymous code block, or in other words a transient anonymous function in a procedural language. The code block is treated as though it were the body of a function with no parameters, returning void. It is parsed and executed a single time.

My only concern is that, if they return void then can a DO block return any data? All the examples I have found so far just show granting some roles on a schema.

Nope DO blocks cannot return data, if you want them to DO something, they have to have a side effect (DDL, CALL procedure, etc).

This also makes me believe that a DO block can't return any data.

So while you could execute a PRQL query inside the DO block, since DML is not part of the scope of PRQL and you can't return the result from the DO block, there isn't really any way to do anything useful with the query result.

Right, maybe an interesting integration point for the future, but it seems like the most useful thing for prql would be as a body of a function you write (and returns data).

snth commented 2 years ago

Thanks for the clarification @michelp . Agreed.

eitsupi commented 1 year ago

FYI https://github.com/JPMoresmau/prqlproxy

kaspermarstal commented 1 year ago

Hi, I'm taking a stab at this as mentioned in #3455 and hope you can help me with a quick question.

I would like to replace identifiers with $1, $2, ..., $n for prepared statements. I believe the best option is to hook into the compilation and rewrite the identifiers in the AST. Is there any way to achieve this? Alternatively, I could redo the compile function and add an extra step, but the parser is a private module that I can't import. Any tips?

max-sixty commented 1 year ago

That's great you're working on this, looking forward to seeing how it goes. Definitely ask questions here.


I would like to replace identifiers with $1, $2, ..., $n for prepared statements.

What sort of identifiers? Like column names?

I doubt this is the answer, but we can get a output from midway through the compilation stage by looking at the RQ. That's available from the library, and also from prqlc as sql:preprocess:

echo 'from invoices | select {foo, $8}' | cargo run -p prqlc sql:preprocess
    Finished dev [unoptimized + debuginfo] target(s) in 0.09s
     Running `target/debug/prqlc 'sql:preprocess'`
[
    From(
        RIId(
            0,
        ),
    ),
    Super(
        Compute(
            Compute {
                id: column-2,
                expr: Expr {
                    kind: Param(
                        "8",
                    ),
                    span: Some(
                        0:29-31,
                    ),
                },
                window: None,
                is_aggregation: false,
            },
        ),
    ),
    Super(
        Select(
            [
                column-0,
                column-2,
            ],
        ),
    ),
    Super(
        Select(
            [
                column-0,
                column-2,
            ],
        ),
    ),
]

There's an even later AST called SRQ, although when I look at the output it looks less descriptive, I can look more if that'd be helpful.

We could also possibly expose the sqlparser-rs AST, if we want something very close to SQL.


the parser is a private module that I can't import.

What's the function we want there? We can make things public if needed (I thought much of it was already public, but it's possible we've missed things — our main uses for exporting are to the bindings, which don't go that deep into the lib)

kaspermarstal commented 1 year ago

Thanks! I now see the prqlto* family of functions, I think these are perfect.

There's an even later AST called SRQ, although when I look at the output it looks less descriptive, I can look more if that'd be helpful.

So if I need to walk the AST and replace identifier names, would the "RQ" level be best you think? And what would be the easiest way to walk the AST you think?

Let me provide a bit more context for you this time 😃 :

What sort of identifiers? Like column names?

Any parameter value (the function's parameters) needs to be considered when evaluating the resulting SQL. The easiest way to do this with PGRX seems to be with their prepare_statements functionality. And this requires renaming the argument names referenced in the function body to $1 and $2 etc.

What sort of identifiers? Like column names?

I think so yes, but also a lot of other kinds of identifers. I don't have a complete idea of when we are allowed to do it or not. I hope we can just do a "dumb" replace and let prql or postgres report errors if the user wrote invalid code.

max-sixty commented 1 year ago

OK great!

Any parameter value (the function's parameters) needs to be considered when evaluating the resulting SQL. The easiest way to do this with PGRX seems to be with their prepare_statements functionality. And this requires renaming the argument names referenced in the function body to $1 and $2 etc.

To confirm my understanding — do we want to replace things like foo in 'SELECT foo? Or things like$1in things likeWHERE $1 = "foo"`? (I'm not that familiar with postgres functions — forgive me if this is obvious re "argument names referenced in the function body")

If the SQL AST is sufficient, then that would be the easiest — it's the AST of sqlparser-rs which the compiler produces before stringifying it. It's processed at this line. I don't think we have a public function to grab this, but that would be trivial to add, I'd be happy to do that. (You could also take the SQL string and pass it back into sqlparser-rs, which would work for now though be inefficient.)

That said, not sure if that is sufficient — it depends if we need context from the PRQL query which gets lots as we compile to SQL.

kaspermarstal commented 1 year ago

First milestone would be to replace things like where column_name = $1. For example, the following function

create or replace function get_movies_by_year(year integer) 
    returns table (
        title varchar,
        release int
    ) 
as $$
from movies
select {title, release}
filter release == year
$$ 
language 'plprql';

Should compile to:

SELECT
  title,
  "release"
FROM
  movies
WHERE
  "release" = year

and we would replace WHERE "release" = year with `WHERE "release" = $1, create a prepared statement, and execute it. (Although this won't compile, I think, because "year" is an unknown name. Any ideas for handling this?)

I think we are both unsure which AST is best to use. Your suggestion for passing the SQL string back into sqlparser-rs is a very good idea. Let me explore this a bit. Thanks for offering to add a public function to expose it. I will let you know if we should go this route.

Finally, If there are postgres or pgrx experts out there who have suggestions for other approaches than prepared statements, please chime in!

max-sixty commented 1 year ago

I see! And it's not possible to have function get_movies_by_year($1 integer)...filter release == $1, and avoid the need to do the replacement?

And to confirm — we're not doing "create SQL from arbitrary PRQL" atm, just replacing specific params — is that right? (Is it possible to do the arbitrary compilation? Otherwise we're limited to the PRQL that's been pre-written, is that correct?)

Assuming we do have to do the replacement, I think the easiest way is to grab the PL with prql_to_pl, then implement PlFold. Implementing fold_expr_kind of that trait will let us check for whether it's an ExprKind::Ident which matches year, and replace it with a ExprKind::Param("1") if so. That's only (optimistically) a dozen lines of code. It doesn't cover everything — if there's a year name somewhere else in the query, then it'll pick that up too. But it's a good start.


Re this query — we need to add it to the select. Or also OK to swap the lines. But the existing query didn't have year available after the select

from movies
-select {title, release}
+select {title, release, year}
filter release == year
michelp commented 1 year ago

Hello!

I'm excited this idea is getting some traction, go @kaspermarstal ! I don't have too much to add but wanted to point out some thoughts based on what I've seen so far.

A prepared statement is a good approach, but worth nothing that it can contain only one statement, and only DML operations (may not be an issue for prql). Also the prepared statement's plan caching ability is only good for the life of the session. Another issue is that prepared statements (to my knowledge) cannot be inlined into a larger query like, say, a function written in straight SQL. These issues might not be a stopper for a proof of concept.

It may be worth looking into reusing as much of the sql language function implementation as possible. If prql compiles to sql, it may be possible to pipeline that into the existing sql function framework to do all the heavy lifting. This may net you some good benefits like the inlining support. I did a quick look over the sql language function implementation in postgres, it can be found here: https://github.com/postgres/postgres/blob/master/src/backend/executor/functions.c

It's a bit above my head and I'm guess a lot of this code is some of the earliest written in Postgres, and the git history goes back to 1996. The core devs on the postgres mailing list may be able to point you to some good ideas on how to reuse some of this code.

kaspermarstal commented 1 year ago

I see! And it's not possible to have function get_movies_by_year($1 integer)...filter release == $1, and avoid the need to do the replacement?

It might be possible to enforce only unnamed arguments and have the user write $n. At least to begin with. Will look into that. Otherwise I will look into the prql_to_pl approach you spelled out. Thanks!

And to confirm — we're not doing "create SQL from arbitrary PRQL" atm, just replacing specific params — is that right? (Is it possible to do the arbitrary compilation? Otherwise we're limited to the PRQL that's been pre-written, is that correct?)

Can elaborate what you mean with "arbitrarty PRQL" and "arbitrary compilation"? Also, what do you mean with PRQL that's been pre-written? A function is always pre-written prior to execution, right?

@michelp you raise some great points. I noticed that pgrx recently merged "fn_call" which may be exactly what we need to call a sql function. And this we can create via SPI.

kaspermarstal commented 1 year ago
from movies
-select {title, release}
+select {title, release, year}
filter release == year

This is fine for proof-of-concept but it feels wrong to me to select a column from the table that does not exist in the table if only to bring the name into scope. Are there other workarounds? Like, can we "declare" a variable like "year" on beforehand?

max-sixty commented 1 year ago
from movies
-select {title, release}
+select {title, release, year}
filter release == year

This is fine for proof-of-concept but it feels wrong to me to select a column from the table that does not exist in the table if only to bring the name into scope. Are there other workarounds? Like, can we "declare" a variable like "year" on beforehand?

Totally — the easiest approach is to move the filter up so that it's above the select transform, which cuts the columns that are available:

from movies
+filter release == year
select {title, release}
-filter release == year
max-sixty commented 1 year ago

Can elaborate what you mean with "arbitrary PRQL" and "arbitrary compilation"? Also, what do you mean with PRQL that's been pre-written? A function is always pre-written prior to execution, right?

@michelp you raise some great points. I noticed that pgrx recently merged "fn_call" which may be exactly what we need to call a sql function. And this we can create via SPI.

I had meant — would it theoretically be possible to send PRQL to Postgres, have the extension translate to SQL, and then have Postgres execute the SQL? By "arbitrary" I had meant that we could send any PRQL, rather than using only being able to use PRQL within stored procedures.

(tbc, I have basically no experience in postgres extensions / stored procs, so am asking these questions from a place of not knowing. And great to make some proof-of-concept and then we iterate on it in the future)

kaspermarstal commented 1 year ago

Totally — the easiest approach is to move the filter up so that it's above the select transform, which cuts the columns that are available:

from movies
+filter release == year
select {title, release}
-filter release == year

Ah I see!

I had meant — would it theoretically be possible to send PRQL to Postgres, have the extension translate to SQL, and then have Postgres execute the SQL? By "arbitrary" I had meant that we could send any PRQL, rather than using only being able to use PRQL within stored procedures.

Not natively, at least to my knowledge. You would have rewrite the query before it hits the database like https://github.com/JPMoresmau/prqlproxy (linked above) does.

Based on your inputs @max-sixty and @michelp I will do the following:

This lets us discuss design decisions based on something concrete, and perhaps we can make a roadmap from there.

When pgrx make their next release, which hopefully includes their fn_call functionality, I would also like to explore your excellent suggestion of using SQL functions @michelp.

kaspermarstal commented 11 months ago

Hi again, the prototype is now available at https://github.com/kaspermarstal/plprql.

Let's treat this as an RFC. To move our discussion forward, I suggest you read the design doc and look at the tests to see how the extension is exercised. You can also look at the code of course, if you have the time. Then let me know what you think.

There is one major blocker which is related to the pgrx API. This is described in detail in the design doc here.

I should also mention that I have little experience with rust and I am doing this as a learning project so tips on small things, coding style, error handling, and whatnot is also very welcome.

kaspermarstal commented 9 months ago

Hi, the last major blocker mentioned above is now resolved. While still alpha, the extension is now in a state where it can actually be used!

max-sixty commented 9 months ago

Awesome @kaspermarstal !

Sorry if I'm being slow to understand the discussion above — but to what extent could an extension translate any PRQL to SQL without having to create a new function for each query?

IIUC, at the moment using the extension relies on writing a create function command for each parameterized PRQL function we want to execute. If the function could instead accept arbitrary PRQL, then sending PRQL to the database would be comparable to sending SQL.

(I don't know much about postgres functions, and while not reliable, here's a ChatGPT conversation claiming that it's possible)

kaspermarstal commented 9 months ago

It's a great question, and I'll try to clarify how the extension works and what the limitations of PostgreSQL are.

Yes, the extension requires you to define a specific function for each PRQL query you want to execute.

Postgres only supports SQL natively. Support for other languages must use the "Procedural Language" mechanism in the same way that Python, Rust, Javascript etc etc are supported. These implementations also requires the user to create functions. (If someone out there on the big internet knows something I don't, please correct me)

So while this extension does not make PRQL a first-class citizen like SQL, it supports PRQL in the same way as other officially supported languages and community-supported languages.

I don't think there is a way around functions. Perhaps we could hook into lower layers at postgres and rewrite the incoming SQL there. And pgrx would have to expose what we need at these lower layers. But this is beyond my current knowledge of Postgres. Would have to investigate.

max-sixty commented 9 months ago

I am really not confident here, so forgive me if this is quite far off the mark. I'm pushing because this seems quite close to being really cool...

How about something like this? While Postgres functions seem to require being typed with the return type (i.e. the columns), it sounds like it's possible to return a cursor.

Here's an example an LLM made (I doubt it compiles at first, but hopefully offers an idea)

use pgx::*;
use std::ffi::CString;

pg_module_magic!();

// This function calls the external `compile_prql` function
extern "C" {
    fn compile_prql(input: *const c_char) -> *const c_char;
}

#[pg_extern]
fn execute_prql(query: &str) -> String {
    let query_cstr = CString::new(query).expect("Failed to create CString");
    let compiled_sql = unsafe { compile_prql(query_cstr.as_ptr()) };

    // Ensure the compiled SQL is valid
    if compiled_sql.is_null() {
        panic!("compile_prql returned a null pointer");
    }

    let sql = unsafe { CStr::from_ptr(compiled_sql).to_string_lossy().into_owned() };

    Spi::connect(|client| {
        let portal = client.open_anonymous_portal(&sql);
        portal.name().to_string()
    })
}

...and then would be called by SQLAlchemy like:

from sqlalchemy import create_engine

# Replace with your database connection details
engine = create_engine('postgresql://username:password@localhost/dbname')

with engine.connect() as conn:
    # Call the function and get the cursor name
    result = conn.execute("SELECT execute_prql('your_custom_query')")
    cursor_name = result.fetchone()[0]

    # Fetch from the cursor
    result = conn.execute(f"FETCH ALL IN \"{cursor_name}\"")
    for row in result:
        print(row)  # Each row from your query

This obv isn't all the way there — it's not possible to just do result = conn.execute("SELECT execute_prql('your_custom_query')") and get the actual rows. But it would allow for arbitrary PRQL queries...

kaspermarstal commented 9 months ago

I see why a function taking a PRQL string as an argument here is convenient.

But as you point out, the return type is the challenge. There's an interesting note on record types in the documention here:

Record variables are similar to row-type variables, but they have no predefined structure. They take on the actual row structure of the row they are assigned during a SELECT or FOR command. The substructure of a record variable can change each time it is assigned to.

Let me get back to you on this.

kaspermarstal commented 9 months ago

You are actually full of good ideas Max, the cursor approach works. This function:

create function prql(str text) returns refcursor as $$
declare
  cursor refcursor := 'prql_cursor';
begin
    open cursor for execute prql_to_sql(str);
    return (cursor);
end;
$$ language plpgsql;

Can be used like this:

select prql('from base.people | filter planet_id == 1 | sort name');
fetch 8 from prql_cursor;

Just need to make the cursor name configurable (edit: it is now configurable). See the test for details.

The record-based approach would be slightly more ergonomic but have not gotten it to work yet. Oh well.

max-sixty commented 9 months ago

That's awesome! Nice work!

To what extent is that shippable as a postgres extension? So someone could install the extension and then call it like your example?

That would be really cool, we'd be excited to put that on the home page etc.


The record-based approach would be slight more ergonomic but have gotten it to work yet. Oh well.

Yeah, it would save the extra line. But IIUC postgres doesn't seem to allow returning results themselves unless at least one of the function or the caller has an explicit type.

kaspermarstal commented 9 months ago

Thanks!

Yes, at this point it is fully shippable as a postgres extension and people can install it and call it like the example above. There's a guide on how to install from source in the readme. I'm exicited that your are excited, but do you agree that ideally I would work on documentation, CI, and binary distribution channels before official announcement? Do you have any comments on:

max-sixty commented 9 months ago

Great! Yes agree re ensuring it's easy to install.

TBH I don't have much context for what's standard among postgres extensions — those links look very reasonable though.

From PRQL's end, there are a couple of places it would be good to add docs:

Those should be quite easy to add to. If it's confusing where they come from, searching for the current text in the repo should yield where they're at.

(I would also vote to update the readme a bit so it references the new queries. and super-narrow point but the current design doc link isn't current :) )

Let us know anything we can do!

kaspermarstal commented 9 months ago

Great, for now, it would be really good to smoke test the extension, like installing from source and playing around on test data, and see if we can catch the most obvious bugs. Any volunteers?

I will get started on the documentation. README has been fixed :D

max-sixty commented 9 months ago

Great, for now, it would be really good to smoke test the extension, like installing from source and playing around on test data, and see if we can catch the most obvious bugs. Any volunteers?

Great! I'm happy to try things. I'd also encourage having automated tests for this if possible!

kaspermarstal commented 9 months ago

Just opened a PR for docs. In the extension's own repo, I added CI for test suite and installing from source to system installation of PostgreSQL.