PRQL query engine - Githubissues

aljazerzen commented 1 year ago

I've proposed we build Lutra a while back, with the intention of constructing a CLI for running PRQL. The motivation was to make "something that is an execution platform for PRQL". CLI seemed most useful way of providing that to a user like me. But the CLI ended up being the major part of the proposal, leaving the core idea a bit obfuscated.

So now, I'll revive the core idea and propose an execution engine, that would:

take PRQL an in input and
return the value of the input expression as the output.

This is different from the compiler in that the compiler returns an SQL query that can be executed to obtain the value of the input expression.

So make this happen, it would:

compile PRQL to appropriate SQL dialect,
use an appropriate adapter to make a database query,
convert the result to a common format (probably arrow),
print/write the output.

Why would this be useful?

the engine can do stuff that should not be done by the compiler (mainly network calls),
we already have something similar, scatterted between the playground, integration tests and prql-query,
current implementations are not fully correct, as they convert many types to strings,
a standard way to connect to a database would be useful for a Language Server Protocol, for retrieving database schema, which would be used for generating table type definitions,
this would be how you can "try out PRQL" the easy way.

I've arrived to the idea that we need something similar a few times now. This is probably because I see that we are constantly working on "PRQL -> SQL" instead of "PRQL -> relations", as I tried to put into words here.

Future expansions:

support different backends, including local computation engine,
pull data from multiple sources and apply additional computation on the joined datasets,
download external PRQL sources (packages)

My question now is: do you think this is worth developing?

max-sixty commented 1 year ago

I thought the idea had lots of promised when we discussed it before! IIRC it would also manage "a PRQL project" — which to some extent has morphed into our nascent module system.

I think my main concern is how PRQL-specific some of it is. There are lots of attempts to build query engines / database-agnostic middleware / etc — https://github.com/GlareDB/glaredb is another recent one, as an example.

The parts that are PRQL-specific would be:

a standard way to connect to a database would be useful for a Language Server Protocol, for retrieving database schema, which would be used for generating table type definitions

this would be how you can "try out PRQL" the easy way.

I worry the other parts are less PRQL specific, and also lots of work. And plausibly have a long tail of support — if it goes well, folks are going to ask how to convert postgis types to arrow, etc etc — it's not like it's a project that would ever finish.

Obviously folks should work on what they want to work on — open-source is successful when it's engaging, it's not like we have some fixed amount of our time to allocate. But if you're asking whether it would compound for the project, my guess would be that the PRQL-specific parts would be great — an LSP would be huuuge — but more general things, like the query middleware, would be less impactful per se.

Does that resonate at all? Or am I being too pessimistic?

aljazerzen commented 1 year ago

There are similar projects, but none of them integrates with PRQL. Also, my quick search did not yield any adapters that would connect to a database (i.e. Postgres), execute a SELECT query, fetch the data and convert it into arrow.

I do think you are underestimating the effect of having an execution platform. At the moment, having just the compiler means that running PRQL is something you need to setup yourself. This is not convenient for trying out and even less so for a quick one-off data transformation.

eitsupi commented 1 year ago

but none of them integrates with PRQL.

What do you mean by integrates? GlareDB has prql-compiler 0.10.1 built in. GlareDB/glaredb#2111

my quick search did not yield any adapters that would connect to a database (i.e. Postgres), execute a SELECT query, fetch the data and convert it into arrow.

I don't believe there is currently an ADBC implementation (https://arrow.apache.org/adbc/current/index.html) and Flight SQL driver (https://arrow.apache.org/adbc/current/driver/flight_sql.html) in Rust. Therefore, I feel that it is not a good idea to do these things in Rust at this time, considering future standards. Python, which should have bindings for all of these, should be an easy tool to create.

Or if we want to create a single binary tool, it may be easier on Go since usql and ADBC drivers already exist. https://github.com/xo/usql

aljazerzen commented 1 year ago

What do you mean by integrates?

Uses PRQL to the extent that the user does not need to know SQL.

I was not aware of GlareDB - it is quite close to engine that I'm describing.

ADBC with Rust bindings C/C++ implementation would solve the "adapter" part of what I'm proposing, so it would make sense to wait for it / help implement it.

eitsupi commented 1 year ago

Yeah, GlareDB has a built-in prql-compiler and access to various data sources like Trino, works on local and server (Postgres protocol), so I think we could easily do a lot with PRQL.

aljazerzen commented 1 year ago

Reading much more about glaredb and ADBC, I can say that:

glaredb looks very promising as a data tool and it does very similar things to what I imagine the query engine would be. It uses Apache Datafusion as the "engine" part and mainly provides the CLI and an easy way to deploy a local Postgres instance and connect to multiple sources from that instance,
ADBC is exactly what we'd need - a standard for connecting to databases. Initially I though it was required for target database to support Flight SQL, but they do have a few different "drivers", that can connect to databases. I like their way of compiling, loading and managing the drivers, as they can be compiled separately from the main library. The Rust implementation is lacking, so that would be the starting point of any efforts in that direction.

All in all, I'm even more confident in building this, since I won't have to build the "adapters" part from scratch and will be able to focus on "project management" part.

snth commented 1 year ago

Yeah, I also think you will be better off going with something like GlareDB. I had planned to go with connector-x for prql-query at the time but it turns out that while their Python API is beautiful and simple, the Rust API was not.

OT: I didn't see anything about Trino in the GlareDB docs (https://docs.glaredb.com/docs/data-sources/supported/). @eitsupi where did you see that?

max-sixty commented 1 year ago

I do think you are underestimating the effect of having an execution platform. At the moment, having just the compiler means that running PRQL is something you need to setup yourself.

I definitely agree with this, I think this is one of the most important issues for the project...

Is there some synthesis of the tradeoffs where we can make it much easier to start using PRQL without needing to maintain so much non-PRQL-specific functionality? For example — I focus over the holidays on a Jupyter extension? Or we work to get GlareDB / an alternative in a shape that works really well with PRQL? Maybe that's what @aljazerzen means re:

All in all, I'm even more confident in building this, since I won't have to build the "adapters" part from scratch and will be able to focus on "project management" part.

?

eitsupi commented 1 year ago

OT: I didn't see anything about Trino in the GlareDB docs (docs.glaredb.com/docs/data-sources/supported). @eitsupi where did you see that?

I just wanted to say that both Trino and GlareDB can handle various data sources centrally.

snth commented 1 year ago

Is there some synthesis of the tradeoffs where we can make it much easier to start using PRQL without needing to maintain so much non-PRQL-specific functionality?

I agree with this. I see this effort as quite outside the scope of PRQL currently. That's why I agreed to archive prql-query. I think prql-query could be revived quite easily by cutting the DuckDB dependency and just bumping the prql-compiler and DataFusion versions. However while I really wanted something like this to play around with initially, now I think it's easy enough to just pipe the resulting SQL code to whatever CLI tool you fancy.

For example — I focus over the holidays on a Jupyter extension?

Doesn't pyprql (which is mostly a Jupyter extension) work pretty well at the moment? I've only used it with DuckDB but I think JupySQL uses SQLAlchemy under the hood so it should work with almost any database.

Or we work to get GlareDB / an alternative in a shape that works really well with PRQL?

I think we should pursue this. Currently we have:

ClickHouse: works natively. IDK what connections CH has to other RDBMs
DuckDB: extension available. DuckDB can also query SQLite and Postgres.
GlareDB: works natively. It can be used to query the following:
- Azure Blob Storage
- BigQuery
- GCS
- MongoDB
- MySQL
- Postgres
- S3
- SQL Server
- Snowflake
JDBC driver: I still need to test this again but I think with this you can query anything that you can query with JDBC.

So I think there are quite a lot of options. We just need to highlight them and make them easy to use/install.

One other integration I would like to see is Trino which then would also make a whole slew of other datasources available via Trino connectors.

snth commented 1 year ago

a standard way to connect to a database would be useful for a Language Server Protocol, for retrieving database schema, which would be used for generating table type definitions,

LSP would be useful. I don't know much about that ecosystem so won't comment on it now. I would like to say something about the second part:

for retrieving database schema, which would be used for generating table type definitions,

My current thinking on this is that we should split it into two parts:

schema representations that the compiler works with
obtaining those schema representations

In #381 we discussed the getting the table definitions from the database and this suggestion https://github.com/PRQL/prql/issues/381#issuecomment-1162512916 by rslabbert to have something like "declaration files" (to keep the compiler fast) really resonated with me. These days there are various forms to represent data schemas, e.g. JSONSchema, and moreover you now also get Schema Registries (e.g. see https://www.conduktor.io/blog/what-is-the-schema-registry-and-why-do-you-need-to-use-it/ or https://recap.build/).

Therefore I think for part 1 we should design an internal schema representation for use by the compiler and then create adapters to allow roundtrip conversions from the popular schema formats like JSONSchema.

For part 2 we could then integrate something that retrieves these schema definitions either directly from a database or from a schema registry or data catalog. One reason for this is that you would want to do this rather infrequently and for the most part the compiler should just work with the cached schema representation. This will keep the compiler fast and would also allow you to work in offline contexts. Another example of the offline context might be when working with something like a large dbt project where many or most of the models might not have been built yet but the resulting schemas should be inferable from the model definitions.

aljazerzen commented 1 year ago

I very much agree - the division is exactly how I'm thinking about this.

For the "schema representations that the compiler works with", we have an issue open here: #2646. I've updated it with a few more examples.

For part 2: yes, that's the ideal architecture here.

What's currently missing is a way to describe the "connection parameters": a database URL, username, password, a way to define those in the project and commit them to VCS, a way to overload them from local user settings. These are "project metadata", analogous to what's written in Cargo.toml or package.json.

eitsupi commented 1 year ago

What's currently missing is a way to describe the "connection parameters": a database URL, username, password, a way to define those in the project and commit them to VCS, a way to overload them from local user settings. These are "project metadata", analogous to what's written in Cargo.toml or package.json.

For this, vscode-dbt-power-user, which uses dbt configuration files to connect to the DB, may be helpful.

aljazerzen commented 10 months ago

Update: I've packaged Apache ADBC (and Apache nanoarrow) and started working on rust bindings to ADBC. I've abandoned this effort in favor to using ConnectorX (see justification in the linked README).

eitsupi commented 10 months ago

Thanks for sharing it! The Arrow team seems to be looking for feedback so perhaps it might be worth sharing your repository. https://github.com/apache/arrow-adbc/pull/416#issuecomment-1445496089 https://github.com/apache/arrow-adbc/pull/1326#discussion_r1406475818

PRQL / prql

PRQL query engine #3825