PRQL / prql

PRQL is a modern language for transforming data — a simple, powerful, pipelined SQL replacement
https://prql-lang.org
Apache License 2.0
9.96k stars 218 forks source link

Support Substrait as a backend #738

Open snth opened 2 years ago

snth commented 2 years ago

Substrait describes itself as Substrait: Cross-Language Serialization for Relational Algebra and its vision is to:

Create a well-defined, cross-language specification for data compute operations

Given that PRQL compiles down to SQL, it should also map well onto the Relational Algebra concepts in Substrait.

There was a discussion around this sparked by a request for Ibis support on Discord: ibis thread on Discord

In fact the Ibis project are working on an ibis-substrait producer to serialize onto Substrait. The motivation that with this they could support hundreds of databases as backends without requiring specific adaptor for each. See this talk: Ibis and Substrait: Standardized Analytics for a good overview.

Edit: This is a better overview of the Substrait vision: Arrow and Substrait: Better Together

While Substrait appears to be still in its infancy, this may be a great way to support many, many backends without requiring to write individual SQL dialect adaptors.

aljazerzen commented 2 years ago

I really like the idea behind the project: a well defined standard for relational algebra.

But regarding the implementation I cannot find an example how would a basic query look like. Am I missing something?

If I understand correctly, a Substrait plan should be isomorphic with our AST after materialization. That is: a pipeline of transformations on relational data.

snth commented 2 years ago

So I've been using DataFusion as a backend for the prql cli tool I'm working on, and the more I'm looking into it, the more I think it's a framework that we should integrate into and which could solve a lot of problems for us. Kinda like with Substrait but this is a much easier integration and then we get Substrait support out of the box. @max-sixty and @aljazerzen I highly recommend you watch this video, particularly the timestamp I've linked to. It sounds like the aim for DataFusion is like an LLVM for databases. We should look at integrating PRQL as a 1st class Query Frontend for it:

Apache Arrow and DataFusion: Changing the Game for Implementing Database Systems

https://github.com/apache/arrow-datafusion

https://github.com/datafusion-contrib/datafusion-substrait

snth commented 2 years ago

Another idea is whether is to use the datafusion-sql query planner (tweet) (crate).

For example, to me the query plans read almost like PRQL in reverse

Sort: #state_tax DESC NULLS FIRST
  Projection: #c.id, #c.first_name, #c.last_name, #COUNT(UInt8(1)) AS num_orders, #SUM(o.price) AS total_price, #SUM(o.price * s.sales_tax) AS state_tax
    Aggregate: groupBy=[[#c.id, #c.first_name, #c.last_name]], aggr=[[COUNT(UInt8(1)), SUM(#o.price), SUM(#o.price * #s.sales_tax)]]
      Filter: #o.price > Int64(0) AND #c.last_name LIKE Utf8("G%")
        Inner Join: #c.id = #o.customer_id
          Inner Join: #c.state = #s.id
            SubqueryAlias: c
              TableScan: customer
            SubqueryAlias: s
              TableScan: state
          SubqueryAlias: o
            TableScan: orders

So could we parse that and generate PRQL from it? That would allow us to round-trip PRQL->SQL->PRQL and we could then do some fuzz testing on PRQL?

snth commented 2 years ago

Also, DataFusion has a DataFrame API which more easily maps onto the PRQL transforms we have. At that level I could even see myself writing a PRQL compiler that maps onto that API. The question I have whether they can produce SQL from it? Given that there is Substrait producer, it should be possible.

Transpiling to SQL gives us more universality but then we also have to deal with all the dialect problems. If Substrait became universally supported then this might be a much cleaner route.

snth commented 2 years ago

@aljazerzen Regarding your question as to what the plans actually look like, it's a bit difficult at the moment because they've only implemented the protobuf representation. From the talk I gather though that they are working on a YAML representation which would help a lot here. There was a sample YAML plan that I saw in another talk but it was very basic. See also Text Serialization at Substrait.io.


The datafusion-substrait crate implements a producer which could be a good starting point for implementing our own producer.

The official Substrait Rust bindings are here https://github.com/substrait-io/substrait-rs.


This is an excellent overview of the Substrait project: Substrait: Rethinking DBMS Composability (VLDB 2022, CDMS workshop)

image

image

aljazerzen commented 2 years ago

I see that quite some work has been done on substrait since I last checked it out. I have inspected their protobuf definitions for the query plans and altough quite complex, I think we could eventually use it as compilation target.

What's stopping us now is that there is no compiler from Subtrait to SQL (at least I can't find it). If we use substrait as compilation target, I wouldn't want to also support SQL target, since that would double the work for most of the development. At least before we finish major features, I would avoid having two targets.


DataFusion, as far as I understand, plans to also use substrait as their API for query plans, but it's not quite there yet. When it will be, we could compile PRQL to Substrait and then use DataFusion on that as query engine / optimizer.

I would avoid using DataFusion directly as compilation target, to keep number of targets low and because I am vary of fully depending on a project that is not yet established. I feel that depending on a specific query engine (DataFusion) is different - more risky - than depending on a protocol (Substrait).

snth commented 1 year ago

substrait.io recently added a page showcasing projects that support Substrait. Not too much there yet but useful for keeping an eye on things.

Powered by Substrait

max-sixty commented 1 year ago

That's a great list — it does look encouraging. Though the projects all look like they also support SQL too for the moment.

I'm mood-affiliated to support this, though until we can dispense with SQL, I'm not sure what the immediate benefits would be — maybe it's possible to express some things that SQL can't support? Or someone else builds a substrait->SQL transformer and we can have a single compile target?

My guess is that to do this, we would need someone who wants to take on a big project and drive this; and it would require becoming quite familiar with the backend of the compiler. It would be fine to start without committing to finish, though it would likely only be useful for users if it were mostly finished.

snth commented 1 year ago

TIL of mood-affiliation.

Yes, I don't think there's much point in supporting it until that list gets larger.

The arguments for supporting it would be (if it was universally or at least widely supported):

I think the last point in particular is what we kinda touched on during one of the dev calls.

The vision (or dream at this point) would be that Substrait takes off, so that Query Frontends like us can focus on building Substrait Producers and Query Processing Engines like RDBMs would be motivated to implement Substrait Consumers so that they can be supported by as many Query Frontends as possible.

In reality, that was probably the same vision behind SQL, and that lasted until the first SQL dialect popped up, and Substrait may well end up having the same fate. I had a chat with the DuckDB devs on their Discord about this at one point and that seemed to be their take on Substrait - it's nice in theory but in practice the engine differences will come back to spoil the fun.

snth commented 1 year ago

Blog post by Ibis on how to use Substrait and why they support it.

https://ibis-project.org/blog/ibis_substrait_to_duckdb/#why-wouldnt-i-just-use-sql-for-this