Open snth opened 2 years ago
I really like the idea behind the project: a well defined standard for relational algebra.
But regarding the implementation I cannot find an example how would a basic query look like. Am I missing something?
If I understand correctly, a Substrait plan should be isomorphic with our AST after materialization. That is: a pipeline of transformations on relational data.
So I've been using DataFusion as a backend for the prql cli tool I'm working on, and the more I'm looking into it, the more I think it's a framework that we should integrate into and which could solve a lot of problems for us. Kinda like with Substrait but this is a much easier integration and then we get Substrait support out of the box. @max-sixty and @aljazerzen I highly recommend you watch this video, particularly the timestamp I've linked to. It sounds like the aim for DataFusion is like an LLVM for databases. We should look at integrating PRQL as a 1st class Query Frontend for it:
Apache Arrow and DataFusion: Changing the Game for Implementing Database Systems
Another idea is whether is to use the datafusion-sql
query planner (tweet) (crate).
For example, to me the query plans read almost like PRQL in reverse
Sort: #state_tax DESC NULLS FIRST
Projection: #c.id, #c.first_name, #c.last_name, #COUNT(UInt8(1)) AS num_orders, #SUM(o.price) AS total_price, #SUM(o.price * s.sales_tax) AS state_tax
Aggregate: groupBy=[[#c.id, #c.first_name, #c.last_name]], aggr=[[COUNT(UInt8(1)), SUM(#o.price), SUM(#o.price * #s.sales_tax)]]
Filter: #o.price > Int64(0) AND #c.last_name LIKE Utf8("G%")
Inner Join: #c.id = #o.customer_id
Inner Join: #c.state = #s.id
SubqueryAlias: c
TableScan: customer
SubqueryAlias: s
TableScan: state
SubqueryAlias: o
TableScan: orders
So could we parse that and generate PRQL from it? That would allow us to round-trip PRQL->SQL->PRQL and we could then do some fuzz testing on PRQL?
Also, DataFusion has a DataFrame API which more easily maps onto the PRQL transforms we have. At that level I could even see myself writing a PRQL compiler that maps onto that API. The question I have whether they can produce SQL from it? Given that there is Substrait producer, it should be possible.
Transpiling to SQL gives us more universality but then we also have to deal with all the dialect problems. If Substrait became universally supported then this might be a much cleaner route.
@aljazerzen Regarding your question as to what the plans actually look like, it's a bit difficult at the moment because they've only implemented the protobuf representation. From the talk I gather though that they are working on a YAML representation which would help a lot here. There was a sample YAML plan that I saw in another talk but it was very basic. See also Text Serialization at Substrait.io.
The datafusion-substrait crate implements a producer which could be a good starting point for implementing our own producer.
The official Substrait Rust bindings are here https://github.com/substrait-io/substrait-rs.
This is an excellent overview of the Substrait project: Substrait: Rethinking DBMS Composability (VLDB 2022, CDMS workshop)
I see that quite some work has been done on substrait since I last checked it out. I have inspected their protobuf definitions for the query plans and altough quite complex, I think we could eventually use it as compilation target.
What's stopping us now is that there is no compiler from Subtrait to SQL (at least I can't find it). If we use substrait as compilation target, I wouldn't want to also support SQL target, since that would double the work for most of the development. At least before we finish major features, I would avoid having two targets.
DataFusion, as far as I understand, plans to also use substrait as their API for query plans, but it's not quite there yet. When it will be, we could compile PRQL to Substrait and then use DataFusion on that as query engine / optimizer.
I would avoid using DataFusion directly as compilation target, to keep number of targets low and because I am vary of fully depending on a project that is not yet established. I feel that depending on a specific query engine (DataFusion) is different - more risky - than depending on a protocol (Substrait).
substrait.io recently added a page showcasing projects that support Substrait. Not too much there yet but useful for keeping an eye on things.
That's a great list — it does look encouraging. Though the projects all look like they also support SQL too for the moment.
I'm mood-affiliated to support this, though until we can dispense with SQL, I'm not sure what the immediate benefits would be — maybe it's possible to express some things that SQL can't support? Or someone else builds a substrait->SQL transformer and we can have a single compile target?
My guess is that to do this, we would need someone who wants to take on a big project and drive this; and it would require becoming quite familiar with the backend of the compiler. It would be fine to start without committing to finish, though it would likely only be useful for users if it were mostly finished.
TIL of mood-affiliation.
Yes, I don't think there's much point in supporting it until that list gets larger.
The arguments for supporting it would be (if it was universally or at least widely supported):
I think the last point in particular is what we kinda touched on during one of the dev calls.
The vision (or dream at this point) would be that Substrait takes off, so that Query Frontends like us can focus on building Substrait Producers and Query Processing Engines like RDBMs would be motivated to implement Substrait Consumers so that they can be supported by as many Query Frontends as possible.
In reality, that was probably the same vision behind SQL, and that lasted until the first SQL dialect popped up, and Substrait may well end up having the same fate. I had a chat with the DuckDB devs on their Discord about this at one point and that seemed to be their take on Substrait - it's nice in theory but in practice the engine differences will come back to spoil the fun.
Blog post by Ibis on how to use Substrait and why they support it.
https://ibis-project.org/blog/ibis_substrait_to_duckdb/#why-wouldnt-i-just-use-sql-for-this
Substrait describes itself as Substrait: Cross-Language Serialization for Relational Algebra and its vision is to:
Given that PRQL compiles down to SQL, it should also map well onto the Relational Algebra concepts in Substrait.
There was a discussion around this sparked by a request for Ibis support on Discord: ibis thread on Discord
In fact the Ibis project are working on an ibis-substrait producer to serialize onto Substrait. The motivation that with this they could support hundreds of databases as backends without requiring specific adaptor for each. See this talk: Ibis and Substrait: Standardized Analytics for a good overview.
Edit: This is a better overview of the Substrait vision: Arrow and Substrait: Better Together
While Substrait appears to be still in its infancy, this may be a great way to support many, many backends without requiring to write individual SQL dialect adaptors.