apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.54k stars 1.03k forks source link

Support LogicalPlan --> `SQL String` #8661

Open tempbottle opened 6 months ago

tempbottle commented 6 months ago

Part of https://github.com/apache/datafusion/issues/9494

Is your feature request related to a problem or challenge?

No response

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

Sub tasks

alamb commented 6 months ago

Hi @tempbottle -- no sadly there is no way that I know if to go from logical plan --> SQL

I think it would be possible in theory, though probably a non trivial amount of code

A required step would be to go from Expr --> SQL expression, which might be a good place to start

devinjdangelo commented 6 months ago

I looked into this myself recently and also concluded that DataFusion has no way currently to recover the SQL strings that generated a LogicalPlan or parts of it (Exprs). Likewise, I don't think sqlparser-rs has any mechanism to go from AST back to raw SQL. Either case would be non trivial to implement, and raises the question of what SQL dialect you want to support.

I'm not sure what your use case is @tempbottle, but in my case I am looking at implementing TableProviders over top of other engines which accept SQL. For this use case, it may be more fruitful to look at https://substrait.io/. DataFusion does have the capability to convert a LogicalPlan to a Substrait plan. If the engine of interest does not consume Substrait, you are probably stuck writing custom glue code to map a LogicalPlan or Exprs into whatever SQL dialect or DSL that the target engine supports.

I'll also point out that GlareDb has some examples of mapping Exprs to SQL strings for the purposes of implementing a TableProvider. See e.g. https://github.com/GlareDB/glaredb/blob/ee4b440d9a594150926ce0a5fd96db400185c881/crates/datasources/src/postgres/mod.rs#L1242

alamb commented 6 months ago

Likewise, I don't think sqlparser-rs has any mechanism to go from AST back to raw SQL.

I am pretty sure the Display impl in sqlparser-rs recovers SQL from the parsed AST (as we have had many PRs related to keeping it working)

I agree substrait.io is a good idea for connecting other databases.

matthewmturner commented 6 months ago

@tempbottle @devinjdangelo FYI Ibis has a to_sql method that can be used to generate SQL and DataFusion is a backend for Ibis however I'm not sure if it has all the features required for to_sql to work (the latest i saw was that DataFusion still had a lot of gaps to fill with the Ibis wrapper). It also may not be the API you were looking for, but thought it could help as you could see how they implement it.

https://ibis-project.org/tutorials/ibis-for-sql-users

devinjdangelo commented 6 months ago

I am pretty sure the Display impl in sqlparser-rs recovers SQL from the parsed AST (as we have had many PRs related to keeping it working)

That's very interesting i'll have to look more closely at that. Thanks @alamb !

alamb commented 4 months ago

FYI I filed https://github.com/apache/arrow-datafusion/issues/9495 for the Expr --> String feature. That would likely be a necesary first step for a full LogicalPlan --> SQL conversion

backkem commented 4 months ago

I split out datafusion-federation's SQL Writer code into its own package and added an example for expressions and plans. I intend to mature it over time. It's open to contributions and/or moving to a more canonical place.

alamb commented 4 months ago

For anyone else following along, there is substantial discussion on https://github.com/apache/arrow-datafusion/issues/9495#issuecomment-1985057845 worth reading

backkem commented 4 months ago

Does it make sense for me to lay the same 'groundwork' for the LogicalPlan serialization as I did for the Expr in #9517? The suggestion would be similar: Upstream the datafusion-federation sql-write code as follows:

Any early thoughts or feedback on the datafusion-federation code sql-write code is appreciated, so I can address it in the process.

devinjdangelo commented 4 months ago

Does it make sense for me to lay the same 'groundwork' for the LogicalPlan serialization as I did for the Expr in https://github.com/apache/arrow-datafusion/pull/9517?

I am in support of this. We should be cautious about what we expose via public interfaces at first, since I think it is likely we will require many breaking changes to support all possible logical plans. The top level plan_to_sql function should be stable, but we will need freedom to rework methods below that.

Upstream the AST builders to sqlparser-rs.

I am a bit surprised there isn't an ergonomic way to build an AST already in sqlparser. I can see that the DFParser constructs Statements directly without a builder struct.

If the AST builder functionality is not particularly useful in sqlparser, it is likely best to include that in datafusion only rather than upstream to sqlparser. Do you think the AST builders would be useful upstream @alamb?

alamb commented 4 months ago

Does it make sense for me to lay the same 'groundwork' for the LogicalPlan serialization as I did for the Expr in #9517?

I agree with @devinjdangelo that I think it does.

If the AST builder functionality is not particularly useful in sqlparser, it is likely best to include that in datafusion only rather than upstream to sqlparser. Do you think the AST builders would be useful upstream @alamb?

I am not sure, to be honest. I think if we upstreamed a AST builder style thing in sqlparser, we would also want to migrate the existing parser to use it (and therefore take advantage of the existing tests). I think as long as it were backwards compatible it would make a nice contribution.

However, I would also say there is no compelling need to upstream it either. Having it start life in DataFusion would also be fine and then maybe upstream it in sqlparser

backkem commented 3 months ago

The initial PR (#9596) has landed. For anyone interested in helping to flesh out the LogicalPlan serialization to SQL, a good place to start are the remaining not_impl_err! statements in plan.rs. Add an integration test in roundtrip_statement in sql::tests and try to make it pass!

seddonm1 commented 3 months ago

Independently I have spent a fair bit of time on this problem. I think this current design of going to the sqlparser AST is a better approach than trying to go to SQL directly.

One thing that really helped was that I went and copied ~150ish queries from https://www.w3resource.com/sql-exercises/adventureworks/adventureworks-exercises.php to do a roundtrip test against. These are licensed https://creativecommons.org/licenses/by-nc-sa/3.0/deed.en so could be added to the test suite.

As you start to add more queries you start to see a lot of edge cases (particularly relating to how aggregations work) that need to be dealt with.