apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.51k stars 1.03k forks source link

Make SQL strings generated from `Expr`s "prettier" #10557

Closed alamb closed 1 month ago

alamb commented 1 month ago

Is your feature request related to a problem or challenge?

Part of https://github.com/apache/datafusion/issues/9494

As @backkem says https://github.com/apache/datafusion/pull/10528#issuecomment-2116068547 on https://github.com/apache/datafusion/pull/10528

Currently, expressions from the DataFusion SQL unparser (aka expr --> String) are somewhat ugly

For example the expression col("a").eq(lit(5)) would be rendered as a = 5 by most poeple if they did it manaully, but DataFusion's unparser currently renders it like "a" = 5 (with extra quotes).

DataFusion also puts in quotes to make the order of operations explicit -- so instead of a < 5 AND b < 10 it would render ("a" < 5) AND ("b" < 10)

The current unparser is conservative and likely works well for when generating SQL for consumptions by other database systems. However, the SQL is not as nice for human consumption

Here is another instance from the example https://github.com/apache/datafusion/blob/98647e842a85b768ea0cb0f8ccf1016636001abb/datafusion-examples/examples/plan_to_sql.rs#L50-L53

Describe the solution you'd like

If we want to make the generated SQL easier to read by humans / more succint, these steps will have to be made "smarter".

Describe alternatives you've considered

Potential ideas:

  1. We'll have to add in the math rules to avoid unneeded parentheses
  2. (likely dialect specific) rules for determining of quoting is needed.

Note that the latter likely involves listing out the reserved keywords for each dialect.

Additional context

No response

goldmedal commented 1 month ago

Provide something I surveyed.

I think we can follow how Calcite handles the quoted issue. The SqlDialect of Calcite has a check rule identifierNeedsQuote.

https://github.com/apache/calcite/blob/aba64f0b217093b500629fe07a0befdc68293fbc/core/src/main/java/org/apache/calcite/sql/SqlDialect.java#L413-L415

It can be overridden according to the specific data source, such as BigQuery:

https://github.com/apache/calcite/blob/50a20824c4536450dcae963b5e757cf4bbc7e406/core/src/main/java/org/apache/calcite/sql/dialect/BigQuerySqlDialect.java#L106-L109

They implement some rules, such as regex patterns or reserved word lists. I think a dialect-specific rule is a nice choice.

backkem commented 1 month ago

Indeed, there is already a function on the sqlparser::dialect trait that takes this into account:

https://github.com/sqlparser-rs/sqlparser-rs/blob/54184460b5d873a67c2801e8b7c6e4f145bc65df/src/dialect/mod.rs#L113-L116

The dialect specific implementations just need to be expanded on. For now they just always return a conservative quote character.

goldmedal commented 1 month ago

https://github.com/sqlparser-rs/sqlparser-rs/blob/54184460b5d873a67c2801e8b7c6e4f145bc65df/src/dialect/mod.rs#L113-L116

The dialect specific implementations just need to be expanded on. For now they just always return a conservative quote character.

I think it's same as the dialect in datafusion::unparser::dialect https://github.com/apache/datafusion/blob/e7858ff0ab1c282ab46bd93cabc3dc83db583165/datafusion/sql/src/unparser/dialect.rs#L28 However, what we need is a checker to check if the identifier needs to be quoted. I think I can make a PR for DefaultDialect first.

goldmedal commented 1 month ago

As the mentioned in dialect.rs https://github.com/apache/datafusion/blob/e7858ff0ab1c282ab46bd93cabc3dc83db583165/datafusion/sql/src/unparser/dialect.rs#L19

I think we need to use the Dialect in sqlparser-rs instead and extract identifier_needs_quote in #10573 to sqlparser-rs. Just like https://github.com/sqlparser-rs/sqlparser-rs/pull/1170

backkem commented 1 month ago

Yes, these are basically the same object. The one in DataFusion was put there temporarily until the trait extension in the sqlparser repo is landed and pushed to crates.io. This may have happened in the meantime.

alamb commented 1 month ago

https://github.com/apache/datafusion/pull/10392 is the upgrade to sqlparser -- I think it is pretty close but @tisonkun hit an issue during upgrade.

tisonkun commented 1 month ago

10392 is the upgrade to sqlparser -- I think it is pretty close but @tisonkun hit an issue during upgrade.

We may need a 0.46.1 for resolving the regressions:

I've locally confirmed that array.slt is last failure for cargo test --test sqllogictests.

goldmedal commented 1 month ago

I'm not sure but I think we can merge #10573 first because it also fix many unpasring tests. Then, I'll create PR for sqlparser to add the check rule in dialect.

alamb commented 1 month ago

FWI https://github.com/apache/datafusion/pull/10573 is merged!

backkem commented 1 month ago

Do we split off a ticket reduce the nr of brackets emitted?

alamb commented 1 month ago

Do we split off a ticket reduce the nr of brackets emitted?

Excellent call -- I filed https://github.com/apache/datafusion/issues/10633