datafusion-contrib / datafusion-federation

Allow DataFusion to resolve queries across remote query engines while pushing down as much compute as possible down.
Apache License 2.0
77 stars 18 forks source link

feat: Add AST analyzer middleware to optionally rewrite queries before execution #55

Closed phillipleblanc closed 2 months ago

phillipleblanc commented 2 months ago

Adds an AST analyzer to the SQLExecutor trait to allow modifying the query AST before execution.

phillipleblanc commented 2 months ago

This is used in the SQLite Table Provider to perform a non-trivial rewrite of interval expressions to use the SQLite native date() or datetime() functions, since SQLite doesn't natively support intervals.

https://github.com/datafusion-contrib/datafusion-table-providers/blob/main/src/sqlite/federation.rs#L49

jonmmease commented 1 month ago

Apologies for commenting in this old thread, but I'm working on learning about DataFusion's unparse sql generation support and datafusion-federation. Is there a reason (other than project logistics) that this kind of interval rewrite logic for sqllite should be done in an ast_analyzer step here in datafusion-table-providers rather than in the unparse logic in DataFusion itself?

I'm exploring replacing the sql-generation system we have in VegaFusion with unparse and datafusion-federation, so just trying to get a feel for where dialect customization should happen.

hozan23 commented 1 month ago

Hello @jonmmease, There was a discussion about this in issue #61

backkem commented 1 month ago

@jonmmease maybe there are already table providers in datafusion-contrib/datafusion-table-providers that meet your needs? These also implement federation. That are used if you register the federation optimizer.

jonmmease commented 1 month ago

Thanks @backkem, yes these table providers look great, and I'm excited to learn more about how federation works!

I'm thinking about eventually adding support for some additional dialects (like Clickhouse, Spark, MSSQL, etc) and wondering if this should primarily be contributed upstream in the unparse logic in the datafusion-sql crate, or if I should customize the generic dialect there with the proper quote style and then do semantic transformations (like different names for functions) in an ast_analyzer implementation.

backkem commented 1 month ago

RE where to land things: it's a good question. Right now I think we have roughly:

Things can evolve but this is the balance that emerged so far. I'm happy to host more plan re-writing tools/logic that is federation related in this repo.

jonmmease commented 1 month ago

Thanks for that context @backkem, I appreciate it and it's very helpful. As I've thought about it, it does make sense for the semantic customization of dialects to live alongside the table provider implementations for the sake of testing the generated SQL.

phillipleblanc commented 1 month ago

Yeah, I still think its not 100% clear where the responsibility of the Unparser dialect ends and where something like the AST rewriter in the specific implementation begins. Like @backkem, I tend to think that the core Unparser should mostly be focused on syntactical differences between the dialects whereas more complex differences in how, e.g., functions work can be implemented by the AST rewriter (as we did for handling intervals in SQLite: https://github.com/datafusion-contrib/datafusion-table-providers/pull/85)