datafusion-contrib / datafusion-federation

Allow DataFusion to resolve queries across remote query engines while pushing down as much compute as possible down.
Apache License 2.0
73 stars 17 forks source link

Delete the rewrite_table_scan from the ExecutionPlan and use it within an AnalyzerRule #61

Closed hozan23 closed 1 month ago

hozan23 commented 1 month ago

Hello,

We are thinking of moving the recently added rewrite_table_scan function into an AnalyzerRule instead of running it in the ExecutionPlan. This change makes sense since the AnalyzerRule is responsible for the semantic changes when transforming the LogicalPlans.

What do you think? @phillipleblanc

cc: @backkem

phillipleblanc commented 1 month ago

Makes sense to me. One caveat we've run into is that you can't blindly rewrite the plan on the DataFusion side, since DataFusion still needs the table names it knows about to work properly. The LogicalPlan generated by rewrite_table_scan is only useful as input to the Unparser

backkem commented 1 month ago

Yea, good point.

We have an implementation where we basically have two separate schemas: the external or "model" schema represented by a ModelSchemaProvider and ModelTableProviders and a internal/storage schema represented by a StorageSchemaProvider and StorageTableProviders. The former "model" schema is used during query parsing. The ModelTableProviders have a method to fetch the corresponding StorageTableProvider. It's the rewrite analyser that rewrites the entire LogicalPlan from using ModelTableProviders to StorageTableProviders. The storage schema is used during execution.

I'd love to hear your insight on that approach. I think it's more general. For example, it would work when using plain table providers or when trying to federate using Substrait plans. On the other hand, it could introduce too many new abstraction into the mix.

phillipleblanc commented 1 month ago

In general I'm a fan of simpler systems when possible. I'm not really convinced that adding these abstractions buys us all that much. i.e. for plain table providers, its relatively straightforward to keep state on the internal name and when you call .scan() on it, you don't even have to care what the name datafusion has for you (we do this for all of our non-federated custom providers). For Substrait plan federation, I would just have its implementation call rewrite_table_scan directly - similar to how the SQL federation is doing it.

I do see how it would be nice though if that was already handled at a different layer so that the SQL/Substrait federation providers can just call the unparser directly on the plan its given.

backkem commented 1 month ago

Yea, I guess our use-case for this is somewhat different from federation. It's more related to creating a separate "model" or "semantic layer" on/over top of the storage.