Closed hozan23 closed 1 month ago
Makes sense to me. One caveat we've run into is that you can't blindly rewrite the plan on the DataFusion side, since DataFusion still needs the table names it knows about to work properly. The `LogicalPlan` generated by `rewrite_table_scan` is only useful as input to the Unparser.
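To make the caveat concrete, here is a minimal, self-contained sketch. It uses a toy `Plan` enum rather than DataFusion's real `LogicalPlan`. The `rewrite_table_scan` shown here is a simplified stand-in with an assumed signature, and `unparse` and `resolves_locally` are hypothetical helpers invented for illustration. The point is that the rewritten plan names tables the local catalog has never heard of, so it is only good for producing remote SQL.

```rust
use std::collections::{HashMap, HashSet};

/// Toy stand-in for a logical plan; NOT DataFusion's `LogicalPlan`.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    TableScan { table: String },
    Filter { predicate: String, input: Box<Plan> },
}

/// Simplified analogue of `rewrite_table_scan` (assumed signature): returns a
/// copy of the plan in which each scanned table name is replaced by its remote
/// name. The original plan is left untouched.
fn rewrite_table_scan(plan: &Plan, remote_names: &HashMap<String, String>) -> Plan {
    match plan {
        Plan::TableScan { table } => Plan::TableScan {
            table: remote_names
                .get(table)
                .cloned()
                .unwrap_or_else(|| table.clone()),
        },
        Plan::Filter { predicate, input } => Plan::Filter {
            predicate: predicate.clone(),
            input: Box::new(rewrite_table_scan(input, remote_names)),
        },
    }
}

/// Hypothetical toy unparser: renders the plan as SQL text for the remote engine.
fn unparse(plan: &Plan) -> String {
    match plan {
        Plan::TableScan { table } => format!("SELECT * FROM {table}"),
        Plan::Filter { predicate, input } => {
            format!("SELECT * FROM ({}) AS t WHERE {predicate}", unparse(input))
        }
    }
}

/// Hypothetical check: true if every scanned table is registered locally.
fn resolves_locally(plan: &Plan, catalog: &HashSet<String>) -> bool {
    match plan {
        Plan::TableScan { table } => catalog.contains(table),
        Plan::Filter { input, .. } => resolves_locally(input, catalog),
    }
}
```

With a local catalog that only knows `orders` and a mapping from `orders` to `remote_db.orders`, the original plan still resolves locally, while the rewritten one does not and is only useful as input to `unparse`.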
Yea, good point.
We have an implementation where we basically have two separate schemas: the external or "model" schema, represented by a `ModelSchemaProvider` and `ModelTableProvider`s, and an internal/storage schema, represented by a `StorageSchemaProvider` and `StorageTableProvider`s. The "model" schema is used during query parsing. The `ModelTableProvider`s have a method to fetch the corresponding `StorageTableProvider`. It's the rewrite analyser that rewrites the entire `LogicalPlan` from using `ModelTableProvider`s to `StorageTableProvider`s. The storage schema is used during execution.
I'd love to hear your insight on that approach. I think it's more general. For example, it would work when using plain table providers or when trying to federate using Substrait plans. On the other hand, it could introduce too many new abstractions into the mix.
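The two-schema design described above can be sketched roughly as follows. This is a toy model under stated assumptions, not the actual implementation: `ModelTable`, `StorageTable`, `ModelSchema`, `plan_scan`, and `rewrite_to_storage` are hypothetical names standing in for the providers and the rewrite analyser mentioned in the comment, and `Plan` is not DataFusion's `LogicalPlan`.

```rust
use std::collections::HashMap;

/// Toy storage-side table; stands in for a `StorageTableProvider`.
#[derive(Debug, Clone, PartialEq)]
struct StorageTable {
    physical_name: String,
}

/// Toy model-side table; stands in for a `ModelTableProvider`.
#[derive(Debug, Clone, PartialEq)]
struct ModelTable {
    model_name: String,
    storage: StorageTable,
}

impl ModelTable {
    /// The lookup described in the comment: fetch the corresponding storage table.
    fn storage_table(&self) -> &StorageTable {
        &self.storage
    }
}

/// Toy "model" schema, consulted while parsing/planning a query.
struct ModelSchema {
    tables: HashMap<String, ModelTable>,
}

impl ModelSchema {
    /// Resolve a table name from the query text into a model-level scan.
    fn plan_scan(&self, name: &str) -> Option<Plan> {
        self.tables.get(name).cloned().map(Plan::ModelScan)
    }
}

/// Toy plan that can reference either kind of table.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    ModelScan(ModelTable),
    StorageScan(StorageTable),
    Projection { columns: Vec<String>, input: Box<Plan> },
}

/// Toy "rewrite analyser": replaces every model scan with its storage scan,
/// recursing through the whole plan. Storage scans pass through unchanged.
fn rewrite_to_storage(plan: Plan) -> Plan {
    match plan {
        Plan::ModelScan(model) => Plan::StorageScan(model.storage_table().clone()),
        Plan::StorageScan(storage) => Plan::StorageScan(storage),
        Plan::Projection { columns, input } => Plan::Projection {
            columns,
            input: Box::new(rewrite_to_storage(*input)),
        },
    }
}
```

In this sketch, a query is planned against `ModelSchema`, passed through `rewrite_to_storage` once, and only the resulting storage-level plan would be handed to execution.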
In general I'm a fan of simpler systems when possible. I'm not really convinced that adding these abstractions buys us all that much. For plain table providers, it's relatively straightforward to keep state on the internal name, and when you call `.scan()` on it, you don't even have to care what name DataFusion has for you (we do this for all of our non-federated custom providers). For Substrait plan federation, I would just have its implementation call `rewrite_table_scan` directly, similar to how the SQL federation is doing it.
I do see how it would be nice, though, if that was already handled at a different layer so that the SQL/Substrait federation providers can just call the unparser directly on the plan they're given.
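The "keep state on the internal name" approach for plain table providers might look something like the sketch below. It is a toy model, not DataFusion's `TableProvider` trait: `RemoteTable`, `Catalog`, and their methods are hypothetical, and the real `scan` has a different signature and returns an execution plan rather than a SQL string.

```rust
use std::collections::HashMap;

/// Toy provider that remembers the name the backing store uses for the table.
struct RemoteTable {
    internal_name: String,
}

impl RemoteTable {
    /// Toy `scan`: builds the remote query from the provider's own state only.
    /// The name the table was registered under never reaches this method.
    fn scan(&self, projection: &[&str]) -> String {
        let cols = if projection.is_empty() {
            "*".to_string()
        } else {
            projection.join(", ")
        };
        format!("SELECT {cols} FROM {}", self.internal_name)
    }
}

/// Toy catalog mapping engine-visible names to providers.
struct Catalog {
    tables: HashMap<String, RemoteTable>,
}

impl Catalog {
    fn new() -> Self {
        Catalog { tables: HashMap::new() }
    }

    /// Register a provider under whatever name the query engine should see.
    fn register(&mut self, registered_name: &str, table: RemoteTable) {
        self.tables.insert(registered_name.to_string(), table);
    }

    /// Look up by the registered name, then let the provider use its own name.
    fn scan(&self, registered_name: &str, projection: &[&str]) -> Option<String> {
        self.tables
            .get(registered_name)
            .map(|table| table.scan(projection))
    }
}
```

Here a table registered as `sales` but stored as `warehouse.public.sales_v2` is queried by users under the former name and scanned under the latter, with no plan rewrite involved.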
Yea, I guess our use-case for this is somewhat different from federation. It's more related to creating a separate "model" or "semantic layer" on top of the storage.
Hello,
We are thinking of moving the recently added `rewrite_table_scan` function into an `AnalyzerRule` instead of running it in the `ExecutionPlan`. This change makes sense since the `AnalyzerRule` is responsible for the semantic changes when transforming the `LogicalPlan`s. What do you think? @phillipleblanc
cc: @backkem