Open alamb opened 2 days ago
I looked into this --
One of the major challenges is that SessionState's
constructor basically installs the "pre-provided" functionality (like data sources, etc
One way to handle this would be to allow constructing SessionState with the minimal built in features, and have a function in SessionContext
like SessionContext::register_built_ins
that would register things like listing table, information schema, etc.
That way it would still be easy to use DataFusion with a minimal SessionState but also easily register all the built in extensions 🤔
I made https://github.com/apache/datafusion/pull/11183 to start breaking apart the API and implementation -- there is still a ways to go
@alamb I took a stab at moving parquet functionality into a datafusion-parquet crate (#11188) , and I ran into similar challenges you highlight here. I think to accomplish these goals core
will need to be refactored into a number of different crates to avoid circular dependencies and allow core to still offer a batteries included experience.
I think to accomplish these goals core will need to be refactored into a number of different crates to avoid circular dependencies and allow core to still offer a batteries included experience.
I agree with this entirely
The center of the knot is SessionState I think -- figuring out how to get that out of the core is likely key to breaking things up reasonably
The more I think about this the more I think trying to make SessionState
a container that doesn't have all the optional features (like parquet support) by default makes sense
So like
let session_state = SessionState::new();
// no table providers, etc
// install standard built in table providers
SessionContex::install_built_in(&mut session_state);
// now session_state has them here
Is your feature request related to a problem or challenge?
As we start thinking of adding new catalog support in DataFusion for example, @goldmedal tried to do this for DynamicFileCatalog in #11035 , the current code structure requires that any such implementation be in
datafusion/core
which is already quite large and monolithicAlso, I think long term it would be a more sustainable structure if we can move ListingTable out of the core as well
However, this move is not likely possible until we move table provider and related catalog traits out of the core.
Also, if we could split up datafusion core more, it would be easier for people to use DataFusion as a smaller embedded library (aka not bring in file format support if they didn't need it)
Describe the solution you'd like
Ideally I would like the following traits / code moved into a
datafusion-catalog
crate:TableProvider
CatalogProvider
SchemaProvider
MemoryCatalogProvider
InformationSchemaProvider
Describe alternatives you've considered
No response
Additional context
No response