apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.49k stars 1.02k forks source link

Break datafusion-catalog code into its own crate #11182

Open alamb opened 2 days ago

alamb commented 2 days ago

Is your feature request related to a problem or challenge?

As we start thinking of adding new catalog support in DataFusion for example, @goldmedal tried to do this for DynamicFileCatalog in #11035 , the current code structure requires that any such implementation be in datafusion/core which is already quite large and monolithic

Also, I think long term it would be a more sustainable structure if we can move ListingTable out of the core as well

However, this move is not likely possible until we move table provider and related catalog traits out of the core.

Also, if we could split up datafusion core more, it would be easier for people to use DataFusion as a smaller embedded library (aka not bring in file format support if they didn't need it)

Describe the solution you'd like

Ideally I would like the following traits / code moved into a datafusion-catalog crate:

Describe alternatives you've considered

No response

Additional context

No response

alamb commented 2 days ago

I looked into this --

One of the major challenges is that SessionState's constructor basically installs the "pre-provided" functionality (like data sources, etc

https://github.com/apache/datafusion/blob/78055fe13b64cdfb4accd131e9a462770c21a934/datafusion/core/src/execution/session_state.rs#L190-L295

One way to handle this would be to allow constructing SessionState with the minimal built in features, and have a function in SessionContext like SessionContext::register_built_ins that would register things like listing table, information schema, etc.

That way it would still be easy to use DataFusion with a minimal SessionState but also easily register all the built in extensions 🤔

alamb commented 2 days ago

I made https://github.com/apache/datafusion/pull/11183 to start breaking apart the API and implementation -- there is still a ways to go

devinjdangelo commented 2 days ago

@alamb I took a stab at moving parquet functionality into a datafusion-parquet crate (#11188) , and I ran into similar challenges you highlight here. I think to accomplish these goals corewill need to be refactored into a number of different crates to avoid circular dependencies and allow core to still offer a batteries included experience.

alamb commented 2 days ago

I think to accomplish these goals core will need to be refactored into a number of different crates to avoid circular dependencies and allow core to still offer a batteries included experience.

I agree with this entirely

The center of the knot is SessionState I think -- figuring out how to get that out of the core is likely key to breaking things up reasonably

alamb commented 1 day ago

The more I think about this the more I think trying to make SessionState a container that doesn't have all the optional features (like parquet support) by default makes sense

So like

let session_state = SessionState::new();
// no table providers, etc
// install standard built in table providers
SessionContex::install_built_in(&mut session_state);
// now session_state has them here