apache / datafusion

Apache DataFusion SQL Query Engine
Apache License 2.0

[Epic] Extract catalog functionality from the core to make it more modular #10782

Open alamb opened 3 weeks ago

alamb commented 3 weeks ago

Is your feature request related to a problem or challenge?

As @goldmedal started trying to move the DynamicFileProvider so others could use it in https://github.com/apache/datafusion/pull/10745 I think it is clear that there is not a good way to add additional catalog support in the core without everything being intertwined.

Thus I think we should try to extract the different catalog providers out of the datafusion core so it is easier to add additional catalog support.

Describe the solution you'd like

I suggest the following final layout:

  1. traits like CatalogProvider, SchemaProvider, etc in a new crate datafusion-catalog (since these traits rely on table provider, etc I think this can't be in datafusion-common or datafusion-expr)
  2. The built in Memory* providers are in datafusion-catalog
  3. The built in InformationSchema providers are in datafusion-catalog
  4. The newly proposed DynamicFileCatalog in datafusion-catalog
  5. (eventually) the ListingTableProvider (which is by far the most complicated) moved to its own crate datafusion-catalog-listing
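
To make the proposed layering concrete, here is a rough sketch of how the traits and the Memory* providers from items 1 and 2 might sit together in one small crate. The trait and struct names mirror the ones mentioned above, but the method signatures are simplified stand-ins written for illustration (assumptions, not the real DataFusion API, which is richer).

```rust
use std::collections::HashMap;
use std::sync::Arc;

// NOTE: simplified stand-ins for illustration only; these are NOT the real
// DataFusion signatures.

/// A table. The real trait also covers schemas, scans, statistics, etc.
pub trait TableProvider: Send + Sync {
    fn column_names(&self) -> Vec<String>;
}

/// A schema is a named collection of tables.
pub trait SchemaProvider: Send + Sync {
    fn table_names(&self) -> Vec<String>;
    fn table(&self, name: &str) -> Option<Arc<dyn TableProvider>>;
}

/// A catalog is a named collection of schemas.
pub trait CatalogProvider: Send + Sync {
    fn schema_names(&self) -> Vec<String>;
    fn schema(&self, name: &str) -> Option<Arc<dyn SchemaProvider>>;
}

/// In-memory table that only remembers its column names.
pub struct MemTable {
    columns: Vec<String>,
}

impl MemTable {
    pub fn new(columns: Vec<String>) -> Self {
        Self { columns }
    }
}

impl TableProvider for MemTable {
    fn column_names(&self) -> Vec<String> {
        self.columns.clone()
    }
}

/// In-memory schema: a map from table name to table.
#[derive(Default)]
pub struct MemorySchemaProvider {
    tables: HashMap<String, Arc<dyn TableProvider>>,
}

impl MemorySchemaProvider {
    pub fn register_table(&mut self, name: &str, table: Arc<dyn TableProvider>) {
        self.tables.insert(name.to_string(), table);
    }
}

impl SchemaProvider for MemorySchemaProvider {
    fn table_names(&self) -> Vec<String> {
        let mut names: Vec<String> = self.tables.keys().cloned().collect();
        names.sort(); // HashMap order is arbitrary; sort for stable output
        names
    }

    fn table(&self, name: &str) -> Option<Arc<dyn TableProvider>> {
        self.tables.get(name).cloned()
    }
}

/// In-memory catalog: a map from schema name to schema.
#[derive(Default)]
pub struct MemoryCatalogProvider {
    schemas: HashMap<String, Arc<dyn SchemaProvider>>,
}

impl MemoryCatalogProvider {
    pub fn register_schema(&mut self, name: &str, schema: Arc<dyn SchemaProvider>) {
        self.schemas.insert(name.to_string(), schema);
    }
}

impl CatalogProvider for MemoryCatalogProvider {
    fn schema_names(&self) -> Vec<String> {
        let mut names: Vec<String> = self.schemas.keys().cloned().collect();
        names.sort();
        names
    }

    fn schema(&self, name: &str) -> Option<Arc<dyn SchemaProvider>> {
        self.schemas.get(name).cloned()
    }
}
```

Nothing in this sketch needs a file format reader or object_store, which is the point of keeping the traits and the Memory* providers apart from anything like ListingTable.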

Describe alternatives you've considered

No response

Additional context

No response

alamb commented 3 weeks ago

If this seems like a reasonable idea to people I will file tickets to break down the work

cc @andygrove @jayzhan211 @comphead @mustafasrepo for your thoughts

lewiszlw commented 3 weeks ago

I agree with this direction. But now this seems hard to achieve because SchemaProvider depends on TableProvider and TableProvider depends on SessionState.

comphead commented 3 weeks ago

Thanks @alamb for starting this discussion. If I understood correctly there is a use case for using catalog abstractions/implementation but without datafusion core?

As @lewiszlw correctly mentioned, we have some coupling between the providers and the core. I'm just trying to understand the use case where providers are needed without the core

jayzhan211 commented 3 weeks ago

Is it due to the complexity of ListingTable that it gets its own crate? If they have things in common then it is better to organize them into one crate. If ListingTable is so different from the others, it is nice to have an independent crate

alamb commented 3 weeks ago

> If I understood correctly there is a use case for using catalog abstractions/implementation but without datafusion core?

In my mind the real use case is to more easily use datafusion without having to bring in all the dependencies of ListingTable (like parquet, avro, json, etc)

So the real usecase is getting ListingTable out of the core. But since the catalog API is in the core now there is no way to get ListingTable out of the core without also first moving the catalog API

> Is it due to the complexity of ListingTable that it gets its own crate? If they have things in common then it is better to organize them into one crate. If ListingTable is so different from the others, it is nice to have an independent crate

I think it is both the complexity of ListingTable and its dependency tree (e.g. parquet-rs and avro and json and object_store and ...)

For use cases like WASM it is quite messy to have the API split up like it currently is

alamb commented 3 weeks ago

> I agree with this direction. But now this seems hard to achieve because SchemaProvider depends on TableProvider and TableProvider depends on SessionState.

I agree @lewiszlw -- well put. I made a first PR to start detangling things here: https://github.com/apache/datafusion/pull/10794 (it just splits SessionState into its own module)

Longer term we would have to figure out where SessionState would live (it still depends on several things in the core crate like datasource::provider and datasource::function) 🤔

Maybe we could look into splitting out datafusion-datasource / datafusion-datasource-parquet / datafusion-datasource-avro, etc -- I don't have time to drive this at the moment but would be interested in helping anyone who did

comphead commented 3 weeks ago

> > If I understood correctly there is a use case for using catalog abstractions/implementation but without datafusion core?

> In my mind the real use case is to more easily use datafusion without having to bring in all the dependencies of ListingTable (like parquet, avro, json, etc)

> So the real usecase is getting ListingTable out of the core. But since the catalog API is in the core now there is no way to get ListingTable out of the core without also first moving the catalog API

Hm... they probably strive to have their own readers/writers, perhaps other than the arrow-rs implementation; that makes sense to me. And yes, if DF stands for extensibility we should make this happen. Not sure how difficult that can be though. We probably need to start by replacing core abstractions with traits instead of implementations to decouple them.

alamb commented 3 weeks ago

> Hm... they probably strive to have their own readers/writers, perhaps other than the arrow-rs implementation; that makes sense to me. And yes, if DF stands for extensibility we should make this happen. Not sure how difficult that can be though. We probably need to start by replacing core abstractions with traits instead of implementations to decouple them.

Yes, something like this -- I think most of the traits already exist (e.g. CatalogProvider), but figuring out how to decouple SessionState (which is referred to by CatalogProvider) is the trickiest bit, I think
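
As a rough illustration of the "traits instead of implementations" idea, the sketch below has the table trait depend on a small session trait rather than on the concrete SessionState. Everything here is a hypothetical stand-in: the `Session` trait, its `batch_size` method, `planned_batches`, `NumbersTable` and `FixedSession` are invented for the example and are not existing DataFusion APIs.

```rust
// --- Part 1: what could live in a small, dependency-light crate ---

/// Hypothetical minimal view of a session: only what a table needs to know.
pub trait Session {
    fn batch_size(&self) -> usize;
}

/// Simplified stand-in for a table trait. It takes `&dyn Session`, so it never
/// has to name the concrete `SessionState` type.
pub trait TableProvider: Send + Sync {
    /// Number of batches a scan would produce under the given session.
    fn planned_batches(&self, session: &dyn Session) -> usize;
}

/// Toy table with a fixed number of rows.
pub struct NumbersTable {
    rows: usize,
}

impl NumbersTable {
    pub fn new(rows: usize) -> Self {
        Self { rows }
    }
}

impl TableProvider for NumbersTable {
    fn planned_batches(&self, session: &dyn Session) -> usize {
        // Guard against a zero batch size, then round up.
        let batch = session.batch_size().max(1);
        (self.rows + batch - 1) / batch
    }
}

// --- Part 2: what could stay in the core crate ---

/// Stand-in for the heavyweight session type (planner, optimizer rules, ...).
pub struct SessionState {
    batch_size: usize,
}

impl SessionState {
    pub fn new(batch_size: usize) -> Self {
        Self { batch_size }
    }
}

impl Session for SessionState {
    fn batch_size(&self) -> usize {
        self.batch_size
    }
}

// --- Part 3: a user without the core crate (e.g. a slim WASM build) ---

/// Minimal session supplied by the embedding application.
pub struct FixedSession(pub usize);

impl Session for FixedSession {
    fn batch_size(&self) -> usize {
        self.0
    }
}
```

With this shape, Part 1 compiles without Part 2, so the dependency arrow points from the core crate to the catalog crate instead of the other way around. For example, `NumbersTable::new(2500).planned_batches(&SessionState::new(1024))` and the same call with `&FixedSession(1000)` both go through the same trait method.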