Open alamb opened 3 weeks ago
If this seems like a reasonable idea to people I will file tickets to break down the work
cc @andygrove @jayzhan211 @comphead @mustafasrepo for your thoughts
I agree with this direction. But now this seems hard to achieve because SchemaProvider depends on TableProvider and TableProvider depends on SessionState.
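To make the dependency chain above concrete, here is a minimal sketch of how it arises. All types are simplified, hypothetical stand-ins written for this thread and are not DataFusion's real definitions: the point is only that once a table provider method names the concrete SessionState, anything that hands out table providers inherits that dependency.

```rust
// Hypothetical, simplified stand-ins; NOT DataFusion's real definitions.
use std::collections::HashMap;
use std::sync::Arc;

/// Concrete session type that lives in the "core" crate.
struct SessionState {
    batch_size: usize,
}

/// The scan method names the concrete SessionState, so this trait
/// cannot move to a crate that does not also contain SessionState.
trait TableProvider {
    fn scan(&self, state: &SessionState) -> String;
}

/// A schema provider hands out table providers, so it inherits
/// the dependency on SessionState transitively.
trait SchemaProvider {
    fn table(&self, name: &str) -> Option<Arc<dyn TableProvider>>;
}

struct MemTable {
    name: String,
}

impl TableProvider for MemTable {
    fn scan(&self, state: &SessionState) -> String {
        format!("scan {} with batch_size={}", self.name, state.batch_size)
    }
}

struct MemorySchemaProvider {
    tables: HashMap<String, Arc<dyn TableProvider>>,
}

impl SchemaProvider for MemorySchemaProvider {
    fn table(&self, name: &str) -> Option<Arc<dyn TableProvider>> {
        self.tables.get(name).cloned()
    }
}

/// Builds a schema with a single in-memory table named "t".
fn example_schema() -> MemorySchemaProvider {
    let mut tables: HashMap<String, Arc<dyn TableProvider>> = HashMap::new();
    tables.insert("t".to_string(), Arc::new(MemTable { name: "t".to_string() }));
    MemorySchemaProvider { tables }
}

/// Looks up a table and scans it; note that the caller must supply a SessionState.
fn describe_scan(schema: &dyn SchemaProvider, table: &str, state: &SessionState) -> Option<String> {
    schema.table(table).map(|t| t.scan(state))
}
```

In this sketch there is no way to compile SchemaProvider without SessionState being in scope, which is the shape of the coupling described above.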
Thanks @alamb for starting this discussion. If I understood correctly there is a use case for using catalog abstractions/implementation but without datafusion core?
As @lewiszlw correctly mentioned, we have some coupling between the providers and the core. I'm just trying to understand the use case where providers are needed without the core
Is it due to the complexity of ListingTable so it has its own crate? If they have common things then it is better to organize them into one crate. If ListingTable is so different from the others, it is nice to have an independent crate
If I understood correctly there is a use case for using catalog abstractions/implementation but without datafusion core?
In my mind the real use case is to more easily use datafusion without having to bring in all the dependencies of ListingTable (like parquet, avro, json, etc)
So the real use case is getting ListingTable out of the core. But since the catalog API is in the core now, there is no way to get ListingTable out of the core without also first moving the catalog API
Is it due to the complexity of ListingTable so it has its own crate? If they have common things then it is better to organize them into one crate. If ListingTable is so different from the others, it is nice to have an independent crate
I think it is both the complexity of ListingTable and its dependency tree (e.g. parquet-rs and avro and json and object_store and ...)
For use cases like WASM it is quite messy to have the API split up like it currently is
I agree with this direction. But now this seems hard to achieve because SchemaProvider depends on TableProvider and TableProvider depends on SessionState.
I agree @lewiszlw -- well put. I made a first PR to start detangling things here: https://github.com/apache/datafusion/pull/10794 (it just splits SessionState into its own module)
Longer term we would have to figure out where SessionState would live (it still depends on several things in the core crate like datasource::provider and datasource::function) 🤔
Maybe we could look into splitting out datafusion-datasource / datafusion-datasource-parquet / datafusion-datasource-avro, etc -- I don't have time to drive this at the moment but would be interested in helping anyone who did
If I understood correctly there is a use case for using catalog abstractions/implementation but without datafusion core?
In my mind the real use case is to more easily use datafusion without having to bring in all the dependencies of ListingTable (like parquet, avro, json, etc)
So the real use case is getting ListingTable out of the core. But since the catalog API is in the core now, there is no way to get ListingTable out of the core without also first moving the catalog API
Hm... they probably strive to have their own readers/writers, perhaps other than the arrow-rs implementation; that makes sense to me. And yes, if DF stands for extensibility we should make this happen. Not sure how difficult that can be though. We probably need to start by replacing core abstractions with traits instead of implementations to decouple it.
Hm... they probably strive to have their own readers/writers, perhaps other than the arrow-rs implementation; that makes sense to me. And yes, if DF stands for extensibility we should make this happen. Not sure how difficult that can be though. We probably need to start by replacing core abstractions with traits instead of implementations to decouple it.
Yes something like this -- I think most of the traits already exist (e.g. CatalogProvider) but figuring out how to decouple SessionState (which is referred to by CatalogProvider) is the trickiest bit I think
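One possible shape for that decoupling, sketched under loud assumptions: the Session trait, its batch_size method, TinySession, NamedTable and SingleTableCatalog are all hypothetical names invented for illustration, and CatalogProvider here is flattened to return tables directly, which is a simplification. The idea is that the catalog and table traits only see a small trait, while the concrete SessionState stays in the core crate and implements it.

```rust
// Hypothetical sketch of "traits instead of implementations";
// none of these signatures are DataFusion's real API.
use std::sync::Arc;

/// Minimal interface the catalog and table traits would need from a session.
trait Session {
    fn batch_size(&self) -> usize;
}

/// Depends only on the Session trait, not on a concrete session type.
trait TableProvider {
    fn scan(&self, session: &dyn Session) -> String;
}

/// Simplified: returns tables directly to keep the example short.
trait CatalogProvider {
    fn table(&self, name: &str) -> Option<Arc<dyn TableProvider>>;
}

// --- would stay in the core crate ---
struct SessionState {
    batch_size: usize,
}

impl Session for SessionState {
    fn batch_size(&self) -> usize {
        self.batch_size
    }
}

// --- a lightweight alternative session, e.g. for a constrained target ---
struct TinySession;

impl Session for TinySession {
    fn batch_size(&self) -> usize {
        64
    }
}

struct NamedTable(String);

impl TableProvider for NamedTable {
    fn scan(&self, session: &dyn Session) -> String {
        format!("scan {} (batch_size={})", self.0, session.batch_size())
    }
}

/// Catalog that knows exactly one table.
struct SingleTableCatalog {
    name: String,
    table: Arc<dyn TableProvider>,
}

impl CatalogProvider for SingleTableCatalog {
    fn table(&self, name: &str) -> Option<Arc<dyn TableProvider>> {
        if name == self.name {
            Some(Arc::clone(&self.table))
        } else {
            None
        }
    }
}
```

With this shape the same NamedTable can be scanned with either SessionState or TinySession, so the traits could in principle live in a crate that never mentions SessionState. Which methods such a trait would actually need is the open question.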
Is your feature request related to a problem or challenge?
As @goldmedal started trying to move the DynamicFileProvider so others could use it in https://github.com/apache/datafusion/pull/10745 I think it is clear that there is not a good way to add additional catalog support in the core without everything being intertwined.
Thus I think we should try and extract the different catalog providers out of datafusion core so it is easier
Describe the solution you'd like
I suggest the following final layout:
CatalogProvider, SchemaProvider, etc in a new crate datafusion-catalog (since these traits rely on table provider, etc I think this can't be in datafusion-common or datafusion-expr)
Memory* providers are in datafusion-catalog
InformationSchema providers are in datafusion-catalog
DynamicFileCatalog in datafusion-catalog
datafusion-catalog-listing
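As a rough illustration of the dependency direction this layout implies, the sketch below models each crate as a Rust module. The module names mirror the proposed crate names, but the contents are hypothetical: the single-method SchemaProvider, the register helper and FileBackedSchema are invented for the example and are not existing DataFusion items.

```rust
// Each module stands in for one crate; contents are hypothetical.
mod datafusion_catalog {
    use std::collections::BTreeSet;

    /// Trait-only API with no file-format dependencies.
    pub trait SchemaProvider {
        fn table_names(&self) -> Vec<String>;
    }

    /// In-memory provider that ships alongside the traits.
    #[derive(Default)]
    pub struct MemorySchemaProvider {
        tables: BTreeSet<String>,
    }

    impl MemorySchemaProvider {
        pub fn register(&mut self, name: &str) {
            self.tables.insert(name.to_string());
        }
    }

    impl SchemaProvider for MemorySchemaProvider {
        fn table_names(&self) -> Vec<String> {
            // BTreeSet iterates in sorted order.
            self.tables.iter().cloned().collect()
        }
    }
}

mod datafusion_catalog_listing {
    // Depends on the trait crate, never the other way around.
    use super::datafusion_catalog::SchemaProvider;

    /// Hypothetical file-backed provider; a real one would be where the
    /// heavy dependencies (object_store, parquet, ...) come in.
    pub struct FileBackedSchema {
        pub paths: Vec<String>,
    }

    impl SchemaProvider for FileBackedSchema {
        fn table_names(&self) -> Vec<String> {
            // Use the last path segment as the table name.
            self.paths
                .iter()
                .map(|p| p.rsplit('/').next().unwrap_or(p.as_str()).to_string())
                .collect()
        }
    }
}

use datafusion_catalog::{MemorySchemaProvider, SchemaProvider};

/// Code written against the trait works with providers from either module.
fn list_tables(schema: &dyn SchemaProvider) -> Vec<String> {
    schema.table_names()
}
```

A user who only needs in-memory tables would depend on the first module alone and never compile the file-backed one, which is the motivation given earlier in the thread.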
Describe alternatives you've considered
No response
Additional context
No response