apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.22k stars 1.18k forks source link

Minimize the dependency on `SessionState` #11420

Open jayzhan211 opened 3 months ago

jayzhan211 commented 3 months ago

Is your feature request related to a problem or challenge?

There are many functions in datafusion-core that take SessionState as arguments but only actually rely on portion of them. This add the additional dependency that is not necessary, therefore blocking us from extracting module out of core #10782.

For example, If we want to pull CatalogProvider out of core, we need to pull out TableProvider first. But because it has scan function that takes SessionState which contains CatalogProviderList therefore there is a circular dependency. Similar issues are already mentioned in https://github.com/apache/datafusion/issues/11182

Describe the solution you'd like

I think we need to redesign those functions that take SessionState and minimize the dependencies for them.

Given one of the scan function here, we can see that we only need state.config_options().explain and state.execution_props() instead of the whole SessionState https://github.com/apache/datafusion/blob/4bed04e4e312a0b125306944aee94a93c2ff6c4f/datafusion/core/src/datasource/memory.rs#L207-L245

In this case, we can create TableProviderConext that encapsulates a subset of the information from SessionState.

pub struct TableProviderConext {
    explain_options: ExplainOptions,
    executionProps: ExecutionProps // maybe with reference to avoid copy
}

The same idea applies to PhysicalPlanner, we usually just need PhysicalOptimizeRules or other information about plan. https://github.com/apache/datafusion/blob/4bed04e4e312a0b125306944aee94a93c2ff6c4f/datafusion/core/src/physical_planner.rs#L364-L385

And, create_initial_plan in DefaultPhysicalPlanner, ExecutionOptions is all we need, nothing else. https://github.com/apache/datafusion/blob/4bed04e4e312a0b125306944aee94a93c2ff6c4f/datafusion/core/src/physical_planner.rs#L582-L585

Describe alternatives you've considered

No response

Additional context

No response

alamb commented 3 months ago

In theory I think you are right.

However, I am worried about removing SessionState from the scan method as it is so widely used and has everything people need. In fact here is at least one request to use SessionState more: https://github.com/apache/datafusion/issues/11193 from @cisaacson

Here is one alternate proposal https://github.com/apache/datafusion/issues/10782#issuecomment-2226248126 (basically treat SessionState as part of the catalog API).

ANother potentially hacky idea is to do something like pass SessionState as an dyn Any --

 async fn scan( 
     &self, 
     state: &Any, 
     projection: Option<&Vec<usize>>, 
     _filters: &[Expr], 
     _limit: Option<usize>, 
 ) -> Result<Arc<dyn ExecutionPlan>> { 

Which would break the explicit dependency but implementors of scan could still get it by downcasting

let state = state.as_any().downcast_ref::<SessionState>().unwrap();

🤔

cisaacson commented 3 months ago

@alamb I agree, removing it or using Any would make lots of things challenging. The idea of state for a SessionContext that is accessible from a variety of places is important.

The easiest way to address this is to move SessionState to a smaller sub-crate like common-runtime or even common-traits (although the latter would be complicated as many traits depend on structs and other types).

I see why it is used in so many places, and I do support the idea that it is important to do so.

Perhaps we could just identify all truly common types like this and move them into a sub-crate, that could be done without too much effort. Such a change so incorporate other types of common objects too. In this way anytime someone wants to break something out of core the most likely common dependencies would be available in the smaller sub-crate.