Open jayzhan211 opened 3 months ago
In theory I think you are right.
However, I am worried about removing SessionState from the scan
method as it is so widely used and has everything people need. In fact here is at least one request to use SessionState more: https://github.com/apache/datafusion/issues/11193 from @cisaacson
Here is one alternate proposal https://github.com/apache/datafusion/issues/10782#issuecomment-2226248126 (basically treat SessionState as part of the catalog API).
ANother potentially hacky idea is to do something like pass SessionState as an dyn Any
--
async fn scan(
&self,
state: &Any,
projection: Option<&Vec<usize>>,
_filters: &[Expr],
_limit: Option<usize>,
) -> Result<Arc<dyn ExecutionPlan>> {
Which would break the explicit dependency but implementors of scan
could still get it by downcasting
let state = state.as_any().downcast_ref::<SessionState>().unwrap();
🤔
@alamb I agree, removing it or using Any
would make lots of things challenging. The idea of state for a SessionContext that is accessible from a variety of places is important.
The easiest way to address this is to move SessionState
to a smaller sub-crate like common-runtime
or even common-traits
(although the latter would be complicated as many traits depend on structs and other types).
I see why it is used in so many places, and I do support the idea that it is important to do so.
Perhaps we could just identify all truly common types like this and move them into a sub-crate, that could be done without too much effort. Such a change so incorporate other types of common objects too. In this way anytime someone wants to break something out of core
the most likely common dependencies would be available in the smaller sub-crate.
Is your feature request related to a problem or challenge?
There are many functions in
datafusion-core
that takeSessionState
as arguments but only actually rely on portion of them. This add the additional dependency that is not necessary, therefore blocking us from extracting module out of core #10782.For example, If we want to pull
CatalogProvider
out of core, we need to pull outTableProvider
first. But because it hasscan
function that takesSessionState
which containsCatalogProviderList
therefore there is a circular dependency. Similar issues are already mentioned in https://github.com/apache/datafusion/issues/11182Describe the solution you'd like
I think we need to redesign those functions that take
SessionState
and minimize the dependencies for them.Given one of the
scan
function here, we can see that we only needstate.config_options().explain
andstate.execution_props()
instead of the wholeSessionState
https://github.com/apache/datafusion/blob/4bed04e4e312a0b125306944aee94a93c2ff6c4f/datafusion/core/src/datasource/memory.rs#L207-L245In this case, we can create
TableProviderConext
that encapsulates a subset of the information fromSessionState
.The same idea applies to
PhysicalPlanner
, we usually just needPhysicalOptimizeRules
or other information about plan. https://github.com/apache/datafusion/blob/4bed04e4e312a0b125306944aee94a93c2ff6c4f/datafusion/core/src/physical_planner.rs#L364-L385And,
create_initial_plan
inDefaultPhysicalPlanner
,ExecutionOptions
is all we need, nothing else. https://github.com/apache/datafusion/blob/4bed04e4e312a0b125306944aee94a93c2ff6c4f/datafusion/core/src/physical_planner.rs#L582-L585Describe alternatives you've considered
No response
Additional context
No response