apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.04k stars 1.14k forks source link

make `ExecutionPlan` self contained. (allow for computed properties `PlanProperties`) #10018

Open universalmind303 opened 5 months ago

universalmind303 commented 5 months ago

Is your feature request related to a problem or challenge?

if you have some structs that you want to loosely couple to datafusion, it is now impossible.

for example with <37

#[derive(Debug, Clone)]
pub struct MyStruct {
    projection: Arc<Schema>,
    predicate_projection: Arc<Schema>,
    // ...
}

and then you can just implement the trait as one would expect

impl ExecutionPlan for MyStruct {

    fn output_partitioning(&self) -> Partitioning {
        Partitioning::UnknownPartitioning(1)
    }

    fn output_ordering(&self) -> Option<&[datafusion::physical_expr::PhysicalSortExpr]> {
        None
    }
 // ...   
}

But with datafusion 37, the datafusion specific components are now leaked into the outer struct, and force the user to modify the struct to contain PlanProperties.

impl ExecutionPlan for MyStruct {
  fn properties(&self) -> &PlanProperties {
    // no possible way to create `&PlanProperties`. 
    // the only option is to add it as a field on `MyStruct`
    // which breaks the encapsulation of `ExecutionPlan`
  }
}

This is especially amplified if you want to feature flag datafusion specific functionality.

datafusion <37

#[derive(Debug, Clone)]
pub struct MyStruct {
    // ...
}

impl MyStruct {
  pub fn new() -> Self {
     Self {
       // ..
     }
  }
}

#[cfg(feature = "datafusion")]
impl ExecutionPlan for MyStruct {
  // ...
}

datafusion 37


#[derive(Debug, Clone)]
pub struct MyStruct {
    // ...
    #[cfg(feature = "datafusion")]
    plan_properties: PlanProperties
}

impl MyStruct {
  #[cfg(feature = "datafusion")]
  pub fn new() -> Self {
     let properties = make_properties_somehow():
     Self {
       // ..
       plan_properties: properties
     }
  }
  #[cfg(not(feature = "datafusion"))]
    pub fn new() -> Self {
     Self {
       // ..
     }
  }
}

#[cfg(feature = "datafusion)]
impl ExecutionPlan for MyStruct {
  // ...
}

Describe the solution you'd like

modify ExecutionPlan to not have any methods that return a reference. Specifically ExecutionPlan::properties

If a reference is wanted for performance reasons, we should instead use Cow<PlanProperties>.

Describe alternatives you've considered

NA

Additional context

No response

alamb commented 5 months ago

cc @mustafasrepo for your comments

alamb commented 5 months ago

Thank you for this idea @universalmind303

Cow<PlanProperties> seems like a good choice to me.

Or else switching the trait back to only expose fields?

We found it awkward to implement the new API in InfluxDB 3.0 as well (though we did make it work)

Putting PlanProperties into the ExecutionPlan was to done to avoid recomputing the same (potentially expensive) properties over and over again as part of https://github.com/apache/arrow-datafusion/pull/9346

mustafasrepo commented 5 months ago

I think returning Cow<PlanProperties> makes sense. In this way, struct won't need to implement a new member as you indicated. We might need to update methods of the below

impl ExecutionPlanProperties for Arc<dyn ExecutionPlan> {
    fn output_partitioning(&self) -> &Partitioning {
        self.properties().output_partitioning()
    }

    fn execution_mode(&self) -> ExecutionMode {
        self.properties().execution_mode()
    }

    fn output_ordering(&self) -> Option<&[PhysicalSortExpr]> {
        self.properties().output_ordering()
    }

    fn equivalence_properties(&self) -> &EquivalenceProperties {
        self.properties().equivalence_properties()
    }
}

to work with Cow<PlanProperties> though.