apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.21k stars 957 forks source link

Add `simplify` method to aggregate function #10354

Closed milenkovicm closed 6 days ago

milenkovicm commented 2 weeks ago

Which issue does this PR close?

Closes #9526.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

yes, added new test and example

Are there any user-facing changes?

no, changes are backward compatible

milenkovicm commented 2 weeks ago

the change is quite similar to #9906 there are quite a lot optional/unused parameters in the method call, but i personally find it easier to understand as it is equivalent to AggregateFunction signature.

I wonder if ExprSimplifyResult::<T>::Original(T) where T can be something else than args (tuples included) would be a backward compatible change. This way we could send and return all required parameters without copy ... but this is discussion for different time

milenkovicm commented 2 weeks ago

jsut thinking aloud, maybe if we change:

pub enum ExprSimplifyResult {
    /// The function call was simplified to an entirely new Expr
    Simplified(Expr),
    /// the function call could not be simplified, and the arguments
    /// are return unmodified.
    Original(Vec<Expr>),
}

to:

pub enum ExprSimplifyResult<T> {
    /// The function call was simplified to an entirely new Expr
    Simplified(Expr),
    /// the function call could not be simplified, and the arguments
    /// are return unmodified.
    Original(T),
}

simplify method would change from:

   pub fn simplify(
        &self,
        args: Vec<Expr>,
        distinct: &bool,
        filter: &Option<Box<Expr>>,
        order_by: &Option<Vec<Expr>>,
        null_treatment: &Option<NullTreatment>,
        info: &dyn SimplifyInfo,
    ) -> Result<ExprSimplifyResult> {

to:

   pub fn simplify(
        &self,
        args: Vec<Expr>,
        distinct: bool,
        filter: Option<Box<Expr>>,
        order_by: Option<Vec<Expr>>,
        null_treatment: Option<NullTreatment>,
        info: &dyn SimplifyInfo,
    ) -> Result<ExprSimplifyResult<(Vec<Expr>, bool, Option<Box<Expr>> ...)>> {
}

so in both cases, when we create replacement function or we re-assemble original function we would not need to clone parameters.

maybe hand full of changes, from -> Result<ExprSimplifyResult> to -> Result<ExprSimplifyResult<Vec<Expr>>>

Downside is that we would have to re-create expression if no simplification, plus change would be backward incompatible

jayzhan211 commented 2 weeks ago
pub enum ExprSimplifyResult<T> {
    /// The function call was simplified to an entirely new Expr
    Simplified(Expr),
    /// the function call could not be simplified, and the arguments
    /// are return unmodified.
    Original(T),
}

I think it is a good idea.

alamb commented 1 week ago

jsut thinking aloud, maybe if we change:

I think if we are going to change the signature, I think we should use Transformed directly (and it is already parameterized)

milenkovicm commented 1 week ago

Another option could be to make a simplify return optional closure

 fn simplify(&self ) -> Option<Box<dyn Fn(AggregateFunction) -> datafusion_common::tree_node::Transformed<Expr>>> {
        None
 }

in this case expr_simplifier part would become:

Expr::AggregateFunction(AggregateFunction {func_def: AggregateFunctionDefinition::UDF(ref udaf), ..}) => {
  match (udaf.simplify(), expr) {
    (Some(simplify), Expr::AggregateFunction(a)) => simplify(a),
    (_, e) => Transformed::no(e),
  }
}

this solution:

other issues:

jayzhan211 commented 1 week ago

Another option is to introduce has_simplify so we can simplify the code to

            Expr::AggregateFunction(AggregateFunction {
                func_def: AggregateFunctionDefinition::UDF(udaf),
                args: AggregateFunctionArgs,
            }) if udaf.has_simplify() => {
                Transformed::yes(udaf.simplify()?)
            }

            fn simplify() -> Result<Expr>

The downside is that the user needs to both define simplify and set has_simplify to true, but I think it is much simpler than the optional closure

milenkovicm commented 1 week ago

The downside is that the user needs to both define simplify and set has_simplify to true, but I think it is much simpler than the optional closure

I personally try to avoid situations like this when creating API, this should be documented that both functions have to be implemented plus it introduces more corner cases.

Anyway, closure was a brain dump from yesterday, after a good sleep I don't think we need it, will try to elaborate in next comment

milenkovicm commented 1 week ago

Option 1 - Original expression as a parameter

    fn simplify(
        &self,
        expr: Expr, 
    ) -> Result<datafusion_common::tree_node::Transformed<Expr>> {
        // we know it'll always be AggregateFunction but we have to match on it 
        // which is not terribly hard but it makes api a bit odd 
        if let Expr::AggregateFunction(agg) = expr {
            ...
            Ok(datafusion_common::tree_node::Transformed::yes(...))
        } else {
            // no re-creation of expression if not used 
            Ok(datafusion_common::tree_node::Transformed::no(expr))
        }
    }

default implementation is just:

    fn simplify(
        &self,
        expr: Expr, 
    ) -> Result<datafusion_common::tree_node::Transformed<Expr>> {
            // no re-creation of expression if not used 
            Ok(datafusion_common::tree_node::Transformed::no(expr))
    }

Option 2 - AggregationFunction as a parameter

    fn simplify(
        &self,
        agg: AggregateFunction, 
    ) -> Result<datafusion_common::tree_node::Transformed<Expr>> {
        // we know its aggregate function, no need to do extra matching, but we need to re-create expression 
        // issue user can return `no` with new expression in it 
        Ok(datafusion_common::tree_node::Transformed::no(Expr::AggregateFunction(agg)))
    }

Option 3 - AggregationFunction as a parameter with parametrised ExprSimplifyResult

    fn simplify(
        &self,
        agg: AggregateFunction, 
    ) -> Result<ExprSimplifyResult<AggregateFunction>> {
       // user cant return `no` with new expression 
       // simplifier will re-crete expression in case original has been returned 
        Ok(ExprSimplifyResult::Original(agg))        
   }

IMHO, I prefer option number 3, which would involve parametrizing ExprSimplifyResult, second best is option number 2. wdyt @jayzhan211, @alamb

jayzhan211 commented 1 week ago

Option 1 - Original expression as a parameter

    fn simplify(
        &self,
        expr: Expr, 
    ) -> Result<datafusion_common::tree_node::Transformed<Expr>> {
        // we know it'll always be AggregateFunction but we have to match on it 
        // which is not terribly hard but it makes api a bit odd 
        if let Expr::AggregateFunction(agg) = expr {
            ...
            Ok(datafusion_common::tree_node::Transformed::yes(...))
        } else {
            // no re-creation of expression if not used 
            Ok(datafusion_common::tree_node::Transformed::no(expr))
        }
    }

default implementation is just:

    fn simplify(
        &self,
        expr: Expr, 
    ) -> Result<datafusion_common::tree_node::Transformed<Expr>> {
            // no re-creation of expression if not used 
            Ok(datafusion_common::tree_node::Transformed::no(expr))
    }

Option 2 - AggregationFunction as a parameter

    fn simplify(
        &self,
        agg: AggregateFunction, 
    ) -> Result<datafusion_common::tree_node::Transformed<Expr>> {
        // we know its aggregate function, no need to do extra matching, but we need to re-create expression 
        // issue user can return `no` with new expression in it 
        Ok(datafusion_common::tree_node::Transformed::no(Expr::AggregateFunction(agg)))
    }

Option 3 - AggregationFunction as a parameter with parametrised ExprSimplifyResult

    fn simplify(
        &self,
        agg: AggregateFunction, 
    ) -> Result<ExprSimplifyResult<AggregateFunction>> {
       // user cant return `no` with new expression 
       // simplifier will re-crete expression in case original has been returned 
        Ok(ExprSimplifyResult::Original(agg))        
   }

IMHO, I prefer option number 3, which would involve parametrizing ExprSimplifyResult, second best is option number 2. wdyt @jayzhan211, @alamb

Given these 3, I prefer 2, because I think re-create expression is not a high cost that we should avoid. And we can do even further,

If user does not implement simplify, it is equivalent to Transformed::no or ExprSimplifyResult::Original, so we can just have Result<Expr>

fn simplify(
        &self,
        agg: AggregateFunction, 
    ) -> Result<Expr> {
        Ok(Expr::AggregateFunction(agg))
    }
milenkovicm commented 1 week ago

I apologise for making noise, but it looks like all of the non closure options have issue with borrowing:

error[E0505]: cannot move out of `expr.0` because it is borrowed
    --> datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:1391:48
     |
1388 |                 func_def: AggregateFunctionDefinition::UDF(ref udaf),
     |                                                            -------- borrow of `expr.0.func_def.0` occurs here
...
1391 |                 if let Expr::AggregateFunction(aggregate_function) = expr {
     |                                                ^^^^^^^^^^^^^^^^^^ move out of `expr.0` occurs here
1392 |                     udaf.simplify(aggregate_function, info)?
     |                     ---- borrow later used here

It looks like AggregateArg might be back on the table, which would involve cloning udf

milenkovicm commented 1 week ago

@jayzhan211 and @alamb whenever you get chance please have a look, IMHO I find proposal with closure to tick all the boxes, but this one is definitely simpler.

jayzhan211 commented 1 week ago

@jayzhan211 and @alamb whenever you get chance please have a look, IMHO I find proposal with closure to tick all the boxes, but this one is definitely simpler.

After playing around, I feel like maybe optional closure is our answer 🤔

fn simplify(
        &self,
        _info: &dyn SimplifyInfo,
    ) -> Option<Box<dyn Fn(AggregateArgs) -> Expr>> {}
milenkovicm commented 1 week ago

After playing around, I feel like maybe optional closure is our answer 🤔

Damn! I should have created a branch with that code 😀, if we have consensus on that direction I'll redo it

jayzhan211 commented 1 week ago

After playing around, I feel like maybe optional closure is our answer 🤔

Damn! I should have created a branch with that code 😀, if we have consensus on that direction I'll redo it

My final answer for today is

// UDAF
fn simplify(
        &self,
        args: AggregateArgs,
        _info: &dyn SimplifyInfo,
    ) -> Option<Box<dyn Fn(AggregateArgs) -> Result<Expr>>> {
// UDF, not necessary but for consistency
fn simplify(
        &self,
        args: Vec<Expr>,
        _info: &dyn SimplifyInfo,
    ) -> Option<Box<dyn Fn(Vec<Expr>) -> Result<Expr>>> {

Go with optional closure!

milenkovicm commented 1 week ago

My final answer for today is

fn simplify(
        &self,
        args: AggregateArgs,
        _info: &dyn SimplifyInfo,
    ) -> Option<Box<dyn Fn(AggregateArgs) -> Result<Expr>>> {

Go with optional closure!

you mean

// UDF
fn simplify(
        &self,
    ) -> Option<Box<dyn Fn(AggregateFunction, &dyn SimplifyInfo) -> Result<Expr>>> {

or

// UDF
fn simplify(
        &self,
        info: &dyn SimplifyInfo
    ) -> Option<Box<dyn Fn(AggregateFunction) -> Result<Expr>>> {
jayzhan211 commented 1 week ago

It should be this.

// UDF
fn simplify(
        &self,
    ) -> Option<Box<dyn Fn(AggregateFunction, &dyn SimplifyInfo) -> Result<Expr>>> {

The reason for optional closure is that I assume we need to return Expr for simplified cases. If we expect AggregateArgs, then we might not need closure.

milenkovicm commented 1 week ago

We're on the same page @jayzhan211

jayzhan211 commented 1 week ago

We're on the same page @jayzhan211

If we don't need Expr for simplified UDAF, than we can have

pub fn simplify(
        &self,
        args: AggregateArgs,
        info: &dyn SimplifyInfo,
    ) -> Result<Transformed<AggregateArgs>> {}

I think the next problem is whether we need Expr or just AggregateArgs. The former can let us return the UDAF with a different function. But for this kind of function rewrite, we can handle it in FuncitonWrite. You can take ArrayFunctionRewriter for example.

@milenkovicm

milenkovicm commented 1 week ago

We're on the same page @jayzhan211

If we don't need Expr for simplified UDAF, than we can have

pub fn simplify(
        &self,
        args: AggregateArgs,
        info: &dyn SimplifyInfo,
    ) -> Result<Transformed<AggregateArgs>> {}

I think the next problem is whether we need Expr or just AggregateArgs. The former can let us return the UDAF with a different function. But for this kind of function rewrite, we can handle it in FuncitonWrite. You can take ArrayFunctionRewriter for example.

@milenkovicm

I think we need Expr, this way we can replace function with something else

jayzhan211 commented 1 week ago

We're on the same page @jayzhan211

If we don't need Expr for simplified UDAF, than we can have

pub fn simplify(
        &self,
        args: AggregateArgs,
        info: &dyn SimplifyInfo,
    ) -> Result<Transformed<AggregateArgs>> {}

I think the next problem is whether we need Expr or just AggregateArgs. The former can let us return the UDAF with a different function. But for this kind of function rewrite, we can handle it in FuncitonWrite. You can take ArrayFunctionRewriter for example. @milenkovicm

I think we need Expr, this way we can replace function with something else

I agree.

jayzhan211 commented 1 week ago

You can define closure in datafusion/expr/src/function.rs

alamb commented 6 days ago

Thanks again @jayzhan211 and @milenkovicm 🙏