Add `simplify` method to aggregate function

milenkovicm commented 2 weeks ago

Which issue does this PR close?

Closes #9526.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

yes, added new test and example

Are there any user-facing changes?

no, changes are backward compatible

milenkovicm commented 2 weeks ago

the change is quite similar to #9906 there are quite a lot optional/unused parameters in the method call, but i personally find it easier to understand as it is equivalent to AggregateFunction signature.

I wonder if ExprSimplifyResult::<T>::Original(T) where T can be something else than args (tuples included) would be a backward compatible change. This way we could send and return all required parameters without copy ... but this is discussion for different time

milenkovicm commented 2 weeks ago

jsut thinking aloud, maybe if we change:

pub enum ExprSimplifyResult {
    /// The function call was simplified to an entirely new Expr
    Simplified(Expr),
    /// the function call could not be simplified, and the arguments
    /// are return unmodified.
    Original(Vec<Expr>),
}

to:

pub enum ExprSimplifyResult<T> {
    /// The function call was simplified to an entirely new Expr
    Simplified(Expr),
    /// the function call could not be simplified, and the arguments
    /// are return unmodified.
    Original(T),
}

simplify method would change from:

   pub fn simplify(
        &self,
        args: Vec<Expr>,
        distinct: &bool,
        filter: &Option<Box<Expr>>,
        order_by: &Option<Vec<Expr>>,
        null_treatment: &Option<NullTreatment>,
        info: &dyn SimplifyInfo,
    ) -> Result<ExprSimplifyResult> {

to:

   pub fn simplify(
        &self,
        args: Vec<Expr>,
        distinct: bool,
        filter: Option<Box<Expr>>,
        order_by: Option<Vec<Expr>>,
        null_treatment: Option<NullTreatment>,
        info: &dyn SimplifyInfo,
    ) -> Result<ExprSimplifyResult<(Vec<Expr>, bool, Option<Box<Expr>> ...)>> {
}

so in both cases, when we create replacement function or we re-assemble original function we would not need to clone parameters.

maybe hand full of changes, from -> Result<ExprSimplifyResult> to -> Result<ExprSimplifyResult<Vec<Expr>>>

Downside is that we would have to re-create expression if no simplification, plus change would be backward incompatible

jayzhan211 commented 2 weeks ago

pub enum ExprSimplifyResult<T> {
    /// The function call was simplified to an entirely new Expr
    Simplified(Expr),
    /// the function call could not be simplified, and the arguments
    /// are return unmodified.
    Original(T),
}

I think it is a good idea.

alamb commented 1 week ago

jsut thinking aloud, maybe if we change:

I think if we are going to change the signature, I think we should use Transformed directly (and it is already parameterized)

milenkovicm commented 1 week ago

Another option could be to make a simplify return optional closure

 fn simplify(&self ) -> Option<Box<dyn Fn(AggregateFunction) -> datafusion_common::tree_node::Transformed<Expr>>> {
        None
 }

in this case expr_simplifier part would become:

Expr::AggregateFunction(AggregateFunction {func_def: AggregateFunctionDefinition::UDF(ref udaf), ..}) => {
  match (udaf.simplify(), expr) {
    (Some(simplify), Expr::AggregateFunction(a)) => simplify(a),
    (_, e) => Transformed::no(e),
  }
}

this solution:

avoids cloning,
avoids decomposition/composition of function if not needed
avoids adding new struct
but, it has dynamic dispatch simplify function call (in case simplify is defined)
and a bit odd interface

other issues:

SimplifyInfo is missing from signature, can be added
Transformed::no does not make sense in this context as original expression has not been propagated (should we use actual expression instead of AggregateFunction or remove Transformed from return type
closure should return result with or without Transformed

jayzhan211 commented 1 week ago

Another option is to introduce has_simplify so we can simplify the code to

            Expr::AggregateFunction(AggregateFunction {
                func_def: AggregateFunctionDefinition::UDF(udaf),
                args: AggregateFunctionArgs,
            }) if udaf.has_simplify() => {
                Transformed::yes(udaf.simplify()?)
            }

            fn simplify() -> Result<Expr>

The downside is that the user needs to both define simplify and set has_simplify to true, but I think it is much simpler than the optional closure

milenkovicm commented 1 week ago

The downside is that the user needs to both define simplify and set has_simplify to true, but I think it is much simpler than the optional closure

I personally try to avoid situations like this when creating API, this should be documented that both functions have to be implemented plus it introduces more corner cases.

Anyway, closure was a brain dump from yesterday, after a good sleep I don't think we need it, will try to elaborate in next comment

milenkovicm commented 1 week ago

Option 1 - Original expression as a parameter

    fn simplify(
        &self,
        expr: Expr, 
    ) -> Result<datafusion_common::tree_node::Transformed<Expr>> {
        // we know it'll always be AggregateFunction but we have to match on it 
        // which is not terribly hard but it makes api a bit odd 
        if let Expr::AggregateFunction(agg) = expr {
            ...
            Ok(datafusion_common::tree_node::Transformed::yes(...))
        } else {
            // no re-creation of expression if not used 
            Ok(datafusion_common::tree_node::Transformed::no(expr))
        }
    }

default implementation is just:

    fn simplify(
        &self,
        expr: Expr, 
    ) -> Result<datafusion_common::tree_node::Transformed<Expr>> {
            // no re-creation of expression if not used 
            Ok(datafusion_common::tree_node::Transformed::no(expr))
    }

Option 2 - AggregationFunction as a parameter

    fn simplify(
        &self,
        agg: AggregateFunction, 
    ) -> Result<datafusion_common::tree_node::Transformed<Expr>> {
        // we know its aggregate function, no need to do extra matching, but we need to re-create expression 
        // issue user can return `no` with new expression in it 
        Ok(datafusion_common::tree_node::Transformed::no(Expr::AggregateFunction(agg)))
    }

Option 3 - AggregationFunction as a parameter with parametrised `ExprSimplifyResult`

    fn simplify(
        &self,
        agg: AggregateFunction, 
    ) -> Result<ExprSimplifyResult<AggregateFunction>> {
       // user cant return `no` with new expression 
       // simplifier will re-crete expression in case original has been returned 
        Ok(ExprSimplifyResult::Original(agg))        
   }

IMHO, I prefer option number 3, which would involve parametrizing ExprSimplifyResult, second best is option number 2. wdyt @jayzhan211, @alamb

jayzhan211 commented 1 week ago

Option 1 - Original expression as a parameter

    fn simplify(
        &self,
        expr: Expr, 
    ) -> Result<datafusion_common::tree_node::Transformed<Expr>> {
        // we know it'll always be AggregateFunction but we have to match on it 
        // which is not terribly hard but it makes api a bit odd 
        if let Expr::AggregateFunction(agg) = expr {
            ...
            Ok(datafusion_common::tree_node::Transformed::yes(...))
        } else {
            // no re-creation of expression if not used 
            Ok(datafusion_common::tree_node::Transformed::no(expr))
        }
    }

default implementation is just:

    fn simplify(
        &self,
        expr: Expr, 
    ) -> Result<datafusion_common::tree_node::Transformed<Expr>> {
            // no re-creation of expression if not used 
            Ok(datafusion_common::tree_node::Transformed::no(expr))
    }

Option 2 - AggregationFunction as a parameter

    fn simplify(
        &self,
        agg: AggregateFunction, 
    ) -> Result<datafusion_common::tree_node::Transformed<Expr>> {
        // we know its aggregate function, no need to do extra matching, but we need to re-create expression 
        // issue user can return `no` with new expression in it 
        Ok(datafusion_common::tree_node::Transformed::no(Expr::AggregateFunction(agg)))
    }

Option 3 - AggregationFunction as a parameter with parametrised `ExprSimplifyResult`

    fn simplify(
        &self,
        agg: AggregateFunction, 
    ) -> Result<ExprSimplifyResult<AggregateFunction>> {
       // user cant return `no` with new expression 
       // simplifier will re-crete expression in case original has been returned 
        Ok(ExprSimplifyResult::Original(agg))        
   }

IMHO, I prefer option number 3, which would involve parametrizing ExprSimplifyResult, second best is option number 2. wdyt @jayzhan211, @alamb

Given these 3, I prefer 2, because I think re-create expression is not a high cost that we should avoid. And we can do even further,

If user does not implement simplify, it is equivalent to Transformed::no or ExprSimplifyResult::Original, so we can just have Result<Expr>

fn simplify(
        &self,
        agg: AggregateFunction, 
    ) -> Result<Expr> {
        Ok(Expr::AggregateFunction(agg))
    }

milenkovicm commented 1 week ago

I apologise for making noise, but it looks like all of the non closure options have issue with borrowing:

error[E0505]: cannot move out of `expr.0` because it is borrowed
    --> datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs:1391:48
     |
1388 |                 func_def: AggregateFunctionDefinition::UDF(ref udaf),
     |                                                            -------- borrow of `expr.0.func_def.0` occurs here
...
1391 |                 if let Expr::AggregateFunction(aggregate_function) = expr {
     |                                                ^^^^^^^^^^^^^^^^^^ move out of `expr.0` occurs here
1392 |                     udaf.simplify(aggregate_function, info)?
     |                     ---- borrow later used here

It looks like AggregateArg might be back on the table, which would involve cloning udf

milenkovicm commented 1 week ago

@jayzhan211 and @alamb whenever you get chance please have a look, IMHO I find proposal with closure to tick all the boxes, but this one is definitely simpler.

jayzhan211 commented 1 week ago

@jayzhan211 and @alamb whenever you get chance please have a look, IMHO I find proposal with closure to tick all the boxes, but this one is definitely simpler.

After playing around, I feel like maybe optional closure is our answer 🤔

fn simplify(
        &self,
        _info: &dyn SimplifyInfo,
    ) -> Option<Box<dyn Fn(AggregateArgs) -> Expr>> {}

milenkovicm commented 1 week ago

After playing around, I feel like maybe optional closure is our answer 🤔

Damn! I should have created a branch with that code 😀, if we have consensus on that direction I'll redo it

jayzhan211 commented 1 week ago

After playing around, I feel like maybe optional closure is our answer 🤔

Damn! I should have created a branch with that code 😀, if we have consensus on that direction I'll redo it

My final answer for today is

// UDAF
fn simplify(
        &self,
        args: AggregateArgs,
        _info: &dyn SimplifyInfo,
    ) -> Option<Box<dyn Fn(AggregateArgs) -> Result<Expr>>> {

// UDF, not necessary but for consistency
fn simplify(
        &self,
        args: Vec<Expr>,
        _info: &dyn SimplifyInfo,
    ) -> Option<Box<dyn Fn(Vec<Expr>) -> Result<Expr>>> {

Go with optional closure!

milenkovicm commented 1 week ago

My final answer for today is

fn simplify(
        &self,
        args: AggregateArgs,
        _info: &dyn SimplifyInfo,
    ) -> Option<Box<dyn Fn(AggregateArgs) -> Result<Expr>>> {

Go with optional closure!

you mean

// UDF
fn simplify(
        &self,
    ) -> Option<Box<dyn Fn(AggregateFunction, &dyn SimplifyInfo) -> Result<Expr>>> {

or

// UDF
fn simplify(
        &self,
        info: &dyn SimplifyInfo
    ) -> Option<Box<dyn Fn(AggregateFunction) -> Result<Expr>>> {

jayzhan211 commented 1 week ago

It should be this.

// UDF
fn simplify(
        &self,
    ) -> Option<Box<dyn Fn(AggregateFunction, &dyn SimplifyInfo) -> Result<Expr>>> {

The reason for optional closure is that I assume we need to return Expr for simplified cases. If we expect AggregateArgs, then we might not need closure.

milenkovicm commented 1 week ago

We're on the same page @jayzhan211

jayzhan211 commented 1 week ago

We're on the same page @jayzhan211

If we don't need Expr for simplified UDAF, than we can have

pub fn simplify(
        &self,
        args: AggregateArgs,
        info: &dyn SimplifyInfo,
    ) -> Result<Transformed<AggregateArgs>> {}

I think the next problem is whether we need Expr or just AggregateArgs. The former can let us return the UDAF with a different function. But for this kind of function rewrite, we can handle it in FuncitonWrite. You can take ArrayFunctionRewriter for example.

@milenkovicm

milenkovicm commented 1 week ago

We're on the same page @jayzhan211

If we don't need Expr for simplified UDAF, than we can have
pub fn simplify(
        &self,
        args: AggregateArgs,
        info: &dyn SimplifyInfo,
    ) -> Result<Transformed<AggregateArgs>> {}
I think the next problem is whether we need Expr or just AggregateArgs. The former can let us return the UDAF with a different function. But for this kind of function rewrite, we can handle it in FuncitonWrite. You can take ArrayFunctionRewriter for example.

@milenkovicm

I think we need Expr, this way we can replace function with something else

jayzhan211 commented 1 week ago

We're on the same page @jayzhan211

If we don't need Expr for simplified UDAF, than we can have
pub fn simplify(
        &self,
        args: AggregateArgs,
        info: &dyn SimplifyInfo,
    ) -> Result<Transformed<AggregateArgs>> {}
I think the next problem is whether we need Expr or just AggregateArgs. The former can let us return the UDAF with a different function. But for this kind of function rewrite, we can handle it in FuncitonWrite. You can take ArrayFunctionRewriter for example. @milenkovicm
I think we need Expr, this way we can replace function with something else

I agree.

jayzhan211 commented 1 week ago

You can define closure in datafusion/expr/src/function.rs

alamb commented 6 days ago

Thanks again @jayzhan211 and @milenkovicm 🙏

apache / datafusion