GreptimeTeam / greptimedb

An open-source, cloud-native, unified time series database for metrics, logs and events with SQL/PromQL supported. Available on GreptimeCloud.
https://greptime.com/
Apache License 2.0

High level framework for ergonomic UDF programming #4856

Open sunng87 opened 1 month ago

sunng87 commented 1 month ago

What type of enhancement is this?

Refactor

What does the enhancement do?

The idea is to create a high-level framework for UDF development (not UDAF), to remove boilerplate code and improve ergonomics.

The core responsibility of this framework is to provide:

Current status

At the moment, a typical implementation of UDF looks like this one: https://github.com/GreptimeTeam/greptimedb/blob/main/src/common/function/src/scalars/geo/h3.rs#L95

Basically we do the following steps to generate the result vector:

  1. Validate input columns: the column count, and the length of each column
  2. Initialise the result vector
  3. Extract values row by row from the columns
  4. Call the Rust function and deal with errors, if any
  5. Fill the result vector
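The steps above can be sketched as a hand-written eval in the current style. Note this is a simplified illustration: the `Value` enum and `eval_degrees_to_radians` below are toy stand-ins invented for this sketch, not GreptimeDB's real `Value`/vector types.

```rust
// Toy stand-in for GreptimeDB's Value type (an assumption for illustration).
#[derive(Clone, Debug, PartialEq)]
enum Value {
    Float64(f64),
    Null,
}

// A hand-written UDF eval in the current style: every implementation
// repeats the validate / initialise / extract / call / fill pattern.
fn eval_degrees_to_radians(columns: &[Vec<Value>]) -> Result<Vec<Value>, String> {
    // 1. Validate input columns: column count (lengths are uniform here).
    if columns.len() != 1 {
        return Err(format!("expected 1 column, got {}", columns.len()));
    }
    let col = &columns[0];
    // 2. Initialise the result vector.
    let mut result = Vec::with_capacity(col.len());
    // 3. Extract values row by row.
    for v in col {
        match v {
            // 4. Call the plain Rust function (here: f64::to_radians).
            Value::Float64(x) => result.push(Value::Float64(x.to_radians())),
            Value::Null => result.push(Value::Null),
        }
    }
    // 5. Return the filled result vector.
    Ok(result)
}
```

Only the single line calling `to_radians` is function-specific; everything else is the boilerplate the framework should absorb.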

Desired state

Because every implementation has to do steps 1/2/3/5, an ergonomic solution is to provide a declarative way to extract Rust data types from column vectors, so the user can simply focus on calling the Rust function. The implementation of a UDF should be stateless, so until we have a real use case, we don't need to provide any execution context beyond the original FunctionContext.

Inspired by how axum designed its web handlers, the API could look like:


```rust
trait Extract: Sized {
    fn extract(v: Value) -> Option<Self>;

    fn validate(&self) -> Result<()>;
}

struct Coordinate(f64);

impl Extract for Coordinate {
    fn extract(v: Value) -> Option<Self> {
        ...
    }

    fn validate(&self) -> Result<()> {
        Ok(())
    }
}

struct Resolution(i8);

impl Extract for Resolution {
    fn extract(v: Value) -> Option<Self> {
        ...
    }

    fn validate(&self) -> Result<()> {
        ensure!(self.0 >= 0 && self.0 < 18)
    }
}

trait FunctionExt1: Function {
    type T0: Extract;
    fn call(_ctx: FunctionContext, arg0: Self::T0) -> R;
}

trait FunctionExt2: Function {
    type T0: Extract;
    type T1: Extract;
    fn call(_ctx: FunctionContext, arg0: Self::T0, arg1: Self::T1) -> R;
}

trait FunctionExt3: Function {
    type T0: Extract;
    type T1: Extract;
    type T2: Extract;
    fn call(_ctx: FunctionContext, arg0: Self::T0, arg1: Self::T1, arg2: Self::T2) -> R;
}

// ...and so on for higher arities
```

FunctionExtN will provide a default implementation for Function::eval.
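To make the default-eval idea concrete, here is a compilable sketch for the one-argument case. The `Value = Option<f64>` alias, the `String` error type, and the `Radians` example are toy assumptions standing in for GreptimeDB's real types; the point is only the shape of the blanket `eval`:

```rust
// Toy stand-in for GreptimeDB's Value (an assumption for illustration).
type Value = Option<f64>;

trait Extract: Sized {
    fn extract(v: Value) -> Option<Self>;
    fn validate(&self) -> Result<(), String>;
}

struct Coordinate(f64);

impl Extract for Coordinate {
    fn extract(v: Value) -> Option<Self> {
        v.map(Coordinate)
    }
    fn validate(&self) -> Result<(), String> {
        if self.0.is_finite() { Ok(()) } else { Err("non-finite coordinate".into()) }
    }
}

// One-argument flavour: the user writes `call`; the framework
// supplies `eval` as a provided method.
trait FunctionExt1 {
    type T0: Extract;

    fn call(arg0: Self::T0) -> Value;

    // Default eval: validate column shape, then extract/validate/call per row.
    fn eval(columns: &[Vec<Value>]) -> Result<Vec<Value>, String> {
        if columns.len() != 1 {
            return Err(format!("expected 1 column, got {}", columns.len()));
        }
        let mut result = Vec::with_capacity(columns[0].len());
        for v in &columns[0] {
            match <Self::T0 as Extract>::extract(*v) {
                Some(arg) => {
                    arg.validate()?;
                    result.push(Self::call(arg));
                }
                None => result.push(None), // null in, null out
            }
        }
        Ok(result)
    }
}

// A UDF implementation now shrinks to the business logic.
struct Radians;

impl FunctionExt1 for Radians {
    type T0 = Coordinate;
    fn call(arg0: Coordinate) -> Value {
        Some(arg0.0.to_radians())
    }
}
```

With this shape, `Radians::eval(...)` handles validation, null propagation, and result filling without any per-function boilerplate.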

TODO: think about how to deal with R

Limitation

Documentation

A procedural macro is preferred in this case for two types of usage:
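As a purely hypothetical sketch of what macro-based usage might look like (the `#[udf]` attribute, its `name` option, and the derive below are invented here, not an existing API):

```rust
// Hypothetical attribute macro: the extraction, validation, and
// FunctionExtN plumbing would be generated, not written by hand.
#[udf(name = "h3_cell")]
fn h3_cell(lat: Coordinate, lng: Coordinate, r: Resolution) -> Result<u64> {
    // plain Rust body; columns never appear here
    ...
}

// A derive macro could likewise generate Extract for newtype wrappers.
#[derive(Extract)]
struct Resolution(i8);
```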

Implementation challenges

No response

evenyag commented 1 month ago

@sunng87 We are going to remove the wrapper layer of our UDF/UDAF and use datafusion's UDF API in the future. Not sure whether this issue can benefit from it.

sunng87 commented 1 month ago

@evenyag If we use datafusion's API, are we still using our own Vector as input?

evenyag commented 4 weeks ago

> @evenyag If we use datafusion's API, are we still using our own Vector as input?

No, we process arrow's arrays directly. Writing a simple UDF should be easy. https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/simple_udf.rs