GreptimeTeam / greptimedb

An open-source, cloud-native, unified time series database for metrics, logs and events with SQL/PromQL supported. Available on GreptimeCloud.
https://greptime.com/
Apache License 2.0

High level framework for ergonomic UDF programming #4856

Open sunng87 opened 1 month ago

sunng87 commented 1 month ago

What type of enhancement is this?

Refactor

What does the enhancement do?

The idea is to create a high-level framework for UDF development (not UDAF), to remove boilerplate code and improve ergonomics.

The core responsibility of this framework is to provide:

Current status

At the moment, a typical implementation of UDF looks like this one: https://github.com/GreptimeTeam/greptimedb/blob/main/src/common/function/src/scalars/geo/h3.rs#L95

Basically we do the following steps to generate the result vector:

  1. Validate input columns: the column count, and the length of each column
  2. Initialise the result vector
  3. Extract values row by row from the columns
  4. Call the Rust function and deal with errors, if any
  5. Fill the result vector
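The steps above can be sketched as a hand-written eval in the current style. Note this is a simplified illustration: the `Value` enum and `eval_degrees_to_radians` below are toy stand-ins invented for this sketch, not GreptimeDB's real `Value`/vector types.

```rust
// Toy stand-in for GreptimeDB's Value type (an assumption for illustration).
#[derive(Clone, Debug, PartialEq)]
enum Value {
    Float64(f64),
    Null,
}

// A hand-written UDF eval in the current style: every implementation
// repeats the validate / initialise / extract / call / fill pattern.
fn eval_degrees_to_radians(columns: &[Vec<Value>]) -> Result<Vec<Value>, String> {
    // 1. Validate input columns: column count (lengths are uniform here).
    if columns.len() != 1 {
        return Err(format!("expected 1 column, got {}", columns.len()));
    }
    let col = &columns[0];
    // 2. Initialise the result vector.
    let mut result = Vec::with_capacity(col.len());
    // 3. Extract values row by row.
    for v in col {
        match v {
            // 4. Call the plain Rust function (here: f64::to_radians).
            Value::Float64(x) => result.push(Value::Float64(x.to_radians())),
            Value::Null => result.push(Value::Null),
        }
    }
    // 5. Return the filled result vector.
    Ok(result)
}
```

Only the single line calling `to_radians` is function-specific; everything else is the boilerplate the framework should absorb.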

Desired state

Because every implementation has to do steps 1/2/3/5, an ergonomic solution is to provide a declarative way to extract Rust data types from column vectors, so the user can simply focus on calling the Rust function. The implementation of a UDF should be stateless, so until we have a real use case, we don't need to provide any execution context beyond the original FunctionContext.

Inspired by how axum designed its web handlers, the API could look like:


```rust
trait Extract: Sized {
    fn extract(v: Value) -> Option<Self>;

    fn validate(&self) -> Result<()>;
}

struct Coordinate(f64);

impl Extract for Coordinate {
    fn extract(v: Value) -> Option<Self> {
        ...
    }

    fn validate(&self) -> Result<()> {
        Ok(())
    }
}

struct Resolution(i8);

impl Extract for Resolution {
    fn extract(v: Value) -> Option<Self> {
        ...
    }

    fn validate(&self) -> Result<()> {
        ensure!(self.0 >= 0 && self.0 < 18)
    }
}

trait FunctionExt1: Function {
    type T0: Extract;
    fn call(_ctx: FunctionContext, arg0: Self::T0) -> R;
}

trait FunctionExt2: Function {
    type T0: Extract;
    type T1: Extract;
    fn call(_ctx: FunctionContext, arg0: Self::T0, arg1: Self::T1) -> R;
}

trait FunctionExt3: Function {
    type T0: Extract;
    type T1: Extract;
    type T2: Extract;
    fn call(_ctx: FunctionContext, arg0: Self::T0, arg1: Self::T1, arg2: Self::T2) -> R;
}

// ...and so on for higher arities
```

FunctionExtN will provide a default implementation for Function::eval.
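To make the default-eval idea concrete, here is a compilable sketch for the one-argument case. The `Value = Option<f64>` alias, the `String` error type, and the `Radians` example are toy assumptions standing in for GreptimeDB's real types; the point is only the shape of the blanket `eval`:

```rust
// Toy stand-in for GreptimeDB's Value (an assumption for illustration).
type Value = Option<f64>;

trait Extract: Sized {
    fn extract(v: Value) -> Option<Self>;
    fn validate(&self) -> Result<(), String>;
}

struct Coordinate(f64);

impl Extract for Coordinate {
    fn extract(v: Value) -> Option<Self> {
        v.map(Coordinate)
    }
    fn validate(&self) -> Result<(), String> {
        if self.0.is_finite() { Ok(()) } else { Err("non-finite coordinate".into()) }
    }
}

// One-argument flavour: the user writes `call`; the framework
// supplies `eval` as a provided method.
trait FunctionExt1 {
    type T0: Extract;

    fn call(arg0: Self::T0) -> Value;

    // Default eval: validate column shape, then extract/validate/call per row.
    fn eval(columns: &[Vec<Value>]) -> Result<Vec<Value>, String> {
        if columns.len() != 1 {
            return Err(format!("expected 1 column, got {}", columns.len()));
        }
        let mut result = Vec::with_capacity(columns[0].len());
        for v in &columns[0] {
            match <Self::T0 as Extract>::extract(*v) {
                Some(arg) => {
                    arg.validate()?;
                    result.push(Self::call(arg));
                }
                None => result.push(None), // null in, null out
            }
        }
        Ok(result)
    }
}

// A UDF implementation now shrinks to the business logic.
struct Radians;

impl FunctionExt1 for Radians {
    type T0 = Coordinate;
    fn call(arg0: Coordinate) -> Value {
        Some(arg0.0.to_radians())
    }
}
```

With this shape, `Radians::eval(...)` handles validation, null propagation, and result filling without any per-function boilerplate.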

TODO: think about how to deal with R

Limitation

Documentation

A procedural macro is preferred in this case for two types of usage:
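As a purely hypothetical sketch of what macro-based usage might look like (the `#[udf]` attribute, its `name` option, and the derive below are invented here, not an existing API):

```rust
// Hypothetical attribute macro: the extraction, validation, and
// FunctionExtN plumbing would be generated, not written by hand.
#[udf(name = "h3_cell")]
fn h3_cell(lat: Coordinate, lng: Coordinate, r: Resolution) -> Result<u64> {
    // plain Rust body; columns never appear here
    ...
}

// A derive macro could likewise generate Extract for newtype wrappers.
#[derive(Extract)]
struct Resolution(i8);
```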

Implementation challenges

No response

evenyag commented 1 month ago

@sunng87 We are going to remove the wrapper layer of our UDF/UDAF and use datafusion's UDF API in the future. Not sure whether this issue can benefit from it.

sunng87 commented 1 month ago

@evenyag If we use datafusion's API, are we still using our own Vector as input?

evenyag commented 4 weeks ago

> @evenyag If we use datafusion's API, are we still using our own Vector as input?

No, we process arrow's arrays directly. Writing a simple UDF should be easy. https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/simple_udf.rs