apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.27k stars 1.18k forks source link

New Signature concept #7580

Open izveigor opened 1 year ago

izveigor commented 1 year ago

Is your feature request related to a problem or challenge?

Follow on https://github.com/apache/arrow-datafusion/issues/6559.

Argument quality:

Combine qualities:

Algebra

Kleene algebra {+, ·, *}

TypeSignature Kleene algebra
Variadic (uneq-def1 + uneq-def2 + ...)*
VariadicEqual (eq-undef)*
VariadicAny (uneq-undef)*
Uniform (uneq-def1 + uneq-def2 + ...)^n
Exact (uneq-def1 · uneq-def2 · ...)
Any (uneq-undef)^n

(+): present in the current version;
(-): not present in the current version;

Undefined

Kleene algebra/Quality eq-undef uneq-undef
(arg)* VariadicEqual (+) VariadicAny (+)
(arg)^n Equal (-) Any (+)

Definite

Kleene algebra/Quality eq-def uneq-def
(arg1 + arg2 + ...)* Variadic with single argument (+) Variadic with multiple arguments (+)
(arg1 + arg2 + ...)^n Uniform with single argument (+) Uniform with multiple arguments (+)
(arg1 · arg2 · ...) Exact with the same data type (+) Exact with different data types (+)

Meta Algebra

Kleene algebra {+, ·, *}

TypeSignature Kleene algebra Usage
OneOf (+) expr1 + expr2 + ... used with any TypeSignature if it makes sense
Concat (-) expr1 · expr2 · ... Exact, Uniform, Equal, Any (without Kleene closure)

Argument expansion

/// A function's nested argument.
pub enum TypeNestedArgument {
    /// `DataType::List`
    List,
    /// `DataType::Struct`
    Struct,
}
/// A function's type argument, which defines the function's valid data types.
pub enum TypeArgument {
    /// Supports arbitrarily (with any dimensions) nested data types (like `DataType::List`)
    /// `base_types`: base data types for nested type
    /// `include_nested`: include inner nested datatypes (all defined nested data types are arbitrary)
    ///
    /// # Examples:
    /// `TypeArgument::List { arg: TypeNestedArgument::List, base_types: vec![DataType::Int8, DataType::UInt8], include_nested: vec![TypeNestedArgument::Struct] }` can accept
    /// data types: `DataType::List(DataType::UInt8)`, `DataType::List(DataType::Int8)`,
    /// `DataType::List(DataType::Struct(DataType::Int8))`, `DataType::List(DataType::Struct(DataType::Struct(DataType::UInt8)))` and so on.
    Nested {
        arg: TypeNestedArgument,
        base_types: Vec<DataType>,
        include_nested: Vec<TypeNestedArgument>,
    },
    /// Supports non nested data types
    Common(Vec<DataType>),
}

Variadic

Input:

Variadic(
    vec![
        TypeArgument::Nested {
            arg: TypeNestedArgument::List,
            base_types: vec![DataType::Int8, DataType::UInt8],
            include_nested: vec![],
        },
        TypeArgument::Common(vec![DataType::Int8, DataType::UInt8]),
        TypeArgument::Nested {
            arg: TypeNestedArgument::Struct,
            base_types: vec![DataType::Int8, DataType::UInt8],
            include_nested: vec![],
        }
    ]
)

Output:

(DataType::List(DataType::Int8 + DataType::UInt8 + DataType::List(recursive))
+ DataType::Int8 + DataType::UInt8 + DataType::Struct(DataType::Int8 + DataType::UInt8 + DataType::Struct(recursive)))*

Uniform

Input:

Uniform(
    n,
    vec![
        TypeArgument::Nested {
            arg: TypeNestedArgument::List,
            base_types: vec![DataType::Int8, DataType::UInt8],
            include_nested: vec![],
        },
        TypeArgument::Common(vec![DataType::Int8, DataType::UInt8]),
        TypeArgument::Struct{
            arg: TypeNestedArgument::Struct,
            base_types: vec![DataType::Int8, DataType::UInt8],
            include_nested: vec![],
        }
    ]
)

Output:

(DataType::List(DataType::Int8 + DataType::UInt8 + DataType::List(recursive)) + DataType::Int8 + DataType::UInt8 + DataType::Struct(DataType::Int8 + DataType::UInt8 + DataType::Struct(recursive))) ^ n

Exact

Input:

Exact(
    vec![
        TypeArgument::Nested{
            arg: TypeNestedArgument::List,
            base_types: vec![DataType::Int8, DataType::UInt8],
            include_nested: vec![],
        },
        TypeArgument::Common(vec![DataType::Int8, DataType::UInt8]),
        TypeArgument::Nested{
            arg: TypeNestedArgument::Struct,
            base_types: vec![DataType::Int8, DataType::UInt8],
            include_nested: vec![]
        }
    ]
)

Output:

DataType::List(DataType::Int8 + DataType::UInt8 + DataType::List(recursive)) |
DataType::Int8 |
DataType::UInt8 |
DataType::Struct(DataType::Int8 + DataType::UInt8 + DataType::Struct(recursive))

Proposed code for future features:

            BuiltinScalarFunction::ArrayAppend
            | BuiltinScalarFunction::ArrayPositions
            | BuiltinScalarFunction::ArrayRemove
            | BuiltinScalarFunction::ArrayRemoveAll
            | BuiltinScalarFunction::ArrayHas => Signature::concat(
                vec![
                    Exact(vec![TypeArgument::Nested {
                        arg: TypeNestedArgument::List,
                        base_types: array_expressions::SUPPORTED_ARRAY_TYPES.to_vec(),
                        include_nested: vec![],
                    }]),
                    Uniform(1, vec![
                        TypeArgument::Common(
                            array_expressions::SUPPORTED_ARRAY_TYPES.to_vec(),
                        ),
                        TypeArgument::Nested {
                            arg: TypeNestedArgument::List,
                            base_types: array_expressions::SUPPORTED_ARRAY_TYPES.to_vec(),
                            include_nested: vec![],
                        },
                    ]),
                ],
                self.volatility(),
            ),
            BuiltinScalarFunction::ArrayPopBack
            | BuiltinScalarFunction::ArrayDims
            | BuiltinScalarFunction::ArrayEmpty
            | BuiltinScalarFunction::Flatten
            | BuiltinScalarFunction::ArrayNdims
            | BuiltinScalarFunction::Cardinality => Signature::exact(
                vec![TypeArgument::Nested {
                    arg: TypeNestedArgument::List,
                    base_types: array_expressions::SUPPORTED_ARRAY_TYPES.to_vec(),
                    include_nested: vec![],
                }],
                self.volatility(),
            ),
            BuiltinScalarFunction::ArrayConcat => Signature::variadic(
                vec![TypeArgument::Nested {
                    arg: TypeNestedArgument::List,
                    base_types: array_expressions::SUPPORTED_ARRAY_TYPES.to_vec(),
                    include_nested: vec![],
                }],
                self.volatility(),
            ),
            BuiltinScalarFunction::ArrayHasAll | BuiltinScalarFunction::ArrayHasAny => {
                Signature::exact(
                    vec![
                        TypeArgument::Nested {
                            arg: TypeNestedArgument::List,
                            base_types: array_expressions::SUPPORTED_ARRAY_TYPES.to_vec(),
                            include_nested: vec![],
                        },
                        TypeArgument::Nested {
                            arg: TypeNestedArgument::List,
                            base_types: array_expressions::SUPPORTED_ARRAY_TYPES.to_vec(),
                            include_nested: vec![],
                        },
                    ],
                    self.volatility(),
                )
            }
            BuiltinScalarFunction::ArrayElement => Signature::exact(
                vec![
                    TypeArgument::Nested {
                        arg: TypeNestedArgument::List,
                        base_types: array_expressions::SUPPORTED_ARRAY_TYPES.to_vec(),
                        include_nested: vec![],
                    },
                    TypeArgument::Common(vec![DataType::Int64]),
                ],
                self.volatility(),
            ),
            BuiltinScalarFunction::ArraySlice => Signature::exact(
                vec![
                    TypeArgument::Nested {
                        arg: TypeNestedArgument::List,
                        base_types: array_expressions::SUPPORTED_ARRAY_TYPES.to_vec(),
                        include_nested: vec![],
                    },
                    TypeArgument::Common(vec![DataType::Int64, DataType::Int64]),
                ],
                self.volatility(),
            ),
            BuiltinScalarFunction::ArrayLength => Signature::one_of(
                vec![
                    Exact(vec![TypeArgument::Nested {
                        arg: TypeNestedArgument::List,
                        base_types: array_expressions::SUPPORTED_ARRAY_TYPES.to_vec(),
                        include_nested: vec![],
                    }]),
                    Exact(vec![
                        TypeArgument::Nested {
                            arg: TypeNestedArgument::List,
                            base_types: array_expressions::SUPPORTED_ARRAY_TYPES.to_vec(),
                            include_nested: vec![],
                        },
                        TypeArgument::Common(vec![DataType::Int64]),
                    ]),
                ],
                self.volatility(),
            ),
            BuiltinScalarFunction::ArrayPosition => Signature::one_of(
                vec![Exact(vec![
                    TypeArgument::Nested {
                        arg: TypeNestedArgument::List,
                        base_types: array_expressions::SUPPORTED_ARRAY_TYPES.to_vec(),
                        include_nested: vec![],
                    },
                    TypeArgument::Common(vec![DataType::Int64, DataType::Int64]),
                ])],
                self.volatility(),
            ),
            BuiltinScalarFunction::ArrayPrepend => Signature::concat(
                vec![
                    Uniform(1, vec![
                        TypeArgument::Common(
                            array_expressions::SUPPORTED_ARRAY_TYPES.to_vec(),
                        ),
                        TypeArgument::Nested {
                            arg: TypeNestedArgument::List,
                            base_types: array_expressions::SUPPORTED_ARRAY_TYPES.to_vec(),
                            include_nested: vec![],
                        },
                    ]),
                    Exact(TypeArgument::Nested {
                        arg: TypeNestedArgument::List,
                        base_types: array_expressions::SUPPORTED_ARRAY_TYPES.to_vec(),
                        include_nested: vec![],
                    }),
                ],
                self.volatility(),
            ),
            BuiltinScalarFunction::ArrayRepeat => Signature::concat(
                vec![
                    Uniform(1, vec![
                        TypeArgument::Common(
                            array_expressions::SUPPORTED_ARRAY_TYPES.to_vec(),
                        ),
                        TypeArgument::Nested {
                            arg: TypeNestedArgument::List,
                            base_types: array_expressions::SUPPORTED_ARRAY_TYPES.to_vec(),
                            include_nested: vec![],
                        },
                    ]),
                    Exact(TypeArgument::Common(vec![DataType::Int64])),
                ],
                self.volatility(),
            ),
            BuiltinScalarFunction::ArrayRemoveN => Signature::concat(vec![
                Exact(vec![TypeArgument::Nested {
                    arg: TypeNestedArgument::List,
                    base_types: array_expressions::SUPPORTED_ARRAY_TYPES.to_vec(),
                    include_nested: vec![],
                }]),
                Uniform(
                1,
                    vec![
                    TypeArgument::Common(
                        array_expressions::SUPPORTED_ARRAY_TYPES.to_vec(),
                    ),
                    TypeArgument::Nested {
                        arg: TypeNestedArgument::List,
                        base_types: array_expressions::SUPPORTED_ARRAY_TYPES.to_vec(),
                        include_nested: vec![],
                    },
                ]),
                Exact(vec![TypeArgument::Common(vec![DataType::Int64])]),
            ]),
            BuiltinScalarFunction::ArrayReplace
            | BuiltinScalarFunction::ArrayReplaceAll => Signature::concat(
                vec![
                    Exact(vec![TypeArgument::Nested {
                        arg: TypeNestedArgument::List,
                        base_types: array_expressions::SUPPORTED_ARRAY_TYPES.to_vec(),
                        include_nested: vec![],
                    }]),
                    Uniform(
                    2,
                    vec![
                        TypeArgument::Common(
                            array_expressions::SUPPORTED_ARRAY_TYPES.to_vec(),
                        ),
                        TypeArgument::Nested {
                            arg: TypeNestedArgument::List,
                            base_types: array_expressions::SUPPORTED_ARRAY_TYPES.to_vec(),
                            include_nested: vec![],
                        },
                    ]),
                ],
                self.volatility(),
            ),
            BuiltinScalarFunction::ArrayReplaceN => Signature::concat(
                vec![
                    Exact(vec![TypeArgument::Nested {
                        arg: TypeNestedArgument::List,
                        base_types: array_expressions::SUPPORTED_ARRAY_TYPES.to_vec(),
                        include_nested: vec![],
                    }]),
                    Uniform(2, vec![
                        TypeArgument::Common(
                            array_expressions::SUPPORTED_ARRAY_TYPES.to_vec(),
                        ),
                        TypeArgument::Nested {
                            arg: TypeNestedArgument::List,
                            base_types: array_expressions::SUPPORTED_ARRAY_TYPES.to_vec(),
                            include_nested: vec![],
                        },
                    ]),
                    Exact(vec![TypeArgument::Common(vec![DataType::Int64])]),
                ],
                self.volatility(),
            ),
            BuiltinScalarFunction::ArrayToString => Signature::concat(
                Exact(vec![TypeArgument::Nested {
                    arg: TypeNestedArgument::List,
                    base_types: array_expressions::SUPPORTED_ARRAY_TYPES.to_vec(),
                    include_nested: vec![],
                }]),
                OneOf(vec![
                    Exact(vec![TypeArgument::Common(vec![DataType::Utf8])]),
                    Exact(vec![TypeArgument::Common(vec![
                        DataType::Utf8,
                        DataType::Utf8,
                    ])]),
                ]),
            ),
            BuiltinScalarFunction::MakeArray => Signature::concat(
                vec![Variadic(vec![
                    TypeArgument::Nested {
                        arg: TypeNestedArgument::List,
                        base_types: array_expressions::SUPPORTED_ARRAY_TYPES.to_vec(),
                        include_nested: vec![],
                    },
                    TypeArgument::Common(
                        array_expressions::SUPPORTED_ARRAY_TYPES.to_vec(),
                    ),
                ])],
                self.volatility(),
            ),

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

crepererum commented 1 year ago

I agree that the signature system needs some work and I guess that you came up w/ a quite solid approach. However I am having some trouble following it :sweat_smile: . Some questions / comments:

izveigor commented 1 year ago

Hello, @crepererum! I'll try to explain as best I can.

About the categories:

There is a finite set of data types (which we denote by DataType) Any of the possible types (options) of Signature can be described into 4 categories:
definite (def) and undefined (undef);

Definition: a definite data type means that only a specific data type from the DataType set is suitable. (Example Exact(DataType::Int8) accepts (DataType::Int8), but not (DataType::UInt8)).

Definition: An undefined data type means that any data type from the set of DataTypes is suitable. (Example: Any() accepts (DataType::Int8), and also (DataType::UInt8)).

equal (eq) and unequal (uneq);

Definition of Equality: the category of equality means that all elements are equal to each other. (Example: VariadicEqual() can accept (DataType::Int8, DataType::Int8) or (DataType::UInt8, DataType::UInt8, DataType::UInt8), but not (DataType::Int8, DataType::UInt8)).

Definition of Inequality: The category of inequality means that the elements can be arbitrary. (Example: VariadicAny() can accept (DataType::Int8, DataType::Int8), and also (DataType::Int8, DataType::UInt8))

Now, combine categories:

About Kleene algebra:

I choosed Kleene algebra (which is used for regular expressions). So if we create the algorithm for Signature (i. e. for boolean function, which can accept either input data set or not). For out case it is sufficient to apply only regular language (like can accept Deterministic finite automaton (DFA)).

So, each type of signature represents a seperate DFA.

About meta algebra:

As exist situations, which a signature can check not only by one DFA, but many. So, we create the same regular meta language. For example, if we want to use two DFAs, and if one of it returns the positive answer, than input data set suits us (OneOf case).

For full compatibility, it is worth adding two new signature types (Equal и Concat).
Separately, it is worth mentioning the function Concat. Concat can accept only input data set (without Kleene star (only Equal, Any, Uniform and Exact)). Concat takes a set of data, divides them according to the size of each type of signature and uses the already specific DFA.

English is not the author’s native language, so there may be some difficulties in understanding. I hope you understood me and my idea seemed reasonable to you :)