apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Strengthen TypeSignature and Coercion rule. #10507

Open jayzhan211 opened 4 months ago

jayzhan211 commented 4 months ago

Is your feature request related to a problem or challenge?

Inspired by #10268. I have an idea to improve the current type signature and coercion design.

What is the current status

Given the function arguments, we check them against the function's defined TypeSignature. get_valid_types is the function that computes the candidate valid types from the TypeSignature. Once we have all the candidates, we pick one of them, using coerced_from as the core coercion rule: a candidate for which every argument type is coercible is the one we take.
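The flow described above can be sketched roughly as follows. This is a toy model, not DataFusion's actual code: `DataType` is a small stand-in for the Arrow type, `OneOfExact`, `can_coerce`, and `data_types` are hypothetical names, and picking the first matching candidate is an assumption made for the sketch.

```rust
// Toy stand-in for arrow's DataType; only a few variants are modeled.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataType {
    Int32,
    Int64,
    Float64,
    Utf8,
}

// Hypothetical signature: each inner Vec is one accepted argument list.
struct OneOfExact(Vec<Vec<DataType>>);

// Simplified analogue of get_valid_types: enumerate the candidate
// argument lists that have the right length.
fn get_valid_types(sig: &OneOfExact, current: &[DataType]) -> Vec<Vec<DataType>> {
    sig.0
        .iter()
        .filter(|cand| cand.len() == current.len())
        .cloned()
        .collect()
}

// Simplified analogue of coerced_from: can `from` be coerced into `into`?
// The rules listed here are illustrative only.
fn can_coerce(into: DataType, from: DataType) -> bool {
    use DataType::*;
    into == from
        || matches!(
            (into, from),
            (Int64, Int32) | (Float64, Int32) | (Float64, Int64)
        )
}

// Take the first candidate in which every argument is coercible.
fn data_types(sig: &OneOfExact, current: &[DataType]) -> Option<Vec<DataType>> {
    get_valid_types(sig, current).into_iter().find(|cand| {
        cand.iter()
            .zip(current)
            .all(|(into, from)| can_coerce(*into, *from))
    })
}
```

Note how the signature only enumerates candidates, while all the real decisions live in the coercion helper.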

What is the issue of the current approach

Because signature support is limited, we rely heavily on the coercion rule to get the expected types, and we end up with complex coercion logic inside the coerced_from function. This not only makes it hard to maintain (removing or changing a rule might cause unknown issues in other functions), it also duplicates logic similar to the binary::coercion rule, which is really confusing.
There are also cases that have coercion rules inside a function's return_type, which is not the expected place to deal with coercion.

How to fix this

I think it is possible to improve the design of TypeSignature so that we can find the single valid set of types given the current types. The valid types we get are already coercible, so we don't need the coerced_from function anymore!

After the change, we can eliminate the coerced_from function, and only the binary::coercion rule remains.
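To illustrate the idea, here is a minimal sketch in which the signature itself returns the single valid argument list, with no separate coerced_from pass. Everything in it is hypothetical: `ToySignature`, `UniformNumeric`, `resolve`, and `numeric_rank` are invented for illustration, and the "widest numeric type wins" rule merely stands in for whatever the binary coercion rule would decide.

```rust
// Toy stand-in for arrow's DataType.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataType {
    Int32,
    Int64,
    Float64,
    Utf8,
}

// Hypothetical signature: `arity` arguments that must all end up
// as the same numeric type.
enum ToySignature {
    UniformNumeric { arity: usize },
}

// Assumed widening order for the sketch; non-numeric types have no rank.
fn numeric_rank(t: DataType) -> Option<u8> {
    match t {
        DataType::Int32 => Some(0),
        DataType::Int64 => Some(1),
        DataType::Float64 => Some(2),
        DataType::Utf8 => None,
    }
}

impl ToySignature {
    // Returns the one valid argument list for `current`, or an error.
    fn resolve(&self, current: &[DataType]) -> Result<Vec<DataType>, String> {
        let ToySignature::UniformNumeric { arity } = self;
        if current.len() != *arity {
            return Err(format!(
                "expected {} arguments, got {}",
                arity,
                current.len()
            ));
        }
        // Track the widest numeric type seen so far.
        let mut widest: Option<(u8, DataType)> = None;
        for &t in current {
            let rank = numeric_rank(t)
                .ok_or_else(|| format!("{:?} is not numeric", t))?;
            if widest.map_or(true, |(r, _)| rank > r) {
                widest = Some((rank, t));
            }
        }
        match widest {
            Some((_, target)) => Ok(vec![target; current.len()]),
            None => Err("at least one argument is required".to_string()),
        }
    }
}
```

Under these assumptions, resolving `[Int32, Float64]` against a two-argument signature yields `[Float64, Float64]` directly, and a `Utf8` argument is rejected by the signature rather than by a later coercion step.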

Additional context

Problematic examples

array_concat has a variadic-any signature, with the coercion rule inside return_type. nullif also has a coercion rule inside return_type.

coerced_from has numeric coercion, list coercion, timestamp coercion, and even comparison_binary_numeric_coercion (which will be removed in #10268).

#10268 is the first step! 🚀

Describe the solution you'd like

  1. Support / improve TypeSignature so we can get the only possible valid types given the argument types we have.
  2. Remove the coerced_from function.

Describe alternatives you've considered

I assume it is possible to find the only valid types given the argument types. If that assumption is false, we need to find another solution.

Additional context

No response

jayzhan211 commented 1 month ago

The current ideal state in my mind

Signature does 3 things

  1. Length check
  2. Type check
  3. Coercion

For length, the common checks are

  1. Exact number
  2. Variadic (Any number)
  3. VariadicNonZero (Any number but at least one)
  4. VariadicEven (less common, e.g. Map)
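The four length checks listed above could be modeled as a small enum. This is only a sketch: `Arity` and `accepts` are hypothetical names, and treating zero arguments as valid for VariadicEven is an assumption the list does not settle.

```rust
// Hypothetical length-check enum mirroring the four cases above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Arity {
    // Exactly this many arguments.
    Exact(usize),
    // Any number of arguments, including zero.
    Variadic,
    // Any number of arguments, but at least one.
    VariadicNonZero,
    // An even number of arguments (assumed here to include zero).
    VariadicEven,
}

impl Arity {
    // Returns true if a call with `n` arguments passes the length check.
    fn accepts(self, n: usize) -> bool {
        match self {
            Arity::Exact(k) => n == k,
            Arity::Variadic => true,
            Arity::VariadicNonZero => n >= 1,
            Arity::VariadicEven => n % 2 == 0,
        }
    }
}
```

Keeping the length check separate from the type check means each signature only has to say how many arguments it takes, once.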

For types, we have two styles: Exact and Coercion. Exact rejects if the type mismatches; Coercion rejects if the type is not coercible to the target type.
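A minimal sketch of the two styles, assuming a toy `DataType` and a single illustrative coercion rule (Int32 to Int64). `TypeCheck`, `Coercible`, `can_coerce`, and `check` are hypothetical names, not existing DataFusion items.

```rust
// Toy stand-in for arrow's DataType.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataType {
    Int32,
    Int64,
    Utf8,
}

// Hypothetical per-argument type check in the two styles.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TypeCheck {
    // The argument must already have exactly this type.
    Exact(DataType),
    // The argument must be coercible to this type.
    Coercible(DataType),
}

// Illustrative coercion rule: identity, plus Int32 -> Int64.
fn can_coerce(into: DataType, from: DataType) -> bool {
    into == from || matches!((into, from), (DataType::Int64, DataType::Int32))
}

impl TypeCheck {
    // Returns the type the argument should be cast to, or an error.
    fn check(self, actual: DataType) -> Result<DataType, String> {
        match self {
            TypeCheck::Exact(want) if actual == want => Ok(want),
            TypeCheck::Coercible(want) if can_coerce(want, actual) => Ok(want),
            TypeCheck::Exact(want) | TypeCheck::Coercible(want) => Err(format!(
                "{:?} is not accepted where {:?} is expected",
                actual, want
            )),
        }
    }
}
```

With these rules, an `Int32` argument fails `Exact(Int64)` but passes `Coercible(Int64)` and is reported as `Int64`.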

The combinations of these are

For signatures with non-uniform length and more than one data type, we could use UserDefined.

The trickier part is the DataType. We have many functions that expect a Numeric type, which includes integer, float, .... For functions that expect a string, there are Utf8, LargeUtf8, and Utf8View.

For type checking, it would be nice to have a more general enum that covers more than one DataType to check against.

enum ArgumentType {
    Numeric,
    Integer,
    Float,
    List,
    // ...
}
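As a rough sketch of how such an enum could be checked against concrete types: the `DataType` below is a toy stand-in (its `List` variant carries no element type), the `String` category is an assumption added for the Utf8 family mentioned earlier, and `matches` is a hypothetical method name.

```rust
// Toy stand-in for arrow's DataType; List is simplified to a unit variant.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataType {
    Int32,
    Int64,
    Float32,
    Float64,
    Utf8,
    LargeUtf8,
    Utf8View,
    List,
}

// Categories that each cover more than one DataType.
// `String` is an assumed addition for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ArgumentType {
    Numeric,
    Integer,
    Float,
    String,
    List,
}

impl ArgumentType {
    // Returns true if the concrete type belongs to this category.
    fn matches(self, t: DataType) -> bool {
        use DataType::*;
        match self {
            ArgumentType::Integer => matches!(t, Int32 | Int64),
            ArgumentType::Float => matches!(t, Float32 | Float64),
            // Numeric is the union of Integer and Float.
            ArgumentType::Numeric => {
                ArgumentType::Integer.matches(t) || ArgumentType::Float.matches(t)
            }
            ArgumentType::String => matches!(t, Utf8 | LargeUtf8 | Utf8View),
            ArgumentType::List => matches!(t, List),
        }
    }
}
```

A signature could then name a category once instead of listing every concrete type it accepts.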

Now, we have

TypeSignature::Numeric is one of the ideas that came out of this. For other kinds of complex type checks or length checks, we fall back to UserDefined.
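The split between a built-in numeric signature and a user-defined fallback might look something like the sketch below. It is a toy model under stated assumptions: `ToySignature`, `CoerceFn`, `check`, and `map_like` are hypothetical, the `Numeric` arm only loosely imitates TypeSignature::Numeric, and the map-like rule (an even number of arguments with Utf8 keys) is invented purely to show a check that a simple signature cannot express.

```rust
// Toy stand-in for arrow's DataType.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DataType {
    Int32,
    Float64,
    Utf8,
}

// A user-supplied rule that does its own length, type, and coercion checks.
type CoerceFn = fn(&[DataType]) -> Result<Vec<DataType>, String>;

enum ToySignature {
    // Exactly `n` numeric arguments.
    Numeric(usize),
    // Anything more complex is delegated to the function itself.
    UserDefined(CoerceFn),
}

fn is_numeric(t: DataType) -> bool {
    matches!(t, DataType::Int32 | DataType::Float64)
}

fn check(sig: &ToySignature, args: &[DataType]) -> Result<Vec<DataType>, String> {
    match sig {
        ToySignature::Numeric(n) => {
            if args.len() != *n {
                return Err(format!("expected {} arguments, got {}", n, args.len()));
            }
            if args.iter().all(|t| is_numeric(*t)) {
                Ok(args.to_vec())
            } else {
                Err("expected numeric arguments".to_string())
            }
        }
        ToySignature::UserDefined(f) => f(args),
    }
}

// Hypothetical user-defined rule: key/value pairs whose keys are Utf8.
fn map_like(args: &[DataType]) -> Result<Vec<DataType>, String> {
    if args.len() % 2 != 0 {
        return Err("expected an even number of arguments".to_string());
    }
    if args.iter().step_by(2).all(|t| *t == DataType::Utf8) {
        Ok(args.to_vec())
    } else {
        Err("every key must be Utf8".to_string())
    }
}
```

The common case stays declarative, and only functions with unusual requirements pay the cost of writing their own rule.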