apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.03k stars 1.14k forks source link

Unified Function `Signature` type coercion handling for `Nulls` #12698

Open alamb opened 2 days ago

alamb commented 2 days ago

Is your feature request related to a problem or challenge?

The code in https://github.com/apache/datafusion/pull/12308 from @mesejo added explicit DataType::Null data type handling for type coercion for certain functions

However, this may have introduced a regression for handling Dictionary types https://github.com/apache/datafusion/issues/12670

Describe the solution you'd like

I would like the "normal" function signature resolution to handle Null coercion rather than requiring functions to provide a custom coerce method (as was done in https://github.com/apache/datafusion/pull/12308)

Comments from https://github.com/apache/datafusion/issues/12670#issuecomment-2385333981

Describe alternatives you've considered

Ideally I think the coercion logic should be able to substitute Null for any data type passed in

So given a signature like

                    Exact(vec![Utf8View, Utf8View]),

I would expect the coercion logic to be able to handle inputs like the following (by casting Null to Utf8View)

(Null, Null)
(Null, Utf8View) 
(Utf8View, Null)
(Utf8View, Utf8View)

Additional context

No response

jayzhan211 commented 2 days ago

We can introduce TypeSignature::String similar to Numeric one

findepi commented 1 day ago

Would that allow to avoid use of Signature::user_defined? if functions use user_defined, then same functionality will have to be implemented independently, and inconsistencies are inevitable. Can we make it a goal that not to require Signature::user_defined to implement useful things?

BTW this related to two topics

alamb commented 1 day ago

Can we make it a goal that not to require Signature::user_defined to implement useful things?

Yes, I think this should be the goal

The usecase for Signature::user_defined that I recall was to implement coercion rules for very specific functions (like coalesce)

We can introduce TypeSignature::String similar to Numeric one

I still don't fully understand why we can't just adjust the existing coercion rules to coerce Null to any needed type

findepi commented 1 day ago

Can we make it a goal that not to require Signature::user_defined to implement useful things?

Yes, I think this should be the goal

filed https://github.com/apache/datafusion/issues/12725 for this.

I still don't fully understand why we can't just adjust the existing coercion rules to coerce Null to any needed type

i would assume this is the case today. https://github.com/apache/datafusion/pull/12712 changes strpos to accept various string types (but not Null), yet it works for null input values too.

jayzhan211 commented 1 day ago

Would that allow to avoid use of Signature::user_defined? if functions use user_defined, then same functionality will have to be implemented independently, and inconsistencies are inevitable. Can we make it a goal that not to require Signature::user_defined to implement useful things?

BTW this related to two topics

Signature::user_defined is only used for special function ideally, it is used in many places because we don't have a nice signature system yet.

I have an old issue about the coercion rule and type signature https://github.com/apache/datafusion/issues/10507

I think the downside of coerce_from is that since all the coercion rule is in the one place, any change to it has unknown effect therefore hard to maintain, also it is not extensible. This is why we come out many other TypeSignature that avoid to go through the coerce_from.

Ideally we should have general signature like TypeSignature::Numeric or TypeSignature::String for most of the function. Only if the one that is too specific (only used in one function), we use TypeSignature::user_defined.