Open edmondop opened 3 months ago
We should definitely explain them better in the doc. Here is my understanding. I'm wondering if anyone has additional thoughts or if I'm understanding something wrong.
PlanningError
if input arg is wrong. (It could be hard to specify the behavior, link to the actual checking code might help...)return_type()
, here we should return PlanningError
if input argument is wronginvoke()
...which should not be encouragedinvoke()
, when there is any expected error (e.g. something like divide by 0), explicitly return ExecutionError
, I think it's better not to directly return lower level error messages like ArrowError
, which might be hard to interpret sometimes.invoke()
if there are some branches that should be unreachable, throw InternalError
in such branches. If such a branch is executed there might be a potential bug, we want to fail fast in this case.I agree with what @2010YOUY01 says
Some additional comments about "Planning Error" are that for most UDFs that provide a signature DataFusion will ensure that the function is called with one of the valid types.
The UDF can assume the inputs are one of the valid combinations
If it wants to check and validate, it should return an internal_err if the planner provides arguments that don't match one of its proposed signatures
I had done some work in aligning the errors in the past, see #9164 for details
@2010YOUY01 just to confirm, for example in this case https://github.com/apache/datafusion/blob/a721be1b1d863b5b15a7a945c37ec051c449c46f/datafusion/functions-nested/src/resize.rs#L72
it should return a planner error, not an exec_error. Also the check on the length of the arguments should be done in return_type
and not in https://github.com/apache/datafusion/blob/a721be1b1d863b5b15a7a945c37ec051c449c46f/datafusion/functions-nested/src/resize.rs#L90
Is this correct?
@2010YOUY01 just to confirm, for example in this case
it should return a planner error, not an exec_error. Also the check on the length of the arguments should be done in
return_type
and not in https://github.com/apache/datafusion/blob/a721be1b1d863b5b15a7a945c37ec051c449c46f/datafusion/functions-nested/src/resize.rs#L90Is this correct?
I agree.
The first error can be triggered by bad input, so it's a planning error
> select array_resize('foo', 5, 0);
Execution error: Not reachable, data_type should be List, LargeList or FixedSizeList
The second one is redundant, but it's also okay to change it to internal_err!("Not reachable, arguments number check should already be done")
Ideally argument check should be handled by signature
not return_type
. Types should be coerced correctly when it reaches return_type
.
But before the signature
are mature enough, we somehow need to add the check (length, types) inside return_type
I had file an issue to improve on signature for coercion and length + type checking for functions https://github.com/apache/datafusion/issues/10507
Although I'm not sure does it really help or not since Signature::UserDefined
might be good enough to handle complex type signature already so I didn't push forward on this and move on other tasks instead
@jayzhan211 do you think we should start by adding an helper macro for pattern matching branches of UDF invoke
that should never happen?
udf_unreachable!()
that might create an InternalError
DataFusionError::Internal("This branch of the UDF should never be executed since pre-requisites are checked through type_signature and return_type")
IMO we should improve on signature instead of adding check in different function. This not only add unnecessary overhead and let the code less readable. And ideally we don't need any check in return_type
and invoke
IMO we should improve on signature instead of adding check in different function. This not only add unnecessary overhead and let the code less readable. And ideally we don't need any check in
return_type
andinvoke
Wouldn't inevitably be necessary if you do pattern matching to handle the branches that are impossible? (Because of signature enforcement)
Only if we use refined types it would be possible to avoid I think
Only if we use refined types it would be possible to avoid I think
I guess at least we could verified it with UserDefined
and check against all the edge case in coerce_types
Is your feature request related to a problem or challenge?
When writing a new UDF, a developer needs to decide how to perform error management in functions that return
Result
, such asreturn_type
andinvoke
. Looking at the existing codebase it is not obvious what are the error management best practicesErrors in return type
Regexp like uses an
plan_err
data types of args do not match the expected types https://github.com/apache/datafusion/blob/77311a5896272c7ed252d8cd53d48ec6ea7c0ccf/datafusion/functions/src/regex/regexplike.rs#L74 . Is this check redundant?Except expects two arguments but doesn't check the length, just the type https://github.com/apache/datafusion/blob/77311a5896272c7ed252d8cd53d48ec6ea7c0ccf/datafusion/functions-array/src/except.rs#L55.
Errors in invoke
In the resize function, there is a check on argument lengths in invoke which is not present in the
return_type
function https://github.com/apache/datafusion/blob/77311a5896272c7ed252d8cd53d48ec6ea7c0ccf/datafusion/functions-array/src/resize.rs#L27The same function also returns
exec_err
, andinternal_datafusion_err
Describe the solution you'd like
As a developers of custom UDF I would like to know:
Describe alternatives you've considered
No response
Additional context
No response