apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.27k stars 1.19k forks source link

Improve consistency and documentation on error handling in in UDFs #11618

Open edmondop opened 3 months ago

edmondop commented 3 months ago

Is your feature request related to a problem or challenge?

When writing a new UDF, a developer needs to decide how to perform error management in functions that return Result, such as return_type and invoke. Looking at the existing codebase it is not obvious what are the error management best practices

Errors in return type

Errors in invoke

In the resize function, there is a check on argument lengths in invoke which is not present in the return_type function https://github.com/apache/datafusion/blob/77311a5896272c7ed252d8cd53d48ec6ea7c0ccf/datafusion/functions-array/src/resize.rs#L27

The same function also returns exec_err, and internal_datafusion_err

Describe the solution you'd like

As a developers of custom UDF I would like to know:

Describe alternatives you've considered

No response

Additional context

No response

2010YOUY01 commented 3 months ago

We should definitely explain them better in the doc. Here is my understanding. I'm wondering if anyone has additional thoughts or if I'm understanding something wrong.

Planning Error

  1. The planner will first check the argument type based on the function signature, and raise PlanningError if input arg is wrong. (It could be hard to specify the behavior, link to the actual checking code might help...)
  2. For some edge case planning can't do comprehensive arg type check, further argument check can be performed in return_type(), here we should return PlanningError if input argument is wrong
  3. It's possible there are some existing UDF implementations that hacked input argument checking logic into invoke()...which should not be encouraged

Execution Error

  1. Inside invoke(), when there is any expected error (e.g. something like divide by 0), explicitly return ExecutionError, I think it's better not to directly return lower level error messages like ArrowError, which might be hard to interpret sometimes.
  2. Inside invoke() if there are some branches that should be unreachable, throw InternalError in such branches. If such a branch is executed there might be a potential bug, we want to fail fast in this case.
alamb commented 3 months ago

I agree with what @2010YOUY01 says

Some additional comments about "Planning Error" are that for most UDFs that provide a signature DataFusion will ensure that the function is called with one of the valid types.

The UDF can assume the inputs are one of the valid combinations

If it wants to check and validate, it should return an internal_err if the planner provides arguments that don't match one of its proposed signatures

Omega359 commented 3 months ago

I had done some work in aligning the errors in the past, see #9164 for details

edmondop commented 3 months ago

@2010YOUY01 just to confirm, for example in this case https://github.com/apache/datafusion/blob/a721be1b1d863b5b15a7a945c37ec051c449c46f/datafusion/functions-nested/src/resize.rs#L72 it should return a planner error, not an exec_error. Also the check on the length of the arguments should be done in return_type and not in https://github.com/apache/datafusion/blob/a721be1b1d863b5b15a7a945c37ec051c449c46f/datafusion/functions-nested/src/resize.rs#L90

Is this correct?

2010YOUY01 commented 3 months ago

@2010YOUY01 just to confirm, for example in this case

https://github.com/apache/datafusion/blob/a721be1b1d863b5b15a7a945c37ec051c449c46f/datafusion/functions-nested/src/resize.rs#L72

it should return a planner error, not an exec_error. Also the check on the length of the arguments should be done in return_type and not in https://github.com/apache/datafusion/blob/a721be1b1d863b5b15a7a945c37ec051c449c46f/datafusion/functions-nested/src/resize.rs#L90

Is this correct?

I agree.

The first error can be triggered by bad input, so it's a planning error

> select array_resize('foo', 5, 0);
Execution error: Not reachable, data_type should be List, LargeList or FixedSizeList

The second one is redundant, but it's also okay to change it to internal_err!("Not reachable, arguments number check should already be done")

jayzhan211 commented 3 months ago

Ideally argument check should be handled by signature not return_type. Types should be coerced correctly when it reaches return_type.

But before the signature are mature enough, we somehow need to add the check (length, types) inside return_type

jayzhan211 commented 3 months ago

I had file an issue to improve on signature for coercion and length + type checking for functions https://github.com/apache/datafusion/issues/10507

Although I'm not sure does it really help or not since Signature::UserDefined might be good enough to handle complex type signature already so I didn't push forward on this and move on other tasks instead

edmondop commented 3 months ago

@jayzhan211 do you think we should start by adding an helper macro for pattern matching branches of UDF invoke that should never happen?

udf_unreachable!()

that might create an InternalError

DataFusionError::Internal("This branch of the UDF should never be executed since pre-requisites are checked through type_signature and return_type")
jayzhan211 commented 3 months ago

IMO we should improve on signature instead of adding check in different function. This not only add unnecessary overhead and let the code less readable. And ideally we don't need any check in return_type and invoke

edmondop commented 3 months ago

IMO we should improve on signature instead of adding check in different function. This not only add unnecessary overhead and let the code less readable. And ideally we don't need any check in return_type and invoke

Wouldn't inevitably be necessary if you do pattern matching to handle the branches that are impossible? (Because of signature enforcement)

Only if we use refined types it would be possible to avoid I think

jayzhan211 commented 3 months ago

Only if we use refined types it would be possible to avoid I think

I guess at least we could verified it with UserDefined and check against all the edge case in coerce_types