apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Access children `DataType` or return-type in `ScalarUDFImpl::invoke` #12819

Open joseph-isaacs opened 1 week ago

joseph-isaacs commented 1 week ago

Is your feature request related to a problem or challenge?

I am trying to create a scalar UDF, pack, which operates on struct arrays. It packs many arrays into a single struct array, giving each one a distinct field name:

pack(("a", arr1), ("b", arr2), ...) -> struct([("a", arr1.data_type), ("b", arr2.data_type), ...])

This has a data type dependent on the input type and nullability. In the method ScalarUDFImpl::invoke I want to return a struct array in which each field has the data type and nullability of the corresponding input. However, the invoke function only gives the data type of each array, not the nullability from the record batch or from intermediate child expressions.

I already return this type information from return_type_from_exprs; I just need to access it in the stateless scalar UDF implementation.
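To make the intended return type concrete, here is a minimal sketch of how it could be derived from the inputs. The `DataType` and `Field` types below are simplified stand-ins written for this example (they are not the arrow or DataFusion types), and `pack_return_type` is a hypothetical helper name.

```rust
// Simplified stand-ins for illustration only; NOT the arrow/DataFusion types.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int64,
    Utf8,
    Struct(Vec<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

/// Hypothetical helper: build the struct type that `pack` would return.
/// Each (name, data type, nullable) input becomes one field that carries
/// over the input's data type and nullability unchanged.
fn pack_return_type(inputs: &[(&str, DataType, bool)]) -> DataType {
    let fields = inputs
        .iter()
        .map(|(name, data_type, nullable)| Field {
            name: name.to_string(),
            data_type: data_type.clone(),
            nullable: *nullable,
        })
        .collect();
    DataType::Struct(fields)
}
```

The point of the sketch is that nullability is part of the result type, so the array built in invoke has to agree with it field by field.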

Describe the solution you'd like

I would like to add a new ScalarUDFImpl::invoke_with_data_type (or invoke_with_return_type) method which is given both the evaluated child arrays (as before) and either the type previously returned from return_type_from_exprs or the arguments already passed to return_type_from_exprs, which invoke could then re-evaluate. I am open to either, though I guess the former is more performant.
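A rough sketch of the shape I have in mind, using toy types rather than the real DataFusion ones. Everything here is an assumption made for illustration: the `ScalarUdf` trait, `Field`, `Column`, `StructColumn`, `Pack`, and `FirstArg` are hypothetical stand-ins, and the real signatures would differ. The new method has a default body that delegates to invoke, which is what would keep the change non-breaking.

```rust
// Hypothetical stand-ins for illustration; not the real DataFusion or arrow types.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    nullable: bool,
}

/// Evaluated argument: a column of optional integers.
type Column = Vec<Option<i64>>;

/// Toy struct array: a schema plus one child column per field.
#[derive(Debug, Clone, PartialEq)]
struct StructColumn {
    fields: Vec<Field>,
    children: Vec<Column>,
}

trait ScalarUdf {
    /// Existing entry point: sees only the evaluated arguments.
    fn invoke(&self, args: &[Column]) -> Result<StructColumn, String>;

    /// Proposed addition. The default ignores the return type and delegates,
    /// so implementations that only define `invoke` keep compiling.
    fn invoke_with_return_type(
        &self,
        args: &[Column],
        _return_type: &[Field],
    ) -> Result<StructColumn, String> {
        self.invoke(args)
    }
}

/// An existing UDF that only implements `invoke`; it needs no changes.
struct FirstArg;

impl ScalarUdf for FirstArg {
    fn invoke(&self, args: &[Column]) -> Result<StructColumn, String> {
        let first = args
            .first()
            .cloned()
            .ok_or_else(|| "no arguments".to_string())?;
        Ok(StructColumn {
            fields: vec![Field { name: "first".to_string(), nullable: true }],
            children: vec![first],
        })
    }
}

/// `pack` overrides the new method to reuse the previously computed fields.
struct Pack {
    names: Vec<String>,
}

impl ScalarUdf for Pack {
    fn invoke(&self, args: &[Column]) -> Result<StructColumn, String> {
        // Without type information, every field has to be assumed nullable.
        let fields: Vec<Field> = self
            .names
            .iter()
            .map(|name| Field { name: name.clone(), nullable: true })
            .collect();
        self.invoke_with_return_type(args, &fields)
    }

    fn invoke_with_return_type(
        &self,
        args: &[Column],
        return_type: &[Field],
    ) -> Result<StructColumn, String> {
        if return_type.len() != args.len() {
            return Err(format!(
                "expected {} arguments, got {}",
                return_type.len(),
                args.len()
            ));
        }
        Ok(StructColumn {
            fields: return_type.to_vec(),
            children: args.to_vec(),
        })
    }
}
```

In this sketch the caller would always go through invoke_with_return_type, so FirstArg behaves exactly as before while Pack produces a schema that matches the declared return type.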

Describe alternatives you've considered

No response

Additional context

I believe this would be a small, non-breaking change that I am happy to contribute.

Any ideas?

findepi commented 1 week ago

You may want to implement ScalarUDFImpl::simplify, which is given nullability info. From there you'd return a new ScalarUDFImpl instance with the nullability information stored in a field. When the information is already stored, a subsequent simplify call (if any) should return the original expression. Let me know if it works for your case.
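Roughly, the specialization pattern could look like the sketch below. It uses toy types only: `PackUdf`, `Simplified`, and this `simplify` signature are hypothetical stand-ins, not the real DataFusion API, and a real implementation would work on expressions rather than on the UDF value directly.

```rust
/// Toy UDF state (hypothetical): `None` until specialized, then one
/// nullability flag per argument.
#[derive(Debug, Clone, PartialEq)]
struct PackUdf {
    nullability: Option<Vec<bool>>,
}

/// Hypothetical result of a simplify step; a stand-in for whatever the
/// real simplifier returns.
#[derive(Debug, PartialEq)]
enum Simplified {
    /// Nothing changed; the caller keeps the original.
    Original(PackUdf),
    /// A new, specialized instance replaces the original.
    Rewritten(PackUdf),
}

impl PackUdf {
    /// Store the argument nullability on first use. If it is already
    /// stored, return the original so that repeated calls are no-ops.
    fn simplify(self, arg_nullability: &[bool]) -> Simplified {
        if self.nullability.is_some() {
            Simplified::Original(self)
        } else {
            Simplified::Rewritten(PackUdf {
                nullability: Some(arg_nullability.to_vec()),
            })
        }
    }
}
```

Returning the original on the second call matters because a simplifier may run to a fixed point, and an always-rewriting UDF would never settle.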

At some point we could probably rename simplify to specialize cc @alamb

alamb commented 1 week ago

The suggestion for using simplify is a good one 👍

joseph-isaacs commented 1 week ago

Hey, thanks for your ideas.

This means that the ExprSimplifier would change the return type of the pack UDF.

Say, for instance, the return type before simplification is ("a", int64 nullable), ...; after simplification it would be ("a", int64 non nullable), ....

This would mean that simplification is required to run for the UDF to behave correctly, since the non-specialized invoke doesn't know whether a specific field is nullable. It would therefore either return a RecordBatch whose schema differs from the declared return type, or always return every field as nullable (not ideal).

joseph-isaacs commented 1 week ago

This would be the (non-breaking) change I am after