Open jleibs opened 4 months ago
Implementing a proof-of-concept array_field for a single nested struct with a known type as my own UDF wasn't too horrible.
Would be nice to be able to do this in a generic and recursive fashion though.
FYI @duongcongtoai and @jayzhan211, who might have some pointers / suggestions.
For reference, this is how the current field access code works: https://github.com/alamb/datafusion/blob/ea92ae72f7ec2e941d35aa077c6a39f74523ab63/datafusion/functions/src/core/getfield.rs#L141-L214
Here's the proof-of-concept I wrote to handle this for one level of struct field extraction: https://gist.github.com/jleibs/853a8f2eae2445d5bcdf9198e08ea6a0
I think it would be a plus if it could be extended from the existing get_field function.
I think some of the JSON operators in https://github.com/datafusion-contrib/datafusion-functions-json might allow similar access patterns and could serve as inspiration.
Is your feature request related to a problem or challenge?
We frequently work with tables made up of "batch" data, which is in turn represented via structs.
For example:
I want to be able to restructure this so the inner fields of the struct array become their own columns:
Describe the solution you'd like
I would like to be able to do this from SQL.
For example:
Describe alternatives you've considered
This can be achieved via unnest and array_agg, but doing so is somewhat painful: it requires a preserved row_id for the group-by operation, and it introduces uncertainty as to whether ordering is preserved. It does not appear that DataFusion supports WITH ORDINALITY, which would otherwise be used to guarantee that ordering is maintained. Example:
Additional context
Structurally, the appropriate child array of the struct should be able to be used with the offset array from the list-array and I believe the "right thing" should happen. As such I believe this should generally be able to be implemented as a cheap operation along the lines of a cast.
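To make the layout argument above concrete, here is a minimal pure-Python sketch of the idea. It is a toy model, not DataFusion or Arrow code: the `StructArray` and `ListArray` classes, the `to_rows` helper, and this Python `array_field` function are all hypothetical stand-ins, and the model ignores null bitmaps and sliced arrays, which a real implementation would have to handle.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class StructArray:
    # Toy model: one child column per field, all of equal length.
    children: Dict[str, List[Any]]


@dataclass
class ListArray:
    # Toy model: row i covers values[offsets[i]:offsets[i + 1]].
    offsets: List[int]
    values: Any  # a StructArray, or a plain list of primitives


def array_field(arr: ListArray, name: str) -> ListArray:
    """Turn a list<struct<...>> into a list<field type>.

    The offsets are reused as-is and the struct's child column becomes
    the new values buffer, so no element data is copied or reordered.
    (Assumption: no nulls and no slicing in this simplified model.)
    """
    child = arr.values.children[name]
    return ListArray(offsets=arr.offsets, values=child)


def to_rows(arr: ListArray) -> List[List[Any]]:
    # Materialize a list-of-primitives array as nested Python lists.
    return [arr.values[s:e] for s, e in zip(arr.offsets, arr.offsets[1:])]


# Three rows of struct batches: two elements, zero elements, three elements.
batch = ListArray(
    offsets=[0, 2, 2, 5],
    values=StructArray(
        children={
            "x": [1, 2, 3, 4, 5],
            "y": ["a", "b", "c", "d", "e"],
        }
    ),
)

xs = array_field(batch, "x")
ys = array_field(batch, "y")
```

In this model, `to_rows(xs)` gives `[[1, 2], [], [3, 4, 5]]`, and both results share the original offsets object. That is why the operation is comparable in cost to a cast, and why element order within each row is preserved by construction rather than depending on unnest and array_agg.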