apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.26k stars 1.18k forks source link

Expose inner field of struct within list-array #11419

Open jleibs opened 4 months ago

jleibs commented 4 months ago

Is your feature request related to a problem or challenge?

We frequently work with tables made up of "batch" data, which is in turn represented via structs.

For example:

+--------------------------------------+
| position2d                           |
+--------------------------------------+
| [{x: 1.0, y: 2.0}, {x: 3.0, y: 4.0}] |
| [{x: 5.0, y: 6.0}]                   |
+--------------------------------------+

I want to be able to restructure this so the inner fields of the struct array become their own columns:

+------------+------------+
| X          | Y          |
+------------+------------+
| [1.0, 3.0] | [2.0, 4.0] |
| [5.0]      | [6.0]      |
+------------+------------+

Describe the solution you'd like

I would like to be able to do this from SQL.

For example:

SELECT array_field(position2d, "x"), array_field(position2d, "Y")  FROM example;

Describe alternatives you've considered

This can be achieved via unnest and array_agg, but is somewhat painful to do so, requires the existence of a preserved row_id for group_by operation, and introduces uncertainty as to preservation of ordering. It does not appear that datafusion supports WITH ORDINALITY which would be used to orderwise guarantee ordering is maintained.

Example:

CREATE TABLE example AS
SELECT * FROM (
    VALUES
    (1, ARRAY[NAMED_STRUCT('x', 1.0, 'y', 2.0),
              NAMED_STRUCT('x', 3.0, 'y', 4.0)]),
    (2, ARRAY[NAMED_STRUCT('x', 5.0, 'y', 6.0)])
) as example(id, position2d);

select array_agg(p.x) as x, array_agg(p.y) as y
  FROM (select id, unnest(position2d) as p from example)
  GROUP BY id ORDER BY id;

Additional context

Structurally, the appropriate child array of the struct should be able to be used with the offset array from the list-array and I believe the "right thing" should happen. As such I believe this should generally be able to be implemented as a cheap operation along the lines of a cast.

jleibs commented 3 months ago

Implementing a proof-of-concept array_field for a single nested struct with a known type as my own UDF wasn't too horrible.

Would be nice to be able to do this in a generic and recursive fashion though.

alamb commented 3 months ago

DYI @duongcongtoai and @jayzhan211 who might have some pointers / suggestions

alamb commented 3 months ago

https://github.com/alamb/datafusion/blob/ea92ae72f7ec2e941d35aa077c6a39f74523ab63/datafusion/functions/src/core/getfield.rs#L141-L214 is how the current field access code works

jleibs commented 3 months ago

Here's the proof-of-concept I wrote to handle this for one level of struct field extraction: https://gist.github.com/jleibs/853a8f2eae2445d5bcdf9198e08ea6a0

jayzhan211 commented 3 months ago

I think it would be a plus If it could be extended from existing get_field function.

alamb commented 3 months ago

I think some of the json operators in https://github.com/datafusion-contrib/datafusion-functions-json might allow similar access patterns and could serve as inpiration