apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.54k stars 3.54k forks source link

[C++] Allow FieldPath to work with ListElement #20489

Closed asfimport closed 1 year ago

asfimport commented 2 years ago

FieldRef::FromDotPath can parse a single list element field. ie. {}'path.to.list[0]{`}but does not work in practice. Failing with:

_structfield: cannot subscript field of type list<....>

Being able to add a slice or multiple list elements is not within the scope of this issue. 

Reporter: Miles Granger / @milesgranger Assignee: Miles Granger / @milesgranger

PRs and other links:

Note: This issue was originally created as ARROW-18265. Please see the migration documentation for further details.

asfimport commented 2 years ago

Weston Pace / @westonpace: Note that the list_element compute function exists which should be able to do this. Unlike struct_field however, this function is binary instead of unary and takes the index as the second argument. In theory, that should be fine, as you can just pass a scalar / constant value as the second argument.

asfimport commented 2 years ago

Weston Pace / @westonpace: Is this just for fixed size lists? Or also for variable lists? If for variable lists then what should happen when the list isn't long enough to satisfy the index? I assume this would return null?

asfimport commented 2 years ago

Miles Granger / @milesgranger: Thanks @westonpace, that's true, upon initial implementation I am using a call to list_element so good to know I'm on the right path. :)

I'm aiming to support both variable and fixed size lists, and agree the case of a list element not being long enough, ought to return a null.

After updating the struct_field kernel to deal with lists, it "works" for a path specifying a 0 index, ie '.path.to.list[0].foo' but will fail if it's '.path.to.list[1].foo' because FieldRef::FindOne doesn't discover anything. Suspecting I need to investigate the FieldRef::FindAll visitor implementation, but not 100% sure right now.

I'll be sure to ping you for a review once I have something worth looking at. :-)

asfimport commented 1 year ago

Joris Van den Bossche / @jorisvandenbossche: @westonpace one aspect to explicitly call out: are we OK with the fact that an integer index element in a FieldPath means a "list element" selection? (or would you prefer to implement this differently?)

Because if we decide that an index means "list element selection" for ListTypes, that also means we can't use that to "select" the full (single) child field of a ListType. Mentioning this because you used field_ref(0) / field_ref("item") example as one option in ARROW-17820 how to reference the child field (for expresssion you want to apply an element-wise kernel on the child array of a ListArray).

And I don't think we can use field_ref(0) for both things.

asfimport commented 1 year ago

Weston Pace / @westonpace: Hmm, I am not sure I understand. I think field_ref(INTEGER) and field_ref(STRING) have always done very different things.


field(3) # Grab the 4th top-level field in the bound schema
field('x', 3) # Illegal, all args must be string if repeating
field('x', '3') # Grab the field named "x" in the current schema and then grab the field named "3"
field('x[3]') # Grab the 3rd item in the (list-type) array named "x"
field('x["3"]') # Grab the child array named "x" in the (struct-type) array named "x".
field(3, 3) # Not implemented [but could be], grab the 4th child field
             # of the 4th top-level (struct-type) field in the bound schema

Mentioning this because you used field_ref(0) / field_ref("item") example as one option in ARROW-17820 how to reference the child field (for expresssion you want to apply an element-wise kernel on the child array of a ListArray).

I think you are referring to this:

Perhaps the "map function" for List could be an expression bound to a schema of "{item: T}" (e.g. so you could do field_ref(0) or field_ref("item")).

I would expect it to look something like:


>>> full_schema = pa.schema([pa.field('myStruct', pa.struct([pa.field('myList', pa.int32())]))])
>>> full_schema
myStruct: struct<myList: int32>
  child 0, myList: int32

>>> selectMyList = field('myStruct', 'myList')
# This would also work in a theoretical world where we supported multi-integer refs
# selectMyList = field(0, 0)

# There are two expressions here.  The first expression, field('myStruct', 'myList') is
# bound to full_schema and would select the list array
#
# The second expression pc.field(0) is bound to pa.schema([pa.field('item', pa.int32())])
# and selects the 0-th field in the schema which is the int32 field.
>>> pc.applyMap(selectMyList, pc.field(0) * 2)
asfimport commented 1 year ago

Joris Van den Bossche / @jorisvandenbossche:

I think you are referring to this:

Indeed. But so in your code example above, the pc.field(0)* 2 is used to refer to the flat list values (the first child field). While the proposal in this issue is to have pc.field(0) mean to select the first element of each list value (eg in projections).
That seems quite a different meaning, and so I am not sure we should use the same API for both use cases.

asfimport commented 1 year ago

Weston Pace / @westonpace: I agree pc.field(0) should not mean to select the first element of each list value (eg in projections). I thought the proposal here was only for the form square-bracket -> numbers -> square bracket (e.g. [0])

asfimport commented 1 year ago

Joris Van den Bossche / @jorisvandenbossche: Yes, but the square-bracket form is basically the same as a single field path index, that is how it gets parsed (the top post could have been more explicit about this, instead of only mentioning FromDotPath, but also the resulting FieldPath).

Using a small workaround through StructFieldOptions to show this in python:


>>> pc.StructFieldOptions(".x[0]")
StructFieldOptions(field_ref=FieldRef.Nested(FieldRef.Name(x) FieldRef.FieldPath(0)))
>>> pc.StructFieldOptions(["x", 0])
StructFieldOptions(field_ref=FieldRef.Nested(FieldRef.Name(x) FieldRef.FieldPath(0)))

So those two give currently the same result. Essentially, at the moment a "[0]" element in a path gets interpreted as pc.field(0).

Given the above, AFAIU, we either agree that field_ref(0) for a list type means selecting the first element of each list (and thus not for selecting the flat values), or we either have to introduce a new concept in FieldRef/FieldPath to represent a list element selection (and thus also change how a "[0]" in a string path gets parsed to use this new object).

asfimport commented 1 year ago

Weston Pace / @westonpace: Ok. I think I understand now. The problem isn't how the user expresses it but more how we construct a FieldPath / FieldRef for one of these pseudo-refs. I don't think we rely on the child referencing behavior for variable length lists anywhere today so it's probably ok in that sense.

However, it would be a bit tricky if FindOne/FindAll ended up calling list_element (a compute function). FindOne/FindAll currently is in the core module and not the compute module. That being said we don't really distinguish between the two and I think have all but accepted they are co-mingled.

asfimport commented 1 year ago

Joris Van den Bossche / @jorisvandenbossche:

However, it would be a bit tricky if FindOne/FindAll ended up calling list_element (a compute function). FindOne/FindAll currently is in the core module and not the compute module.

That's a good point. So, for doing FindOne/FindAll itself, we won't need "list_element" kernel, since this returns a path, and for that we don't need to do the actual selection from the array. There is however a set of FieldPath::Get(..) methods that you can call on the resulting FieldPath. For "getting" the path from a schema or type, that's still OK (no need for the actual kernel), but there is a signature for Get() to apply the path on a record batch or array. For that specific case, one would actually need to use the kernel. Now, I am not sure we actually use this variant internally, though (outside of a few tests).

bkietz commented 1 year ago

FieldPath::Get(list_array) would return the child data for the list array, IE the undelimited list values. This is disallowed in scalar expressions because it results in a column whose length is entirely unrelated to the lengths of sibling columns. Consider projecting .s, .l[0] from Table.from_pydict({"s": ["a", "b"], "l": [[1,2,3], [4,5,6]]}); we can't do that just as we can't do Table.from_pydict({".s": ["a", "b"], ".l[0]": [1,2,3,4,5,6]}). It might be desirable to let pyarrow's field reference mechanism be more flexible (as in substrait) and implicitly replace a path like .l[0:1] with a call to list_slice or similar. However that should be a python API decision and should not propagate to the C++ library, where it's important that using FieldPath::Get be consistently performant