apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
13.7k stars 3.34k forks source link

[Python][Doc] Clarify docstring of FixedSizeListArray.values that it ignores the offset #41672

Open pvardanis opened 2 weeks ago

pvardanis commented 2 weeks ago

We clarified the docstring of ListArray.values a while ago to make it clear it ignores the offset: https://arrow.apache.org/docs/python/generated/pyarrow.ListArray.html#pyarrow.ListArray.values

We should do the same for FixedSizeListArray.values, as that right now is less explicit: https://arrow.apache.org/docs/python/generated/pyarrow.FixedSizeListArray.html#pyarrow.FixedSizeListArray.values


Original report:

Describe the bug, including details regarding any error messages, version, and platform.

I have a pa.FixedSizeListArray that looks like this:

<pyarrow.lib.FixedSizeListArray object at 0x13f9b45e0>
[
  [
    [
      1,
      2
    ],
    [
      3,
      4
    ]
  ],
  [
    [
      5,
      6
    ],
    [
      7,
      8
    ]
  ]
]

I'm trying to flatten each row in the pa.FixedSizeListArray using .values. Doing array[0].values, returns:

<pyarrow.lib.FixedSizeListArray object at 0x13f8e7fa0>
[
  [
    1,
    2
  ],
  [
    3,
    4
  ]
]

but doing array[0].values.values returns the initial array flattened, not the one that I want:

[
  1,
  2,
  3,
  4,
  5,
  6,
  7,
  8
]

Weird thing is that array[0].values.flatten() returns what it's supposed to:

[
  1,
  2,
  3,
  4
]

However, I don't want to use flatten() since this discards Null values. Is this a bug or am I missing something?

Component(s)

Python

rok commented 2 weeks ago

Thanks for reporting this @pvardanis. This might be expected as .values accesses underlying storage which will start at the beginning of an array. Could you say what version of pyarrow are you on? Also your array seems nested, which I'm not sure how to reproduce, can you give an example?

pvardanis commented 2 weeks ago

@rok I'm using 12.0.1 but can also reproduce the bug in 16.1.0. This is how to create the array:

array_type = pa.list_(pa.list_(pa.int32(), list_size=2), list_size=2)
array = pa.array(
        [[[1, 2], [3, 4]], [[5, 6], [7, 8]]],
        type=array_type,
    )
rok commented 2 weeks ago

@pvardanis I can confirm this is still behavior on main branch. As per docs this is expected (but unintuitive):

values Return the underlying array of values which backs the FixedSizeListArray.

The reason this happens is that getting .values.values will give you a zero-copy reference to the underlying storage and loose the offset of value whose values you were getting. This has utility in that we avoid a copy. But it is somewhat unintuitive. .flatten on the other hand creates a copy from the offset you're getting and returns the correct values. You can work around this by either calculating your own offset and slice the .values.values or by doing a array.take(i).values (which will copy values at i) or you can use .flatten which will probably do the same as .take(i) (I'm not sure about this, maybe flatten copies the whole array).

@jorisvandenbossche would you say this is approximately right?

pvardanis commented 2 weeks ago

@rok

array.take([0]).values.values

seems to be working, however I was hoping for a zero copy solution.

array.slice(offset=0, length=1).values.values

still returns the whole array. Also, I don't want to use .flatten since it gets rid of Null values.

rok commented 2 weeks ago

@pvardanis By calculating your own offset I mean something like this:

i = 1
array.values.values.slice(offset=i * array.type.list_size, length=array.type.list_size)
<pyarrow.lib.Int32Array object at 0x71226477b2e0>
[
  3,
  4
]

I think this should be zero-copy.

pvardanis commented 2 weeks ago

@rok that works thanks!

rok commented 2 weeks ago

Great to hear! Closing the issue. If there's more reports like this we should maybe consider some kind of .flattened_slice_view method?

felipecrv commented 2 weeks ago

Also, I don't want to use .flatten since it gets rid of Null values.

@pvardanis watch out for the fact that null lists can be "non-empty" in list arrays (the values array might contain values for that slot).

And in fixed_size_list arrays, the values in [i..i + fsl.list_size) are not necessarily null themselves. They could be anything.

jorisvandenbossche commented 2 weeks ago

We clarified the docstring of ListArray.values a while ago to make it clear it ignores the offset: https://arrow.apache.org/docs/python/generated/pyarrow.ListArray.html#pyarrow.ListArray.values

We should do the same for FixedSizeListArray.values, as that right now is less explicit: https://arrow.apache.org/docs/python/generated/pyarrow.FixedSizeListArray.html#pyarrow.FixedSizeListArray.values

BTW, an important reason why the .values for a ListArray returns the non-offsetted child array is to ensure this still matches with the .offsets of a sliced ListArray (the offsets array itself will be sliced as well, but the actual values in it don't change. So for those offset values to be correct indices into the values array, that values array should still be the full non-sliced version).

Of course, that reason is not important for FixedSizeListArray given there are no offsets in this case, but it also makes sense to have .values be consistent between both.

jorisvandenbossche commented 2 weeks ago

I am going to reopen this issue and rephrase the title as a documentation issue to clarify the docstring.

felipecrv commented 2 weeks ago

Of course, that reason is not important for FixedSizeListArray given there are no offsets in this case...

Technically the offsets exist the same way, but are implicit because they can be computed with (arr.offset + i) * list_size.

Child arrays of nested layouts are required to keep the prefix-padding on the buffer when an upper level .offset > 0 to make it so that slicing an array, doesn't have to slice recursively. This makes the zero-copy .values that @pvardanis wants to do, a bit trickier, but on the other hand makes slicing arrays constant-time instead of O(depth_of_the_layout)

felipecrv commented 2 weeks ago

@jorisvandenbossche perhaps *ListArray and FixedSizeListArray need a sliced_values() accessor. For list-views sliced_values would be hard to compute and explain to users though (because offsets are not necessarily sorted in list-views).