Open pvardanis opened 2 weeks ago
Thanks for reporting this @pvardanis. This might be expected as .values
accesses underlying storage which will start at the beginning of an array.
Could you say what version of pyarrow are you on? Also your array seems nested, which I'm not sure how to reproduce, can you give an example?
@rok I'm using 12.0.1
but can also reproduce the bug in 16.1.0
. This is how to create the array:
array_type = pa.list_(pa.list_(pa.int32(), list_size=2), list_size=2)
array = pa.array(
[[[1, 2], [3, 4]], [[5, 6], [7, 8]]],
type=array_type,
)
@pvardanis I can confirm this is still behavior on main branch. As per docs this is expected (but unintuitive):
values Return the underlying array of values which backs the FixedSizeListArray.
The reason this happens is that getting .values.values
will give you a zero-copy reference to the underlying storage and loose the offset of value whose values you were getting. This has utility in that we avoid a copy. But it is somewhat unintuitive.
.flatten
on the other hand creates a copy from the offset you're getting and returns the correct values.
You can work around this by either calculating your own offset and slice the .values.values
or by doing a array.take(i).values
(which will copy values at i
) or you can use .flatten
which will probably do the same as .take(i)
(I'm not sure about this, maybe flatten
copies the whole array).
@jorisvandenbossche would you say this is approximately right?
@rok
array.take([0]).values.values
seems to be working, however I was hoping for a zero copy solution.
array.slice(offset=0, length=1).values.values
still returns the whole array. Also, I don't want to use .flatten
since it gets rid of Null
values.
@pvardanis By calculating your own offset I mean something like this:
i = 1
array.values.values.slice(offset=i * array.type.list_size, length=array.type.list_size)
<pyarrow.lib.Int32Array object at 0x71226477b2e0>
[
3,
4
]
I think this should be zero-copy.
@rok that works thanks!
Great to hear! Closing the issue.
If there's more reports like this we should maybe consider some kind of .flattened_slice_view
method?
Also, I don't want to use .flatten since it gets rid of Null values.
@pvardanis watch out for the fact that null
lists can be "non-empty" in list
arrays (the values array might contain values for that slot).
And in fixed_size_list
arrays, the values in [i..i + fsl.list_size)
are not necessarily null
themselves. They could be anything.
We clarified the docstring of ListArray.values
a while ago to make it clear it ignores the offset: https://arrow.apache.org/docs/python/generated/pyarrow.ListArray.html#pyarrow.ListArray.values
We should do the same for FixedSizeListArray.values
, as that right now is less explicit: https://arrow.apache.org/docs/python/generated/pyarrow.FixedSizeListArray.html#pyarrow.FixedSizeListArray.values
BTW, an important reason why the .values
for a ListArray returns the non-offsetted child array is to ensure this still matches with the .offsets
of a sliced ListArray (the offsets array itself will be sliced as well, but the actual values in it don't change. So for those offset values to be correct indices into the values
array, that values
array should still be the full non-sliced version).
Of course, that reason is not important for FixedSizeListArray
given there are no offsets in this case, but it also makes sense to have .values
be consistent between both.
I am going to reopen this issue and rephrase the title as a documentation issue to clarify the docstring.
Of course, that reason is not important for
FixedSizeListArray
given there are no offsets in this case...
Technically the offsets exist the same way, but are implicit because they can be computed with (arr.offset + i) * list_size
.
Child arrays of nested layouts are required to keep the prefix-padding on the buffer when an upper level .offset > 0
to make it so that slicing an array, doesn't have to slice recursively. This makes the zero-copy .values
that @pvardanis wants to do, a bit trickier, but on the other hand makes slicing arrays constant-time instead of O(depth_of_the_layout)
@jorisvandenbossche perhaps *ListArray
and FixedSizeListArray
need a sliced_values()
accessor. For list-views sliced_values
would be hard to compute and explain to users though (because offsets are not necessarily sorted in list-views).
We clarified the docstring of
ListArray.values
a while ago to make it clear it ignores the offset: https://arrow.apache.org/docs/python/generated/pyarrow.ListArray.html#pyarrow.ListArray.valuesWe should do the same for
FixedSizeListArray.values
, as that right now is less explicit: https://arrow.apache.org/docs/python/generated/pyarrow.FixedSizeListArray.html#pyarrow.FixedSizeListArray.valuesOriginal report:
Describe the bug, including details regarding any error messages, version, and platform.
I have a
pa.FixedSizeListArray
that looks like this:I'm trying to flatten each row in the
pa.FixedSizeListArray
using.values
. Doingarray[0].values
, returns:but doing
array[0].values.values
returns the initial array flattened, not the one that I want:Weird thing is that
array[0].values.flatten()
returns what it's supposed to:However, I don't want to use
flatten()
since this discardsNull
values. Is this a bug or am I missing something?Component(s)
Python