Closed paleolimbot closed 2 months ago
It's not super clear what the motivation for this is (it's also a bit hard to follow the differences between the different methods you are adding and what they actually return).
Why would a user need this? You can already iterate over the chunks of an Array, right?
```python
for chunk in array.iter_chunks():
    buffers = chunk.buffers
    # or na.c_array(chunk) to get lower level access
    ...
```
Or is the problem that you don't easily have access to the offset/length of a single `CArray`? (Those are the things being used in your example above.)
This has been updated significantly from the earlier version!
> Why would a user need this?
The motivation is a straightforward path to column-wise conversion (e.g., numpy, pandas). For small numbers of chunks this seems to be faster than pyarrow (although I'm sure we could fix that). With some knowledge of the internals you could also do:
```python
for item in na.c_array_view(something):
    view = item.view()
```

...but that creates a new `CArrayView` each time, and we technically don't need to (my quick check suggested this was about a 10% overhead, which might not matter since the Python per-chunk overhead is really the issue here). We could also just have `iter_array_view()` be what I pasted above.
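For what the column-wise conversion path looks like end to end, here is a rough pure-Python stand-in (the `chunks_to_contiguous` helper and the `(buffer, offset, length)` tuples are invented for this sketch; nanoarrow's actual API is not used):

```python
# Illustrative sketch only: a plain-Python stand-in for chunk-wise
# conversion. The chunk tuples and helper name are hypothetical.
from array import array

def chunks_to_contiguous(chunks):
    """Concatenate per-chunk float64 buffers into one contiguous array."""
    out = array("d")
    for buf, offset, length in chunks:
        # Slice with offset/length: the buffer may hold more values
        # than the chunk logically contains.
        out.extend(buf[offset:offset + length])
    return out

chunks = [
    (array("d", [0.0, 1.0, 2.0, 3.0]), 1, 2),  # logical values: 1.0, 2.0
    (array("d", [4.0, 5.0]), 0, 2),            # logical values: 4.0, 5.0
]
result = chunks_to_contiguous(chunks)
# result -> array('d', [1.0, 2.0, 4.0, 5.0])
```

The per-iteration Python cost of this loop is what dominates once chunks get small, which is why avoiding an extra object allocation per chunk matters at all.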
I updated a few more things here!
> You can already iterate over the chunks of an Array, right?
Yes, but `array.iter_chunks()` is slow, and `na.c_array_view(array)` + `item.view()` is not particularly obvious.
> but that creates a new CArrayView each time and we technically don't need to
I reverted that part...if somebody did `list(iter_array_views())` it would result in a very confusing result!
The idea with this change is to support efficient buffer access for chunked/streaming input (e.g., to make a numpy array). The efficient implementation is compact, but I am not sure it is easy to guess for anybody not familiar with nanoarrow internals:
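A rough stand-in for the kind of per-chunk record such an iterator might yield (all names here are invented for illustration; this is not nanoarrow's API):

```python
# Hypothetical sketch: a per-chunk record that carries offset/length
# alongside the buffers, so a caller can't forget to apply them.
from array import array
from collections import namedtuple

ChunkData = namedtuple("ChunkData", ["offset", "length", "buffers"])

def iter_chunk_data(stream):
    """Yield one ChunkData per chunk of a (mocked) stream."""
    for chunk in stream:
        yield ChunkData(chunk["offset"], chunk["length"], tuple(chunk["buffers"]))

# Mocked stream of two chunks (dicts stand in for C-level chunk objects)
stream = [
    {"offset": 2, "length": 3, "buffers": [array("d", range(5))]},
    {"offset": 0, "length": 2, "buffers": [array("d", [9.0, 8.0])]},
]

total_length = sum(item.length for item in iter_chunk_data(stream))
# total_length -> 5
```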
I'm not sure that `iter_chunk_data()` is the best name here, but one would use it like:

This would replace `iter_buffers()`, which is a little dangerous to use (since one might assume the whole buffer represents the array, whereas we really need the offset everywhere one might access a buffer). It also cleans up some of the `ArrayViewIterator` terminology (since an earlier version of this used the `ArrayViewIterator` instead of the simpler approach it now uses).

The benchmark below is engineered to find the point where this iterator would be slower than `pa.ChunkedArray.to_numpy()` (for a million doubles in this specific example, PyArrow becomes faster between 100 and 1000 chunks).
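To make the offset danger concrete, here is a plain-Python sketch (the values and variable names are invented; no nanoarrow API is used) of why a buffer must be sliced with the array's offset and length rather than used whole:

```python
from array import array

# An Arrow-style buffer can hold more values than the logical array:
# e.g. a sliced array keeps its parent's buffer and records an offset.
buffer = array("d", [10.0, 11.0, 12.0, 13.0, 14.0])
offset, length = 2, 2  # the logical array here is [12.0, 13.0]

# Consuming the whole buffer (what a naive iter_buffers() user might do)
wrong = list(buffer)

# Applying offset/length gives the values the array actually contains
right = list(buffer[offset:offset + length])
# wrong -> [10.0, 11.0, 12.0, 13.0, 14.0]; right -> [12.0, 13.0]
```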