apache / arrow-nanoarrow

Helpers for Arrow C Data & Arrow C Stream interfaces
https://arrow.apache.org/nanoarrow
Apache License 2.0

feat(python): Iterate over array buffers #433

Closed: paleolimbot closed this 2 months ago

paleolimbot commented 2 months ago

The idea with this change is to support efficient buffer access for chunked/streaming input (e.g., to build a numpy array). The efficient implementation is compact, but I am not sure it is easy to guess for anybody not familiar with nanoarrow internals:

with c_array_stream(obj, schema) as stream:
    for array in stream:
        view = array.view()

I'm not sure that iter_chunk_data() is the best name here, but one would use it like:

import nanoarrow as na

array = na.Array([1, 2, 3], na.int32())

for view in array.iter_chunk_data():
    print(view.offset, view.length, list(view.buffers))
#> 0 3 [nanoarrow.c_lib.CBufferView(bool[0 b] ), nanoarrow.c_lib.CBufferView(int32[12 b] 1 2 3)]

This would replace iter_buffers(), which is a little dangerous to use (since one might assume the whole buffer represents the array, whereas the offset really needs to be applied everywhere a buffer is accessed). It also cleans up some of the ArrayViewIterator terminology (since an earlier version of this change used the ArrayViewIterator instead of the simpler approach it now uses).
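
To make the offset concern concrete, here is a minimal sketch (not part of the PR itself) using the iter_chunk_data(), buffer(), offset, and length accessors shown above; the point is that the data buffer may describe more elements than the logical chunk, so any direct read has to slice by view.offset and view.length rather than taking the buffer wholesale:

import numpy as np
import nanoarrow as na

array = na.Array([1, 2, 3], na.int32())

for view in array.iter_chunk_data():
    # buffer 1 is the data buffer for a primitive array; it is not safe to
    # assume it starts at the chunk's first logical element
    data = np.array(view.buffer(1), copy=False)
    values = data[view.offset:view.offset + view.length]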

The benchmark below is engineered to find the point where this iterator would be slower than pa.ChunkedArray.to_numpy() (for a million doubles in this specific example, PyArrow becomes faster somewhere between 100 and 1000 chunks).

import nanoarrow as na
from nanoarrow import c_lib
import pyarrow as pa
import numpy as np

n = int(1e6)
chunk_size = int(1e4)
num_chunks = n // chunk_size
n = chunk_size * num_chunks

chunks = [na.c_array(np.random.random(chunk_size)) for i in range(num_chunks)]
array = na.Array(c_lib.CArrayStream.from_array_list(chunks, na.c_schema(na.float64())))

def make():
    out = np.empty(len(array), dtype=np.float64)

    cursor = 0
    for view in array.iter_chunk_data():
        offset = view.offset
        length = view.length
        # buffer 1 is the data buffer; slice by offset/length before copying
        data = np.array(view.buffer(1), copy=False)[offset:(offset + length)]
        out[cursor:(cursor + length)] = np.array(data, copy=False)
        cursor += length

    return out

%timeit make()
#> 749 µs ± 37.8 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

chunked = pa.chunked_array([pa.array(item) for item in chunks])
%timeit chunked.to_numpy()
#> 2.07 ms ± 17.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# With 1000 chunks of size 1000, the numbers would be:
# make() using iter_chunk_data()
#> 3.02 ms ± 238 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# chunked.to_numpy()
#> 2.07 ms ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

np.testing.assert_equal(make(), chunked.to_numpy())
jorisvandenbossche commented 2 months ago

It's not super clear what the motivation for this is (it's also a bit hard to follow the differences between the different methods you are adding and what they actually return).

Why would a user need this? You can already iterate over the chunks of an Array, right?

for chunk in array.iter_chunks():
    buffers = chunk.buffers
    # or na.c_array(chunk) to get lower level access
    ...

Or is the problem that you don't easily have access to the offset/length of a single CArray? (Those are the things being used in your example above.)

paleolimbot commented 2 months ago

This has been updated significantly from the earlier version!

Why would a user need this?

The motivation is a straightforward path to column-wise conversion (e.g., numpy, pandas). For small numbers of chunks this seems to be faster than pyarrow (although I'm sure we could fix that). With some knowledge of the internals you could also do:

for item in na.c_array_stream(something):
    view = item.view()

...but that creates a new CArrayView each time, which we technically don't need to do (my quick check suggested this is about a 10% overhead, which might not matter since the Python per-chunk overhead is really the issue here). We could also just have iter_array_view() be what I pasted above.

paleolimbot commented 2 months ago

I updated a few more things here!

You can already iterate over the chunks of an Array, right?

Yes, but array.iter_chunks() is slow, and na.c_array_stream(array) + item.view() is not particularly obvious.

but that creates a new CArrayView each time and we technically don't need to

I reverted that part: if somebody did list(iter_array_views()), the result would be very confusing!
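
To spell out why that would be confusing, here is a minimal self-contained sketch (plain Python, not nanoarrow code) of an iterator that mutates and re-yields a single view object; materializing it with list() leaves you with N references to one object that only reflects the final chunk:

class FakeView:
    def __init__(self):
        self.length = 0

def iter_reused_view(chunk_lengths):
    # Mimics the reverted approach: one view object, mutated in place
    view = FakeView()
    for length in chunk_lengths:
        view.length = length
        yield view  # every iteration yields the same object

views = list(iter_reused_view([3, 5, 2]))
print([v.length for v in views])
#> [2, 2, 2]  (every element aliases the last chunk's state)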