apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
13.88k stars 3.38k forks source link

[Python] Conventions around PyCapsule Interface and choosing Array/Stream export #40648

Open kylebarron opened 3 months ago

kylebarron commented 3 months ago

Describe the usage question you have. Please include as many useful details as possible.

👋 I've been excited about the PyCapsule interface, and have been implementing it in my geoarrow-rust project. Every function call accepts any Arrow PyCapsule interface object, no matter its producer. It's really amazing!

Fundamentally, my question is whether the existence of methods on an object should allow for an inference of its storage type. That is, should it be possible to observe whether a producer object is chunked or not based on whether it exports __arrow_c_array__ or __arrow_c_stream__? I had been expecting yes, as pyarrow implements only the former on Array and RecordBatch and only the latter on ChunkedArray and Table (to my knowledge). But this question came up here, where nanoarrow implements both __arrow_c_array__ and __arrow_c_stream__

I'd argue that it's simpler to only define a single type of export method on a class and allow the consumer to convert to a different representation if they need. This communicates more information about how the existing data is already stored in memory. But in general I think it's really useful if the community is able to agree on a convention here, which will inform whether consumers can expect this invariant to hold or not.

Component(s)

Python

paleolimbot commented 3 months ago

nanoarrow implements both __arrow_c_array__ and __arrow_c_stream__

For reference, the PR implementing the nanoarrow.Array is https://github.com/apache/arrow-nanoarrow/pull/396 . It is basically a ChunkedArray and is currently the only planned user-facing Arrayish thing, although it's all very new (feel free to comment on that PR!). Basically, I found that maintaining both a chunked and a non-chunked pathway in geoarrow-pyarrow resulted in a lot of Python loops over chunks and I wanted to avoid forcing nanoarrow users to maintain two pathways. Many pyarrow methods might give you back a Array or a ChunkedArray; however, many ChunkedArrays only have one chunk. The whole thing is imperfect and a bit of a compromise.

Fundamentally, my question is whether the existence of methods on an object should allow for an inference of its storage type

My take on this is that as long as the object has an unambiguous interpretation as a contiguous array (or might have one, since it might take a loop over something that is not already Arrow data to figure this out), I think it's fine for __arrow_c_array__ to exist. As long as an object has an unambiguous interpretation as zero or more arrays (or might have one), I think __arrow_c_stream__ can exist. I don't see those as mutually exclusive...for me this is like pyarrow.array() returning either a ChunkedArray or an Array: it just doesn't know until it sees the input what type it needs to unambiguously represent it.

For something like an Array or RecordBatch (or something like a numpy array) that is definitely Arrow and is definitely contiguous, I am not sure what the benefit would be for __arrow_c_stream__ to exist and it is probably just confusing if it does.

There are other assumptions that can't be captured by the mere existence of either of those, like exactly how expensive it will be to call any one of those methods. In https://github.com/shapely/shapely/pull/1953 both are fairly expensive because the data are not Arrow yet. For a database driver, it might expensive to consume the stream because the data haven't arrived over the network yet.

The Python buffer protocol has a flags field to handle consumer requests along these lines (like a request for contiguous, rather than strided, memory) that could be used to disambiguate some of these cases if it turns out that disambiguating them is important. It is also careful to note that the existence of the buffer protocol implementation does not imply that attempting to get the buffer will succeed.

For consuming in nanoarrow, the current approach is to use __arrow_c_stream__ whenever possible since this has the fewest constraints (arrays need not be in memory yet, need not be contiguous, might not be fully consumed). Then it falls back on __arrow_c_array__. The entrypoint is nanoarrow.c_array_stream(), which will happily accept either (generates a length-one stream if needed).

kylebarron commented 3 months ago

Being able to infer the input structure also significantly helps static typing. For example, I have type hints that I'm writing for geoarrow-rust that include:

@overload
def centroid(input: ArrowArrayExportable) -> PointArray: ...
@overload
def centroid(input: ArrowStreamExportable) -> ChunkedPointArray: ...
def centroid(
    input: ArrowArrayExportable | ArrowStreamExportable,
) -> PointArray | ChunkedPointArray: ...

I'm not sure which overload a type checker would pick if the input object had both dunder methods. I suppose it would always return the union. But being able to use structural types in this way is quite useful for static type checking and IDE autocompletion, which are really sore spots right now with pyarrow.

kylebarron commented 3 months ago

for me this is like pyarrow.array() returning either a ChunkedArray or an Array: it just doesn't know until it sees the input what type it needs to unambiguously represent it.

Does pyarrow.array ever return a ChunkedArray? I tried to pass in a list of lists and it inferred a ListArray. I thought pyarrow.array only ever returned an Array and pyarrow.chunked_array only ever returned a ChunkedArray?

For something like an Array or RecordBatch (or something like a numpy array) that is definitely Arrow and is definitely contiguous, I am not sure what the benefit would be for __arrow_c_stream__ to exist and it is probably just confusing if it does.

So your argument is that Array should never have __arrow_c_stream__, but that ChunkedArray should have both __arrow_c_array__ and __arrow_c_stream__?

paleolimbot commented 3 months ago

Does pyarrow.array ever return a ChunkedArray?

import pyarrow as pa

type(pa.array(["a" * 2 ** 20 for _ in range(2**10)]))
#> pyarrow.lib.StringArray
type(pa.array(["a" * 2 ** 20 for _ in range(2**11)]))
#> pyarrow.lib.ChunkedArray

This is also true of the __arrow_array__ protocol.

I'm not sure which overload a type checker would pick if the input object had both dunder methods.

Would the typing hints be significantly different for a ChunkedPointArray vs a PointArray?

So your argument is that Array should never have arrow_c_stream, but that ChunkedArray should have both arrow_c_array and arrow_c_stream?

Maybe could is more like it. pyarrow has an Array class for a specifically contiguous array...nanoarrow doesn't at the moment (at least in a user-facing nicely typed sort of way).