Open kylebarron opened 7 months ago
> nanoarrow implements both `__arrow_c_array__` and `__arrow_c_stream__`
For reference, the PR implementing the `nanoarrow.Array` is https://github.com/apache/arrow-nanoarrow/pull/396 . It is basically a ChunkedArray and is currently the only planned user-facing Array-ish thing, although it's all very new (feel free to comment on that PR!). Basically, I found that maintaining both a chunked and a non-chunked pathway in geoarrow-pyarrow resulted in a lot of Python loops over chunks, and I wanted to avoid forcing nanoarrow users to maintain two pathways. Many pyarrow methods might give you back an `Array` or a `ChunkedArray`; however, many `ChunkedArray`s only have one chunk. The whole thing is imperfect and a bit of a compromise.
> Fundamentally, my question is whether the existence of methods on an object should allow for an inference of its storage type
My take on this is that as long as the object has an unambiguous interpretation as a contiguous array (or might have one, since it might take a loop over something that is not already Arrow data to figure this out), I think it's fine for `__arrow_c_array__` to exist. As long as an object has an unambiguous interpretation as zero or more arrays (or might have one), I think `__arrow_c_stream__` can exist. I don't see those as mutually exclusive... for me this is like `pyarrow.array()` returning either a `ChunkedArray` or an `Array`: it just doesn't know until it sees the input what type it needs to unambiguously represent it.
For something like an `Array` or `RecordBatch` (or something like a `numpy` array) that is definitely Arrow and is definitely contiguous, I am not sure what the benefit would be for `__arrow_c_stream__` to exist, and it is probably just confusing if it does.
There are other assumptions that can't be captured by the mere existence of either of those, like exactly how expensive it will be to call any one of those methods. In https://github.com/shapely/shapely/pull/1953 both are fairly expensive because the data are not Arrow yet. For a database driver, it might be expensive to consume the stream because the data haven't arrived over the network yet.
The Python buffer protocol has a `flags` field to handle consumer requests along these lines (like a request for contiguous, rather than strided, memory) that could be used to disambiguate some of these cases if it turns out that disambiguating them is important. It is also careful to note that the existence of the buffer protocol implementation does not imply that attempting to get the buffer will succeed.
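The same contiguity distinction is visible from the Python side of the buffer protocol: an object can be a perfectly valid buffer exporter while the memory it exposes is strided, and a consumer that needs contiguous memory has to check (or copy). A minimal stdlib-only illustration, analogous to the C-level `PyBUF_*` flag requests:

```python
# A strided memoryview is still a valid buffer, but it is not C-contiguous.
buf = bytearray(range(10))
strided = memoryview(buf)[::2]  # every second byte

print(strided.contiguous)  # False
print(strided.tolist())    # [0, 2, 4, 6, 8]

# A consumer that requires contiguous memory can copy into a fresh buffer.
contiguous = bytes(strided)
print(memoryview(contiguous).contiguous)  # True
```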
For consuming in nanoarrow, the current approach is to use `__arrow_c_stream__` whenever possible, since this has the fewest constraints (arrays need not be in memory yet, need not be contiguous, might not be fully consumed). Then it falls back on `__arrow_c_array__`. The entrypoint is `nanoarrow.c_array_stream()`, which will happily accept either (generating a length-one stream if needed).
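That prefer-stream-then-fall-back order can be sketched in a few lines. This is a hypothetical simplification, not nanoarrow's actual internals: the dummy producers return plain Python lists where the real protocol exchanges PyCapsules, and `consume_as_stream` is an illustrative name.

```python
def consume_as_stream(obj):
    """Yield 'arrays' from any producer, preferring the stream export."""
    if hasattr(obj, "__arrow_c_stream__"):
        return iter(obj.__arrow_c_stream__())
    if hasattr(obj, "__arrow_c_array__"):
        # Fall back: wrap the single contiguous array in a length-one stream.
        return iter([obj.__arrow_c_array__()])
    raise TypeError("object does not export Arrow data")

class DummyArray:
    def __arrow_c_array__(self):
        return [1, 2, 3]  # stand-in for (schema capsule, array capsule)

class DummyChunked:
    def __arrow_c_stream__(self):
        return [[1, 2], [3]]  # stand-in for a stream capsule

print(list(consume_as_stream(DummyArray())))    # [[1, 2, 3]]
print(list(consume_as_stream(DummyChunked())))  # [[1, 2], [3]]
```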
Being able to infer the input structure also significantly helps static typing. For example, I have type hints that I'm writing for geoarrow-rust that include:
```python
@overload
def centroid(input: ArrowArrayExportable) -> PointArray: ...
@overload
def centroid(input: ArrowStreamExportable) -> ChunkedPointArray: ...
def centroid(
    input: ArrowArrayExportable | ArrowStreamExportable,
) -> PointArray | ChunkedPointArray: ...
```
I'm not sure which overload a type checker would pick if the input object had both dunder methods. I suppose it would always return the union. But being able to use structural types in this way is quite useful for static type checking and IDE autocompletion, which are really sore spots right now with pyarrow.
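For context, structural types like the ones in those hints can be defined with `typing.Protocol`. The sketch below mirrors the `ArrowArrayExportable` / `ArrowStreamExportable` names from the comment above but elides the capsule details (the real dunders return PyCapsule objects); the `requested_schema` parameter and `Contiguous` class are illustrative assumptions.

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class ArrowArrayExportable(Protocol):
    def __arrow_c_array__(self, requested_schema=None): ...

@runtime_checkable
class ArrowStreamExportable(Protocol):
    def __arrow_c_stream__(self, requested_schema=None): ...

class Contiguous:
    """A producer exporting only the single-array protocol."""
    def __arrow_c_array__(self, requested_schema=None):
        return object()  # stand-in for (schema capsule, array capsule)

# runtime_checkable protocols check only for the method's presence:
print(isinstance(Contiguous(), ArrowArrayExportable))   # True
print(isinstance(Contiguous(), ArrowStreamExportable))  # False
```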
> for me this is like `pyarrow.array()` returning either a `ChunkedArray` or an `Array`: it just doesn't know until it sees the input what type it needs to unambiguously represent it.
Does `pyarrow.array` ever return a `ChunkedArray`? I tried to pass in a list of lists and it inferred a `ListArray`. I thought `pyarrow.array` only ever returned an `Array` and `pyarrow.chunked_array` only ever returned a `ChunkedArray`?
> For something like an `Array` or `RecordBatch` (or something like a `numpy` array) that is definitely Arrow and is definitely contiguous, I am not sure what the benefit would be for `__arrow_c_stream__` to exist and it is probably just confusing if it does.
So your argument is that `Array` should never have `__arrow_c_stream__`, but that `ChunkedArray` should have both `__arrow_c_array__` and `__arrow_c_stream__`?
> Does pyarrow.array ever return a ChunkedArray?
```python
import pyarrow as pa

type(pa.array(["a" * 2**20 for _ in range(2**10)]))
#> pyarrow.lib.StringArray
type(pa.array(["a" * 2**20 for _ in range(2**11)]))
#> pyarrow.lib.ChunkedArray
```

(In the second call the total string data reaches 2 GiB, which exceeds what a `StringArray`'s 32-bit offsets can address, so pyarrow transparently chunks the result.)
This is also true of the `__arrow_array__` protocol.
> I'm not sure which overload a type checker would pick if the input object had both dunder methods.
Would the typing hints be significantly different for a `ChunkedPointArray` vs a `PointArray`?
> So your argument is that Array should never have arrow_c_stream, but that ChunkedArray should have both arrow_c_array and arrow_c_stream?
Maybe "could" is more like it. pyarrow has an `Array` class for a specifically contiguous array... nanoarrow doesn't at the moment (at least in a user-facing, nicely typed sort of way).
Describe the usage question you have. Please include as many useful details as possible.
👋 I've been excited about the PyCapsule interface, and have been implementing it in my geoarrow-rust project. Every function call accepts any Arrow PyCapsule interface object, no matter its producer. It's really amazing!
Fundamentally, my question is whether the existence of methods on an object should allow for an inference of its storage type. That is, should it be possible to observe whether a producer object is chunked or not based on whether it exports `__arrow_c_array__` or `__arrow_c_stream__`? I had been expecting yes, as pyarrow implements only the former on `Array` and `RecordBatch` and only the latter on `ChunkedArray` and `Table` (to my knowledge). But this question came up here, where nanoarrow implements both `__arrow_c_array__` and `__arrow_c_stream__`.
I'd argue that it's simpler to only define a single type of export method on a class and allow the consumer to convert to a different representation if they need. This communicates more information about how the existing data is already stored in memory. But in general I think it's really useful if the community is able to agree on a convention here, which will inform whether consumers can expect this invariant to hold or not.
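The inference in question can be made concrete with a hypothetical helper (the name `infer_layout` and the string labels are illustrative, not part of any spec): under a one-dunder-per-class convention the result is unambiguous, while under nanoarrow's approach a class exporting both dunders defeats the inference.

```python
def infer_layout(obj) -> str:
    """Guess storage layout from which Arrow PyCapsule dunders exist."""
    has_array = hasattr(obj, "__arrow_c_array__")
    has_stream = hasattr(obj, "__arrow_c_stream__")
    if has_array and has_stream:
        return "ambiguous"   # e.g. nanoarrow.Array, which exports both
    if has_array:
        return "contiguous"  # e.g. pyarrow Array / RecordBatch
    if has_stream:
        return "chunked"     # e.g. pyarrow ChunkedArray / Table
    return "not arrow"

class OnlyArray:
    def __arrow_c_array__(self): ...

class Both:
    def __arrow_c_array__(self): ...
    def __arrow_c_stream__(self): ...

print(infer_layout(OnlyArray()))  # contiguous
print(infer_layout(Both()))       # ambiguous
```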
Component(s)
Python