
[Python]: Support PyCapsule Interface Objects as input in more places #43410

Open kylebarron opened 3 months ago

kylebarron commented 3 months ago

Describe the enhancement requested

Now that the PyCapsule Interface is starting to gain more traction (https://github.com/apache/arrow/issues/39195), I think it would be great if some of pyarrow's functional APIs accepted any PyCapsule Interface object, and not just pyarrow objects.

Do people have opinions on which functions should or should not check for these objects? I'd argue that file format writers should check for them, because it's only a couple of lines of code and the input stream will be fully iterated over regardless. For example, looking at the Parquet writer: the high-level API doesn't currently accept a RecordBatchReader either, so support for both could come at the same time.

from dataclasses import dataclass
from typing import Any

import pyarrow as pa
import pyarrow.parquet as pq

@dataclass
class ArrowCStream:
    # Minimal wrapper that only exposes __arrow_c_stream__, standing in for
    # any non-pyarrow producer that implements the PyCapsule Interface.
    obj: Any

    def __arrow_c_stream__(self, requested_schema=None):
        return self.obj.__arrow_c_stream__(requested_schema=requested_schema)

table = pa.table({"a": [1, 2, 3, 4]})
pq.write_table(table, "test.parquet")  # works

reader = pa.RecordBatchReader.from_stream(table)
pq.write_table(reader, "test.parquet")  # fails
pq.write_table(ArrowCStream(table), "test.parquet")  # fails

I'd argue that the writer should be generalized to accept any object with an __arrow_c_stream__ dunder, and that it should avoid materializing the stream as a table.
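As a minimal sketch (the helper name _as_batch_reader is hypothetical), the check could be as small as:

import pyarrow as pa

def _as_batch_reader(data):
    # Hypothetical helper: normalize any PyCapsule Interface stream producer
    # (or an existing RecordBatchReader) into a RecordBatchReader, without
    # materializing it as a Table.
    if isinstance(data, pa.RecordBatchReader):
        return data
    if hasattr(data, "__arrow_c_stream__"):
        return pa.RecordBatchReader.from_stream(data)
    raise TypeError("expected an object implementing __arrow_c_stream__")

A writer could then consume _as_batch_reader(ArrowCStream(table)) batch by batch instead of requiring a pyarrow Table up front.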

Component(s)

Python

jorisvandenbossche commented 2 months ago

Specifically for pq.write_table(), this might be a bit trickier (without consuming the stream), because it currently uses parquet::arrow::FileWriter::WriteTable, which explicitly requires a table input. The FileWriter interface supports writing record batches as well, so we could rewrite the code a bit to iterate over the batches of the stream (but at that point, should that still be done in something called write_table?)
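A rough sketch of what that could look like at the Python level, using the existing pq.ParquetWriter instead of the WriteTable path (the function name write_stream is hypothetical):

import pyarrow as pa
import pyarrow.parquet as pq

def write_stream(data, where):
    # Hypothetical: consume any __arrow_c_stream__ producer batch by batch,
    # so the stream is never materialized as a single Table.
    reader = pa.RecordBatchReader.from_stream(data)
    with pq.ParquetWriter(where, reader.schema) as writer:
        for batch in reader:
            writer.write_batch(batch)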

But in general, certainly +1 on more widely supporting the interface.

Some other possible areas:

jorisvandenbossche commented 2 months ago

Started with exploring write_dataset -> https://github.com/apache/arrow/pull/43771

kylebarron commented 2 months ago

That sounds awesome.

For reference, in my own experiments in https://github.com/kylebarron/arro3 I created an ArrayReader class, essentially a RecordBatchReader generalized to yield generic Arrays. Compute functions such as cast are then overloaded on their input: if cast sees an object with __arrow_c_array__, it immediately returns an arro3.Array with the result; if it sees an object with __arrow_c_stream__, it creates a new ArrayReader holding an iterator that applies the compute function, so it lazily yields casted chunks.
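To make that dispatch concrete, here is a rough pyarrow-flavored sketch of the lazy branch (not arro3's actual API; cast_lazy and the record-batch framing are illustrative assumptions):

import pyarrow as pa
import pyarrow.compute as pc

def cast_lazy(data, target_schema):
    # Illustrative only: wrap any __arrow_c_stream__ producer and cast each
    # batch on demand, so casted chunks are yielded lazily.
    reader = pa.RecordBatchReader.from_stream(data)

    def casted_batches():
        for batch in reader:
            arrays = [
                pc.cast(column, field.type)
                for column, field in zip(batch.columns, target_schema)
            ]
            yield pa.RecordBatch.from_arrays(arrays, schema=target_schema)

    return pa.RecordBatchReader.from_batches(target_schema, casted_batches())

The eager branch for an __arrow_c_array__ input would simply return the casted result directly instead of a reader.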