
[Python]: Support PyCapsule Interface Objects as input in more places #43410

Open kylebarron opened 3 months ago

kylebarron commented 3 months ago

Describe the enhancement requested

Now that the PyCapsule Interface is starting to gain more traction (https://github.com/apache/arrow/issues/39195), I think it would be great if some of pyarrow's functional APIs accepted any PyCapsule Interface object, and not just pyarrow objects.

Do people have opinions on which functions should or should not check for these objects? I'd argue that file format writers should check for them, because it's only a couple of lines of code and the input stream will be fully iterated over regardless. For example, looking at the Parquet writer: the high-level API doesn't currently accept a RecordBatchReader either, so support for both could come at the same time.

from dataclasses import dataclass
from typing import Any

import pyarrow as pa
import pyarrow.parquet as pq

@dataclass
class ArrowCStream:
    # Minimal wrapper that only exposes __arrow_c_stream__, standing in for
    # any non-pyarrow producer that implements the PyCapsule Interface.
    obj: Any

    def __arrow_c_stream__(self, requested_schema=None):
        return self.obj.__arrow_c_stream__(requested_schema=requested_schema)

table = pa.table({"a": [1, 2, 3, 4]})
pq.write_table(table, "test.parquet")  # works

reader = pa.RecordBatchReader.from_stream(table)
pq.write_table(reader, "test.parquet")  # fails
pq.write_table(ArrowCStream(table), "test.parquet")  # fails

I'd argue that the writer should be generalized to accept any object with an __arrow_c_stream__ dunder, and that it should avoid materializing the stream as a table.
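As a minimal sketch (the helper name _as_batch_reader is hypothetical), the check could be as small as:

import pyarrow as pa

def _as_batch_reader(data):
    # Hypothetical helper: normalize any PyCapsule Interface stream producer
    # (or an existing RecordBatchReader) into a RecordBatchReader, without
    # materializing it as a Table.
    if isinstance(data, pa.RecordBatchReader):
        return data
    if hasattr(data, "__arrow_c_stream__"):
        return pa.RecordBatchReader.from_stream(data)
    raise TypeError("expected an object implementing __arrow_c_stream__")

A writer could then consume _as_batch_reader(ArrowCStream(table)) batch by batch instead of requiring a pyarrow Table up front.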

Component(s)

Python

jorisvandenbossche commented 2 months ago

Specifically for pq.write_table(), this might be a bit trickier (without consuming the stream), because it currently uses parquet::arrow::FileWriter::WriteTable, which explicitly requires a table input. The FileWriter interface supports writing record batches as well, so we could rewrite the code a bit to iterate over the batches of the stream (but at that point, should that still be done in something called write_table?)
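A rough sketch of what that could look like at the Python level, using the existing pq.ParquetWriter instead of the WriteTable path (the function name write_stream is hypothetical):

import pyarrow as pa
import pyarrow.parquet as pq

def write_stream(data, where):
    # Hypothetical: consume any __arrow_c_stream__ producer batch by batch,
    # so the stream is never materialized as a single Table.
    reader = pa.RecordBatchReader.from_stream(data)
    with pq.ParquetWriter(where, reader.schema) as writer:
        for batch in reader:
            writer.write_batch(batch)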

But in general, certainly +1 on more widely supporting the interface.

Some other possible areas:

jorisvandenbossche commented 2 months ago

Started with exploring write_dataset -> https://github.com/apache/arrow/pull/43771

kylebarron commented 2 months ago

That sounds awesome.

For reference, in my own experiments in https://github.com/kylebarron/arro3 I created an ArrayReader class, essentially a RecordBatchReader generalized to yield generic Arrays. Compute functions such as cast are then overloaded on their input: if cast sees an object with __arrow_c_array__, it immediately returns an arro3.Array with the result; if it sees an object with __arrow_c_stream__, it creates a new ArrayReader holding an iterator that applies the compute function, so it lazily yields casted chunks.
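To make that dispatch concrete, here is a rough pyarrow-flavored sketch of the lazy branch (not arro3's actual API; cast_lazy and the record-batch framing are illustrative assumptions):

import pyarrow as pa
import pyarrow.compute as pc

def cast_lazy(data, target_schema):
    # Illustrative only: wrap any __arrow_c_stream__ producer and cast each
    # batch on demand, so casted chunks are yielded lazily.
    reader = pa.RecordBatchReader.from_stream(data)

    def casted_batches():
        for batch in reader:
            arrays = [
                pc.cast(column, field.type)
                for column, field in zip(batch.columns, target_schema)
            ]
            yield pa.RecordBatch.from_arrays(arrays, schema=target_schema)

    return pa.RecordBatchReader.from_batches(target_schema, casted_batches())

The eager branch for an __arrow_c_array__ input would simply return the casted result directly instead of a reader.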