[Open] kylebarron opened 3 months ago
Specifically for `pq.write_table()`, this might be a bit trickier (without consuming the stream), because it currently uses `parquet::arrow::FileWriter::WriteTable`, which explicitly requires a table input. The `FileWriter` interface supports writing record batches as well, so we could rewrite the code a bit to iterate over the batches of the stream (but at that point, should that still be done in something called `write_table`?)
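For illustration, here's a minimal sketch of the batch-wise approach (not the actual `write_table` implementation; it assumes `pa.RecordBatchReader.from_stream`, available in recent pyarrow versions, and writes through the Python-level `ParquetWriter`):

```python
import pyarrow as pa
import pyarrow.parquet as pq

def write_stream(data, where):
    # Import anything implementing __arrow_c_stream__ as a RecordBatchReader
    # without materializing it.
    reader = pa.RecordBatchReader.from_stream(data)
    # Write batch by batch, so the stream is never collected into a Table.
    with pq.ParquetWriter(where, reader.schema) as writer:
        for batch in reader:
            writer.write_batch(batch)
```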
But in general, certainly +1 on more widely supporting the interface.
Some other possible areas:

- `pyarrow.dataset.write_dataset` already does accept a record batch reader, so this should be straightforward to extend
- The functions in `pyarrow.compute`? Those could certainly accept objects with `__arrow_c_array__`, and in theory also `__arrow_c_stream__`, but those will fully consume the stream and return a materialized result, so not sure if that will be expected? (Although, if you know those functions, that is kind of expected, so maybe this just requires good documentation.)
- Methods that accept array-like input (e.g. `arr.take(..)`). Not sure if we want to make those work with interface objects as well. Currently what exactly we support as input is a bit inconsistent anyway (only strictly a pyarrow array, or also a numpy array, a list, anything array-like, or any sequence or collection?), so if we would harmonize that with some helper, we could at the same time easily add support for any arrow-array-like object (a rough sketch of such a helper follows below).
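To make that last point concrete, a rough sketch of what such a harmonizing helper could look like (the helper name is made up, and `pa.array()` accepting `__arrow_c_array__` objects is an assumption about recent pyarrow versions):

```python
import numpy as np
import pyarrow as pa

def as_arrow_array(values) -> pa.Array:
    """Normalize the various accepted inputs to a pyarrow Array (sketch)."""
    if isinstance(values, pa.Array):
        return values
    if hasattr(values, "__arrow_c_array__"):
        # Any Arrow-array-like object; recent pyarrow versions can import
        # these directly through pa.array().
        return pa.array(values)
    if isinstance(values, np.ndarray):
        return pa.array(values)
    # Lists, tuples, and other sequences or collections.
    return pa.array(list(values))
```

Methods like `arr.take(..)` could then funnel their arguments through one such helper instead of each doing its own ad-hoc checks.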
Started with exploring `write_dataset` -> https://github.com/apache/arrow/pull/43771
That sounds awesome.
For reference, in my own experiments in https://github.com/kylebarron/arro3 I created an `ArrayReader` class, essentially just a `RecordBatchReader` but generalized to yield generic `Array`s. Then, for example, `cast` is overloaded: if it sees an object with `__arrow_c_array__`, it will immediately return an `arro3.Array` with the result; if it sees an object with `__arrow_c_stream__`, it will create a new `ArrayReader` holding an iterator with the compute function, so it will lazily yield casted chunks.
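To illustrate the pattern schematically (using plain pyarrow rather than arro3's actual classes, and simplified to return a bare generator instead of an `ArrayReader`):

```python
import pyarrow as pa
import pyarrow.compute as pc

def cast(obj, target_type: pa.DataType):
    """Eager result for array inputs, lazily cast chunks for stream inputs."""
    if hasattr(obj, "__arrow_c_array__"):
        # Array-protocol input: compute and return the result immediately.
        return pc.cast(pa.array(obj), target_type)
    if hasattr(obj, "__arrow_c_stream__"):
        # Stream-protocol input: wrap the reader so each chunk is cast only
        # when the consumer iterates over it.
        reader = pa.RecordBatchReader.from_stream(obj)

        def casted_chunks():
            for batch in reader:
                for column in batch.columns:
                    yield pc.cast(column, target_type)

        return casted_chunks()
    raise TypeError("expected an object implementing the Arrow PyCapsule interface")
```

In arro3 the lazy branch returns an `ArrayReader` rather than a bare generator, so the result can presumably be handed on as a stream again.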
Describe the enhancement requested
Now that the PyCapsule Interface is starting to gain more traction (https://github.com/apache/arrow/issues/39195), I think it would be great if some of pyarrow's functional APIs accepted any PyCapsule Interface object, and not just pyarrow objects.
Do people have opinions on what functions should or should not check for these objects? I'd argue that file format writers should check for them, because it's only a couple lines of code, and the input stream will be fully iterated over regardless. E.g. looking at the Parquet writer: the high-level API doesn't currently accept a `RecordBatchReader` either, so support for both can come at the same time. I'd argue that the writer should be generalized to accept any object with an `__arrow_c_stream__` dunder, and to ensure the stream is not materialized as a table.
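As a concrete illustration of the "couple lines of code", the check at the writer's boundary could look roughly like this (a sketch; the helper name is made up and `RecordBatchReader.from_stream` is assumed to be available):

```python
import pyarrow as pa

def _ensure_batch_reader(data):
    # Hypothetical input check at the writer's API boundary.
    if isinstance(data, (pa.Table, pa.RecordBatch, pa.RecordBatchReader)):
        return data  # already something the writer knows how to handle
    if hasattr(data, "__arrow_c_stream__"):
        # Import the stream lazily; calling pa.table(data) here would
        # materialize the whole stream in memory.
        return pa.RecordBatchReader.from_stream(data)
    raise TypeError(f"unsupported input type: {type(data)!r}")
```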
Component(s)
Python