apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] Method to lazily read a collection of multiple Arrow IPC stream files #44561

Open ianmcook opened 2 weeks ago

ianmcook commented 2 weeks ago

Describe the enhancement requested

It would be nice to have some method available in PyArrow to lazily read a collection of Arrow IPC stream files. This would be a great complement to the Arrow over HTTP project, because a common use case is for the user to download multiple Arrow IPC stream files from the HTTP server and then read them into Python.

The dataset API works with files in the Arrow IPC file format, but it does not currently work with files in the Arrow IPC stream format.

Also, as far as I can tell, it is not currently possible to directly create a record batch stream reader from a collection of multiple Arrow IPC stream files with the same schema.

Component(s)

Python

ianmcook commented 2 weeks ago

It's easy enough to create a record batch reader from a collection of multiple Arrow IPC stream files that have the same schema using code like this:

import pyarrow as pa
import glob

def get_schema(paths):
    # all files share the same schema, so read it from the first file
    with open(paths[0], "rb") as file:
        reader = pa.ipc.open_stream(file)
        return reader.schema

def get_batches(paths):
    for path in paths:
        with pa.memory_map(path) as file:  # or use: open(path, "rb") 
            reader = pa.ipc.open_stream(file)
            for batch in reader:
                yield batch

paths = glob.glob("*.arrows")

reader = pa.ipc.RecordBatchStreamReader.from_batches(
    get_schema(paths),
    get_batches(paths)
)

I can confirm based on testing that this works lazily: it doesn't read any of the record batches into memory until you call reader.read_next_batch() or reader.read_all().

Reading the batches will typically be faster if you use open(path, "rb") instead of pa.memory_map(path) in the definition of get_batches, but the tradeoff is that it uses a lot more memory.

Regardless, it would be nice for PyArrow to offer a method that expresses this more concisely.