apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] Method to lazily read a collection of multiple Arrow IPC stream files #44561

Open ianmcook opened 2 weeks ago

ianmcook commented 2 weeks ago

Describe the enhancement requested

It would be nice to have some method available in PyArrow to lazily read a collection of Arrow IPC stream files. This would be a great complement to the Arrow over HTTP project, because a common use case is for the user to download multiple Arrow IPC stream files from the HTTP server and then read them into Python.

The dataset API works with files in the Arrow IPC file format, but it does not currently work with files in the Arrow IPC stream format.

Also, as far as I can tell, it is not currently possible to directly create a record batch stream reader from a collection of multiple Arrow IPC stream files with the same schema.

Component(s)

Python

ianmcook commented 2 weeks ago

It's easy enough to create a record batch reader from a collection of multiple Arrow IPC stream files that have the same schema using code like this:

import pyarrow as pa
import glob

def get_schema(paths):
    # all files share the same schema, so read it from the first file
    with open(paths[0], "rb") as file:
        reader = pa.ipc.open_stream(file)
        return reader.schema

def get_batches(paths):
    for path in paths:
        with pa.memory_map(path) as file:  # or use: open(path, "rb") 
            reader = pa.ipc.open_stream(file)
            for batch in reader:
                yield batch

paths = glob.glob("*.arrows")

reader = pa.ipc.RecordBatchStreamReader.from_batches(
    get_schema(paths),
    get_batches(paths)
)

I can confirm based on testing that this works lazily: it doesn't read any of the record batches into memory until you call reader.read_next_batch() or reader.read_all().

Reading the batches will typically be faster if you use open(path, "rb") instead of pa.memory_map(path) in the definition of get_batches, but the tradeoff is that it uses a lot more memory.

Regardless, it would be nice for PyArrow to offer a method that expresses this more concisely.