ianmcook opened 2 weeks ago
It's easy enough to create a record batch reader from a collection of multiple Arrow IPC stream files that have the same schema using code like this:
```python
import glob

import pyarrow as pa

def get_schema(paths):
    # All files share the same schema, so read it from the first file
    with open(paths[0], "rb") as file:
        reader = pa.ipc.open_stream(file)
        return reader.schema

def get_batches(paths):
    # Lazily yield the record batches from each file in turn
    for path in paths:
        with pa.memory_map(path) as file:  # or use: open(path, "rb")
            reader = pa.ipc.open_stream(file)
            for batch in reader:
                yield batch

paths = glob.glob("*.arrows")
reader = pa.RecordBatchReader.from_batches(
    get_schema(paths),
    get_batches(paths)
)
```
I can confirm based on testing that this works lazily: it doesn't read any of the record batches into memory. To read the record batches into memory, call `reader.read_next_batch()` or `reader.read_all()` after the above. Reading the batches will typically be faster if you use `open(path, "rb")` instead of `pa.memory_map(path)` in the definition of `get_batches`, but the tradeoff is that it uses much more memory.
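For illustration, here is a minimal sketch of consuming the reader built above; this is just standard `RecordBatchReader` usage, nothing new:

```python
# A reader can only be consumed once, so pick one of these.

# Stream through the batches one at a time (stays lazy):
for batch in reader:
    print(batch.num_rows)

# Or materialize everything into a single in-memory Table:
# table = reader.read_all()
```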
Regardless of this, it would be nice to have a method in PyArrow that expresses this more concisely.
### Describe the enhancement requested
It would be nice to have a method in PyArrow for lazily reading a collection of Arrow IPC stream files. This would be a great complement to the Arrow over HTTP project, because a common use case is for a user to download multiple Arrow IPC stream files from an HTTP server and then read them into Python.
The dataset API works with files in the Arrow IPC file format, but it does not currently work with files in the Arrow IPC stream format.
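For reference, a minimal sketch of what the dataset API can already do today (assuming a hypothetical directory `data/` containing IPC *file*-format files):

```python
import pyarrow.dataset as ds

# Works today for the IPC file format (e.g. .arrow / .feather files):
dataset = ds.dataset("data/", format="arrow")

# There is no equivalent format option for the IPC stream format,
# so .arrows files can't be read this way.
```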
Also, as far as I can tell, it is not currently possible to directly create a record batch stream reader from multiple Arrow IPC stream files that share the same schema.
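As a rough illustration of what such a convenience could look like, here is a hypothetical `open_streams()` helper built entirely from the existing public API; the name and signature are made up and not a proposal for the final API:

```python
import glob

import pyarrow as pa

def open_streams(paths):
    """Hypothetical helper: a lazy RecordBatchReader over multiple
    Arrow IPC stream files that share the same schema."""
    # Take the schema from the first file; a real implementation
    # could also validate that every file's schema matches.
    with open(paths[0], "rb") as file:
        schema = pa.ipc.open_stream(file).schema

    def batches():
        for path in paths:
            with pa.memory_map(path) as file:
                for batch in pa.ipc.open_stream(file):
                    yield batch

    return pa.RecordBatchReader.from_batches(schema, batches())

reader = open_streams(glob.glob("*.arrows"))
```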
### Component(s)
Python