geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License

Yield batches from ogr_read_arrow #205

Closed kylebarron closed 1 year ago

kylebarron commented 1 year ago

I'm playing around more with reading Arrow tables from pyogrio and it's really exciting. It does feel like having some API to yield batches would be helpful for working with larger datasets. @jorisvandenbossche wrote in https://github.com/geopandas/pyogrio/pull/155:

Returning a materialized Table (from read_all()) is fine for our current use case, but I can imagine that in the future, we might want to expose an iterative RecordBatchReader as well (eg as batch-wise input to query engine). When we want to do that, I assume that we somehow need to keep the GDAL Dataset alive (putting it in a Python object (wrapping in a small class, or putting in a PyCapsule with destructor), and keeping a reference to that object from the RecordBatchReader).
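
A rough sketch of that "keep a reference alive" idea in pure Python (illustrative only; `dataset`, `read_next_batch`, and `close` are placeholders standing in for the GDAL/OGR handle, not actual pyogrio internals):

import pyarrow as pa

def _batches(dataset):
    # The generator closure keeps `dataset` referenced for as long as the
    # RecordBatchReader is being consumed, then releases it.
    try:
        while True:
            batch = dataset.read_next_batch()  # placeholder for the OGR Arrow stream call
            if batch is None:
                break
            yield batch
    finally:
        dataset.close()  # placeholder: closes the GDAL dataset

def as_record_batch_reader(dataset, schema: pa.Schema) -> pa.RecordBatchReader:
    # The reader holds the generator, and the generator holds the dataset,
    # so the dataset stays alive until the reader is exhausted or closed.
    return pa.RecordBatchReader.from_batches(schema, _batches(dataset))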

I've never touched ogr bindings before, but naively it seems the easiest way to do this is by using a context manager:

with open_arrow("file.shp") as reader:
    for record_batch in reader:
        ...  # process each batch without materializing the full table

would that work? just putting a yield here?
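
As a sketch of what that could look like with contextlib (the helpers `_open_dataset`, `_arrow_batches`, and `_close_dataset` are assumed stand-ins for pyogrio's Cython layer, not existing functions):

from contextlib import contextmanager

@contextmanager
def open_arrow(path, **kwargs):
    # Open the OGR dataset and hand out an iterator of Arrow RecordBatches.
    dataset = _open_dataset(path, **kwargs)      # assumed helper
    try:
        # Execution suspends at the yield, so `dataset` stays alive for the
        # whole with-block while the caller iterates over batches.
        yield _arrow_batches(dataset, **kwargs)  # assumed helper
    finally:
        _close_dataset(dataset)                  # assumed helper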

brendan-ward commented 1 year ago

Do you envision that there would be a counterpart function to write data in batches, via the Arrow I/O API (once available)?

kylebarron commented 1 year ago

That seems entirely dependent on GDAL? https://gdal.org/development/rfc/rfc86_column_oriented_api.html says

Potential future work, not in the scope of this RFC, could be the addition of a column-oriented method to write new features, a WriteRecordBatch() method.

I would of course love for that to be added to GDAL, and getting greater adoption of RFC 86 seems very helpful for that.

https://github.com/geopandas/pyogrio/pull/206 appeared to work on my local machine 🤷‍♂️. If a maintainer could enable CI on that PR, that would be helpful!

kylebarron commented 1 year ago

If GDAL were to add it, I think a similar API like

with write_ogr_batches("file.gpkg", arrow_schema) as writer:
    writer.write_batch(batch)

could make sense. But for my own needs I think I'm more likely to write only to GeoParquet, and thus not use OGR as much for writing.
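
For the GeoParquet route, streaming batches out is already possible with plain pyarrow (a sketch; `schema` and `batches` would come from the Arrow read path, and writing valid GeoParquet would additionally require the geo metadata on the schema, which is omitted here):

import pyarrow.parquet as pq

def write_batches(path, schema, batches):
    # Stream RecordBatches into a single Parquet file without
    # materializing the full table in memory.
    with pq.ParquetWriter(path, schema) as writer:
        for batch in batches:
            writer.write_batch(batch)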

jorisvandenbossche commented 1 year ago

@kylebarron thanks for looking into this! That seems like a really nice idea; I hadn't thought about using a context manager to keep the dataset alive.

jorisvandenbossche commented 1 year ago

This was closed by #206