Closed: kylebarron closed this issue 1 year ago
Do you envision that there would be a counterpart function to write data in batches, via the Arrow I/O API (once available)?
That seems entirely dependent on GDAL? https://gdal.org/development/rfc/rfc86_column_oriented_api.html says:
Potential future work, not in the scope of this RFC, could be the addition of a column-oriented method to write new features, a WriteRecordBatch() method.
I would of course love for that to be added to GDAL, and getting greater adoption of RFC 86 seems very helpful for that.
https://github.com/geopandas/pyogrio/pull/206 appeared to work on my local machine 🤷‍♂️. If a maintainer could enable CI on that PR, that would be helpful!
If GDAL were to add it, I think a similar API like

```python
with write_ogr_batches("file.gpkg", arrow_schema) as writer:
    writer.write_batch(batch)
```

could make sense. But for my own needs I think I'm more likely to write only to GeoParquet, and thus not use OGR as much for writing.
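As a rough sketch of what that could look like under the hood — purely hypothetical, since GDAL has no `WriteRecordBatch()` yet; `write_ogr_batches` and the `_create_dataset`/`_write_record_batch`/`_close_dataset` helpers are all made-up names for illustration:

```python
from contextlib import contextmanager


def _create_dataset(path, arrow_schema):
    # Stand-in for a binding that creates an OGR dataset from an Arrow schema.
    raise NotImplementedError("GDAL has no column-oriented write API yet (RFC 86)")


def _write_record_batch(dataset, batch):
    # Stand-in for the hypothetical WriteRecordBatch() mentioned in RFC 86.
    raise NotImplementedError


def _close_dataset(dataset):
    # Stand-in for flushing and releasing the dataset handle.
    raise NotImplementedError


class _BatchWriter:
    def __init__(self, dataset):
        self._dataset = dataset

    def write_batch(self, batch):
        _write_record_batch(self._dataset, batch)


@contextmanager
def write_ogr_batches(path, arrow_schema):
    # Open the dataset once, hand a writer to the caller, and guarantee the
    # dataset is closed even if writing a batch raises.
    dataset = _create_dataset(path, arrow_schema)
    try:
        yield _BatchWriter(dataset)
    finally:
        _close_dataset(dataset)
```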
@kylebarron thanks for looking into this! That seems like a really nice idea; I hadn't thought about using a context manager to keep the dataset alive.
This was closed by #206
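For reference, a minimal usage sketch of the merged context-manager API, assuming it yields `(meta, reader)` with `reader` a `pyarrow.RecordBatchReader` (the shape of `open_arrow` in released pyogrio; the exact signature in the PR may differ):

```python
# Sketch: assumes pyogrio.raw.open_arrow yields (meta, reader), where `reader`
# is a pyarrow.RecordBatchReader that can be iterated batch by batch.
from pyogrio.raw import open_arrow

with open_arrow("file.gpkg", batch_size=65_536) as (meta, reader):
    for batch in reader:
        # Each `batch` is a pyarrow.RecordBatch, so a dataset larger than
        # memory can be processed incrementally.
        print(batch.num_rows)
```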
I'm playing around more with reading arrow tables from pyogrio and it's really exciting. It does feel like having some API to yield batches would be helpful to work with larger datasets. @jorisvandenbossche wrote in https://github.com/geopandas/pyogrio/pull/155:
I've never touched the OGR bindings before, but naively it seems the easiest way to do this is by using a context manager:
Would that work? Just putting a yield here?
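That "just put a yield here" idea is the generator-based context manager pattern: the yield pauses the function for the duration of the `with` body, so the dataset stays open while the caller iterates over batches. A minimal illustrative sketch — the `_open_ogr_dataset`/`_read_batches`/`_close_ogr_dataset` names are placeholders, not pyogrio's real internals:

```python
from contextlib import contextmanager


def _open_ogr_dataset(path):
    # Placeholder for the low-level open call in the real bindings.
    raise NotImplementedError


def _read_batches(dataset):
    # Placeholder: would wrap OGR's ArrowArrayStream as an iterator of batches.
    raise NotImplementedError


def _close_ogr_dataset(dataset):
    # Placeholder for releasing the dataset handle.
    raise NotImplementedError


@contextmanager
def open_arrow_batches(path):
    dataset = _open_ogr_dataset(path)
    try:
        # Execution pauses at this yield for the whole `with` body, keeping
        # the dataset (and the Arrow stream backed by it) alive while the
        # caller consumes batches.
        yield _read_batches(dataset)
    finally:
        # Runs when the `with` block exits, even on error.
        _close_ogr_dataset(dataset)
```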