geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License
257 stars 21 forks source link

ENH: allow using Arrow writing in pyogrio.write_dataframe (use_arrow=True option) #395

Closed jorisvandenbossche closed 2 months ago

jorisvandenbossche commented 2 months ago

Follow-up on https://github.com/geopandas/pyogrio/pull/346 to expose the Arrow-based writing through geopandas-based write_dataframe as well.

Updated our geopandas writing tests to be parametrized on use_arrow for writing as well. This didn't catch any issues with the write_arrow itself, but we needed to do some pre-processing to handle the promotion to multi types consistently (as in the normal write path, this happens per geometry inside ogr_write)

There are a few things that don't work with Arrow:

jorisvandenbossche commented 2 months ago

It would be nice to somehow avoid repeated traversals over the GeoDataFrame to standardize the geometries to their multi equivalents (under promote_to_multi option), but I'm not seeing anything obvious. I wonder if this would be a good function to consider adding to shapely to do a single traversal over the geometry array and promote each type to its multi equivalent...

Yes, this is currently lacking a proper helper in shapely, see https://github.com/shapely/shapely/issues/1617

jorisvandenbossche commented 2 months ago

One remaining review comment is the one about writing the index or not: https://github.com/geopandas/pyogrio/pull/395#discussion_r1580972807

For consistency, I should probably drop the index here too? Although we should maybe reconsider that for the default write_dataframe code path as well, to actually write the index of the dataframe as a field, if not a RangeIndex?

For the use in geopandas itself, it doesn't matter, because it has its own logic (it will reset the index to a column to ensure it gets written)

theroggy commented 2 months ago

One remaining review comment is the one about writing the index or not: #395 (comment)

For consistency, I should probably drop the index here too? Although we should maybe reconsider that for the default write_dataframe code path as well, to actually write the index of the dataframe as a field, if not a RangeIndex?

For the use in geopandas itself, it doesn't matter, because it has its own logic (it will reset the index to a column to ensure it gets written)

I don't have an explicit preference either way, but it indeed should be consistent...

brendan-ward commented 2 months ago

One remaining review comment is the one about writing the index or not: https://github.com/geopandas/pyogrio/pull/395#discussion_r1580972807

my preference would be to drop it to be consistent, and revisit both of those in the future if needed, and possibly do so on the GeoPandas side rather than here. There are a number of issues to consider if we do make it a column: how to handle name collisions, name truncation (e.g., shapefile), multi-indexes, and round-tripping as an index.

jorisvandenbossche commented 2 months ago

OK, dropped the index here, and also added an explicit test for it (other tests will also fail in case the index would get preserved, but good to have an explicit test about it)

revisit both of those in the future if needed, and possibly do so on the GeoPandas side rather than here.

Note GeoPandas already has considered this (they explicitly reset the index to a column in order to preserve it, if the index has a name and is not a RangeIndex)

brendan-ward commented 2 months ago

Thanks @jorisvandenbossche !

brendan-ward commented 2 months ago

@jorisvandenbossche it looks like testing the sdist failed after merging this: https://github.com/geopandas/pyogrio/actions/runs/8914593210/job/24482500369

jorisvandenbossche commented 2 months ago

Will take a look, I suppose a missing marker (although a bit strange it only fails there)