ENH: allow using Arrow writing in pyogrio.write_dataframe (use_arrow=True option)

jorisvandenbossche commented 2 months ago

Follow-up on https://github.com/geopandas/pyogrio/pull/346 to expose the Arrow-based writing through geopandas-based write_dataframe as well.

Updated our geopandas writing tests to be parametrized on use_arrow for writing as well. This didn't catch any issues with the write_arrow itself, but we needed to do some pre-processing to handle the promotion to multi types consistently (as in the normal write path, this happens per geometry inside ogr_write)

There are a few things that don't work with Arrow:

No support (yet) for writing without a geometry column
~~No support (yet) for writing an all-missing geometry column (the conversion from pandas to Arrow then doesn't result in the correct type, something that could be fixed in the pre-processing)~~ -> add that in the preprocessing
No support for mixed type columns (that's an inherent limitation of Arrow, although in theory we could cast all values to string first, as that is also what the non-Arrow path eventually does when passing the values to GDAL)

jorisvandenbossche commented 2 months ago

It would be nice to somehow avoid repeated traversals over the GeoDataFrame to standardize the geometries to their multi equivalents (under promote_to_multi option), but I'm not seeing anything obvious. I wonder if this would be a good function to consider adding to shapely to do a single traversal over the geometry array and promote each type to its multi equivalent...

Yes, this is currently lacking a proper helper in shapely, see https://github.com/shapely/shapely/issues/1617

jorisvandenbossche commented 2 months ago

One remaining review comment is the one about writing the index or not: https://github.com/geopandas/pyogrio/pull/395#discussion_r1580972807

For consistency, I should probably drop the index here too? Although we should maybe reconsider that for the default write_dataframe code path as well, to actually write the index of the dataframe as a field, if not a RangeIndex?

For the use in geopandas itself, it doesn't matter, because it has its own logic (it will reset the index to a column to ensure it gets written)

theroggy commented 2 months ago

One remaining review comment is the one about writing the index or not: #395 (comment)

For consistency, I should probably drop the index here too? Although we should maybe reconsider that for the default write_dataframe code path as well, to actually write the index of the dataframe as a field, if not a RangeIndex?

For the use in geopandas itself, it doesn't matter, because it has its own logic (it will reset the index to a column to ensure it gets written)

I don't have an explicit preference either way, but it indeed should be consistent...

brendan-ward commented 2 months ago

One remaining review comment is the one about writing the index or not: https://github.com/geopandas/pyogrio/pull/395#discussion_r1580972807

my preference would be to drop it to be consistent, and revisit both of those in the future if needed, and possibly do so on the GeoPandas side rather than here. There are a number of issues to consider if we do make it a column: how to handle name collisions, name truncation (e.g., shapefile), multi-indexes, and round-tripping as an index.

jorisvandenbossche commented 2 months ago

OK, dropped the index here, and also added an explicit test for it (other tests will also fail in case the index would get preserved, but good to have an explicit test about it)

revisit both of those in the future if needed, and possibly do so on the GeoPandas side rather than here.

Note GeoPandas already has considered this (they explicitly reset the index to a column in order to preserve it, if the index has a name and is not a RangeIndex)

brendan-ward commented 2 months ago

Thanks @jorisvandenbossche !

brendan-ward commented 2 months ago

@jorisvandenbossche it looks like testing the sdist failed after merging this: https://github.com/geopandas/pyogrio/actions/runs/8914593210/job/24482500369

jorisvandenbossche commented 2 months ago

Will take a look, I suppose a missing marker (although a bit strange it only fails there)

geopandas / pyogrio

ENH: allow using Arrow writing in pyogrio.write_dataframe (use_arrow=True option) #395