Write Arrow Table/RecordBatchReader to GDAL

kylebarron commented 5 months ago

A very WIP implementation for writing Arrow. This is mostly to start discussion and is largely inspired/copied from https://github.com/OSGeo/gdal/pull/9133.

Notes:

This uses the Arrow PyCapsule interface so that it can read from any object that exposes itself via Arrow. This means that it doesn't have a pyarrow dependency, and can read from pyarrow.Table, pyarrow.RecordBatchReader, and geoarrow-rs' GeoTable. The RecordBatchReader means that it can handle an iterator of batches, so they don't all have to be materialized at once in a Python Table.

Todo:

Figure out how to create the OGR Layer first
I'm sure my use of pointers is wildly incorrect.

References:

Closes https://github.com/geopandas/pyogrio/issues/314.

kylebarron commented 5 months ago

This now compiles on 3.8! I still need to figure out how to exclude this code from trying to compile on earlier GDAL versions, and I still need to test it.

jorisvandenbossche commented 5 months ago

I still need to figure out how to exclude this code from trying to compile on earlier GDAL versions, and I still need to test it.

Hmm, looking at how it is done on the reading side, it's only the declaration in pxd that we put behind an IF, and then in the actual code we just raise an error at the beginning of the function, but still compile it regardless of the GDAL version (wondering if cython does something smart here)

kylebarron commented 5 months ago

Yeah I was confused because I thought I was doing the same process as we have for reading

kylebarron commented 5 months ago

I was able to compile and install it locally with

virtualenv env
source ./env/bin/activate
pip install --no-cache-dir -e '.[dev,test,geopandas]'

but trying to use it with

import pyarrow as pa
from pyogrio._io import ogr_write_arrow

import geopandas as gpd
from geopandas.io.arrow import _geopandas_to_arrow
gdf = gpd.read_file(gpd.datasets.get_path('nybb'))
table = _geopandas_to_arrow(gdf)

new_schema = table.schema.set(4, pa.field("geometry", pa.binary(), True, {"ARROW:extension:name": "ogc.wkb"}))
new_table = pa.Table.from_batches(table.to_batches(), new_schema)

test = ogr_write_arrow('test.fgb', 'test', 'FlatGeobuf', new_table, 'MultiPolygon', {}, {})

is crashing Python. Does anyone have any suggestions for how to debug? I don't know where to go from here.

jorisvandenbossche commented 5 months ago

I fixed the segfault (gdb was pointing at exporting the schema, so stream.get_schema(stream, schema). My understanding is that we need to pass a schema as a pointer, but we need to have allocated it (and get_schema "fills it in"). So changed the declaration of schema and array a bit. That fixed the segfault. Then there was still an error in create_fields_from_arrow_schema always erroring because of checking for a wrong error code. But with that fixed, your example script runs without any error. The strange thing is that it only doesn't write any file .. Not fully sure what I am missing here.

jorisvandenbossche commented 5 months ago

Those annoying error return codes ... ;) We need to very carefully inspect every one while writing: the GDAL function OGR_L_WriteArrowBatch is returning a bool (true on success), but the ArrowArrayStream.get_next uses "0 on success, a non-zero error code otherwise.". While the code was doing it the other way around ..

Now it is writing actual content to the file. Just not for FlatGeobuf (that gives some error while writing), but for eg GPKG it is working.

I also opened https://github.com/geopandas/pyogrio/pull/351 to more easily integrate this with all options for setting up the layer/dataset.

kylebarron commented 4 months ago

I merged in https://github.com/geopandas/pyogrio/pull/351 which simplified this a lot.

I'm still struggling with my dev environment though. In particular getting

ModuleNotFoundError: No module named 'pyogrio._ogr'

when trying to import pyogrio.raw. This is after pip install '.[dev,test,geopandas]'.

jorisvandenbossche commented 4 months ago

Do you happen to do that import from within the project directory? Because in that case, you need to install it in editable mode (pip install -e .).

The issue is that when importing (or running tests) from the root of repo, python will find the pyogrio directory in the repo and try to import that, but the built extension module files are installed to site-packages. This is actually a reason to prefer a "src-layout", i.e. nesting the package directory in a src/ directory, but we haven't done that here (https://packaging.python.org/en/latest/discussions/src-layout-vs-flat-layout/)

kylebarron commented 4 months ago

Ah yes I've hit this so many times. It's also why compiled rust projects advise to use a different name for the folder containing python code.

kylebarron commented 4 months ago

I added some conditional compilation statements around other functions that were trying to use GDAL Arrow struct definitions

IF CTE_GDAL_VERSION >= (3, 8, 0):

so let's see if we can get it to compile with that

kylebarron commented 4 months ago

I'll try to add some tests, but the meat of the PR should be ready for review

kylebarron commented 4 months ago

It looks like the geometry is not being picked up correctly. E.g trying to write GeoJSON I see:

{ "type": "Feature", "properties": { "pop_est": 3856181, "continent": "Europe", "name": "Bosnia and Herz.", "iso_a3": "BIH", "gdp_md_est": 42530.0, "geometry": "0103000000010000001700000080C2F5285C8F32403033333333534540C0FFD3A7C7AC314090BAD7EFA783454030CE3DAB204C3140982FB2B021B9454080F5993A89EA3040E0D322EE77D54540F0F63671D9743040480BF3574705464000E525605A3D3040B82A7843F22C46404070F66A03802F403840268BCB684640809EEF3032EB2F40D04E9A65EC9D4640E00AE5B87251304090BB3A3987804640D01CF6C9F1883040B877F6F4159B4640802C6AA48C003140D04E9A65EC9D464090639DD79DDC314090D385B7AB884640300037719F8D3240A870BC87718A46403F244770670133404254F4291C6E464050C8242367013340E08442041C6E464050F2CD36375E33406039B4C8766E4640D0AE5FB01B1E3340A8F5622827364640007311DF8999334040F0BF95EC0446406062105839743340D86D3480B7C8454090853DEDF0373340182B6A300DC3454030EBE2361A083340F8449E245DB7454040478FDFDBB4324020CD58349D99454080C2F5285C8F32403033333333534540" }, "geometry": null },

I was hoping that the geometry column would automatically be picked up, but it looks like no. Presumably we need this section of the upstream GDAL implementation

kylebarron commented 4 months ago

With the latest changes, I require a geometry_name argument and it appears to now be correctly interpreting the geometry column. But the written file is still getting corrupted. Here the GeoJSON ends abruptly

{ "type": "Feature", "properties": { "pop_est": 1218208, "continent": "North America", "name": "Trinidad and Tobago", "iso_a3": "TTO", "gdp_md_est": 43570.0 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -61.68, 10.76 ], [ -61.105, 10.89 ], [ -60.895, 10.855 ], [ -60.935, 10.11 ], [ -61.77, 10.0 ], [ -61.95, 10.09 ], [ -61.66, 10.365 ], [ -61.68, 10.76 ] ] ] } },
{ "type": "Feature", "properties": { "pop_est": 13026129, "continent": "Africa", "name": "S. Sudan", "iso_a3": "SSD", "gdp_md_est": 20880.0 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ 30.833852421715427, 3.509171604222463 ], [ 29.953500197069474, 4.173699042167684 ], [ 29.71599531425602, 4.600804755060153 ], [ 29.1590784034465, 4.389267279473231 ], [ 28.696677687298802, 4.455077215996937 ], [ 28.428993768026913, 4.287154649264494 ], [ 27.979977247842811, 4.408413397637375 ], [ 27.37422610851749, 5.233944403500061 ], [ 27.21340905122517, 5.550953477394557 ], [ 26.465909458123235, 5.94671743410187 ], [ 26.213418409945117, 6.546603298362072 ], [ 25.796647983511178, 6.979315904158071 ], [ 25.124130893664727, 7.500085150579437 ], [ 25.114932488716789, 7.825104071479174 ], [ 24.567369012152085, 8.229187933785468 ], [ 23.886979580860668, 8.619729712933065 ], [ 24.19406772118765, 8.728696472403897 ], [ 24.53741516360202, 8.91753756573172 ], [ 24.794925745412684, 9.810240916008695 ], [ 25.069603699343986, 10.273759963267992 ], [ 25.790633328413946, 10.411098940233728 ], [ 2%

I really don't know what's going on there.

Ideally we would also be able to infer the geometry column name from the presence of the geometry extension name, but I don't know how to port these lines.

jorisvandenbossche commented 4 months ago

Looking at how this is used in GDAL itself (the test from the PR you linked above), I first didn't understand how this works there without never setting any geometry field. But I assume that it works because the arrow table that emitted by GDAL itself uses the extension type that it then recognizes on input?

Anyway, looking a bit more at the documentation (https://gdal.org/tutorials/vector_api_tut.html#writing-to-ogr-using-the-arrow-c-data-interface and https://gdal.org/api/ogrlayer_cpp.html#_CPPv4N8OGRLayer15WriteArrowBatchEPK11ArrowSchemaP10ArrowArray12CSLConstList), it seems that we need to set the GEOMETRY_NAME option, if not creating the geometry field on the layer definition explicitly. The example of copy_layer in the WriteArrowBatch docs either explicitly creates the geometry fields with CreateGeomField if there are multiple geometry fields, or otherwise sets the GEOMETRY_NAME option. The docs indicate that it should fall back to recognize a default "wkb_geometry" named column, but that didn't seem to work for me (the created GeoJSON still had a "null" geometry, like in your output above)

Adding a small patch to set the option, now it works for any geometry column name. The GeoJSON output is for me also randomly cut off at some point creating a corrupted file. But that might be a bug on the GDAL side. Because writing to GPKG does create a file that correctly roundtrips.

jorisvandenbossche commented 4 months ago

The docs indicate that it should fall back to recognize a default "wkb_geometry" named column, but that didn't seem to work for me (the created GeoJSON still had a "null" geometry, like in your output above)

Ah, that's probably because I didn't specify the column name to geometry_name argument, and as a result the code here still creates a field with that name. And I suppose that if an attribute field exists with that name, it will not use that column anymore to write the geometry field.

But so with this patch

--- a/pyogrio/_io.pyx
+++ b/pyogrio/_io.pyx
@@ -1706,9 +1706,10 @@ def ogr_write_arrow(
         dataset_metadata, layer_metadata,
         &ogr_dataset, &ogr_layer,
     )
+    cdef char **options = dict_to_options({"GEOMETRY_NAME": geometry_name})

     stream_capsule = arrow_obj.__arrow_c_stream__()
-    return write_arrow_stream_capsule(ogr_layer, stream_capsule, geometry_name)
+    return write_arrow_stream_capsule(ogr_layer, stream_capsule, geometry_name, options)

 IF CTE_GDAL_VERSION >= (3, 8, 0):

I could successfully roundtrip to GPKG:

In [2]: gdf = pyogrio.read_dataframe(geopandas.datasets.get_path("naturalearth_lowres"))

In [3]: table = geopandas.io.arrow._geopandas_to_arrow(gdf)

In [6]: pyogrio.raw.write_arrow("test3.gpkg", table, geometry_type="MultiPolygon", geometry_name="geometry")
.. warnings ..

In [7]: pyogrio.read_dataframe("test3.gpkg")
Out[7]: 
         pop_est      continent                      name iso_a3  gdp_md_est                                           geometry
0       889953.0        Oceania                      Fiji    FJI        5496  MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
1     58005463.0         Africa                  Tanzania    TZA       63177  POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
2       603253.0         Africa                 W. Sahara    ESH         907  POLYGON ((-8.66559 27.65643, -8.66512 27.58948...
3     37589262.0  North America                    Canada    CAN     1736425  MULTIPOLYGON (((-122.84000 49.00000, -122.9742...
4    328239523.0  North America  United States of America    USA    21433226  MULTIPOLYGON (((-122.84000 49.00000, -120.0000...
..           ...            ...                       ...    ...         ...                                                ...
172    6944975.0         Europe                    Serbia    SRB       51475  POLYGON ((18.82982 45.90887, 18.82984 45.90888...
173     622137.0         Europe                Montenegro    MNE        5542  POLYGON ((20.07070 42.58863, 19.80161 42.50009...
174    1794248.0         Europe                    Kosovo    -99        7926  POLYGON ((20.59025 41.85541, 20.52295 42.21787...
175    1394973.0  North America       Trinidad and Tobago    TTO       24269  POLYGON ((-61.68000 10.76000, -61.10500 10.890...
176   11062113.0         Africa                  S. Sudan    SSD       11998  POLYGON ((30.83385 3.50917, 29.95350 4.17370, ...

[177 rows x 6 columns]

jorisvandenbossche commented 4 months ago

Ideally we would also be able to infer the geometry column name from the presence of the geometry extension name, but I don't know how to port these lines.

I was going to say: we can do that on the python side where this should be simple, but of course I was thinking about a pyarrow table where we can easily check the schema. To make it generic for any object implementing the Arrow PyCapsule protocol, it indeed needs to be done similarly like the above. Will try to take a stab at this.

But so summarizing, we can probably do something along those lines:

Provide a geometry_name keyword for if you want to specify this explicitly as a user (eg if your column has a custom name / doesn't have geoarrow metadata set)
When looping over the schema's fields (in create_fields_from_arrow_schema to call OGR_L_CreateFieldFromArrowSchema), we try to infer if a column is a geometry column (with similar logic as the snippet from GDAL, i.e. checking the metadata). And when a column name is equal to the one passed as geometry_name or inferred as geometry type, we skip creating a field for it.
Pass the inferred/specified geometry column name to GDAL through the option (I suppose if we infer it, GDAL would also infer it, but just setting it anyway shouldn't hurt).
- Future enhancement can then be to support multiple geometry fields (but I think that not all formats support this, so it's still good to use the simpler option for the case of 1 geometry field)

jorisvandenbossche commented 4 months ago

Not the nicest piece of code, but it seems to be working ;)

(writing a table that was read with pyogrio.raw.read_arrow, and thus has a "ogc.wkb" column, works without specifying the geometry column name)

jorisvandenbossche commented 4 months ago

But the written file is still getting corrupted. Here the GeoJSON ends abruptly

I think the last commit should fix this, by ensuring we properly close the GDAL dataset object.

kylebarron commented 4 months ago

I think the last commit should fix this, by ensuring we properly close the GDAL dataset object.

Do you have an idea why CI is failing?

jorisvandenbossche commented 4 months ago

The latest run seems to be mostly passing, except for linting (I think we switched to ruff since the PR was started, so might have to merge in main to get those changes and run pre-commit), and except for on Mac build (but that's a flaky segfault that has been happening lately, independent from this PR, so something we should try to investigate and fix, but nothing to worry about for this PR)

kylebarron commented 4 months ago

Thanks for pushing on this; this is really exciting!

jorisvandenbossche commented 3 months ago

Pushed a few more changes:

In the write_arrow() signature, changed they first keywords order to data, path instead of path, data. That makes it consistent with the write_dataframe (not consistent with the "raw" write function, but in functionality it is closer to the geopandas-based version)
Added a docstring. Mostly copied from write_dataframe, just updating for the data keyword, and for geometry_type and geometry_name
Added a few more tests

jorisvandenbossche commented 2 months ago

@himikof @brendan-ward thanks a lot for the extensive review! Already addressed a few of the comments, will need to go through the bulk of them one of the coming days

jorisvandenbossche commented 2 months ago

Can you please add a test when crs is not passed in to verify the warning is raised?

If possible, would be good to have that also test (or in a separate test case) if GDAL auto detects CRS from metadata on the geometry column (per the TODO comment).

Added a test for not providing a crs. That directly tests the second aspect as well, because the table that we are writing in test was written with the result of read_arrow which will have this CRS metadata in its schema.

Can you please add a test for a driver that doesn't support write? See test_raw_io.py::test_write_unsupported.

It would probably also be good to probe at close errors on write, similar to test_raw_io.py::test_write_gdalclose_error.

It might be good to test other drivers similar to test_raw_io.py::test_write_supported, though it looks like only FGB would be particularly unique there.

If possible, please add tests for append capability supported / unsupported, similar to test_raw_io.py::test_write_append / test_raw_io.py::test_write_append_unsupported

Copied and adapted those tests from raw io (and fixed the append=True case, we were still incorrectly trying to add new fields to the existing layer). Only for geojson-like formats I get some error that I still have to report upstream.

Is the idea to add support to write_dataframe in a later PR? (would be a good idea, this is already getting pretty big)

Yes, that was my idea. I already have a branch locally that starts testing that (most errors are about the missing support for promoting the geometry type, but which is something for which we could add some custom shapely code in the geopandas -> arrow conversion), but indeed was planning to leave that for a separate PR ;)

kylebarron commented 2 months ago

Honestly it was mostly @jorisvandenbossche , but I'm happy to have got the ball rolling!

geopandas / pyogrio

Write Arrow Table/RecordBatchReader to GDAL #346