Closed kylebarron closed 2 months ago
This now compiles on 3.8! I still need to figure out how to exclude this code from trying to compile on earlier GDAL versions, and I still need to test it.
> I still need to figure out how to exclude this code from trying to compile on earlier GDAL versions, and I still need to test it.
Hmm, looking at how it is done on the reading side, it's only the declaration in pxd that we put behind an IF, and then in the actual code we just raise an error at the beginning of the function, but still compile it regardless of the GDAL version (wondering if cython does something smart here)
Yeah I was confused because I thought I was doing the same process as we have for reading
I was able to compile and install it locally with
virtualenv env
source ./env/bin/activate
pip install --no-cache-dir -e '.[dev,test,geopandas]'
but trying to use it with
import pyarrow as pa
from pyogrio._io import ogr_write_arrow
import geopandas as gpd
from geopandas.io.arrow import _geopandas_to_arrow
gdf = gpd.read_file(gpd.datasets.get_path('nybb'))
table = _geopandas_to_arrow(gdf)
new_schema = table.schema.set(4, pa.field("geometry", pa.binary(), True, {"ARROW:extension:name": "ogc.wkb"}))
new_table = pa.Table.from_batches(table.to_batches(), new_schema)
test = ogr_write_arrow('test.fgb', 'test', 'FlatGeobuf', new_table, 'MultiPolygon', {}, {})
is crashing Python. Does anyone have any suggestions for how to debug? I don't know where to go from here.
I fixed the segfault (gdb was pointing at exporting the schema, i.e. `stream.get_schema(stream, schema)`). My understanding is that we need to pass a schema as a pointer, but we need to have allocated it ourselves (and `get_schema` "fills it in"). So I changed the declaration of schema and array a bit.
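To make the fix concrete, here is a small ctypes sketch of the caller-allocates pattern (the struct and helper here are simplified stand-ins, not the real ArrowSchema or pyogrio code): the consumer allocates the struct itself and passes a pointer, and the producer fills it in.

```python
import ctypes

class FakeSchema(ctypes.Structure):
    # Simplified stand-in for the real ArrowSchema struct
    _fields_ = [("n_children", ctypes.c_int64)]

def fill_schema(schema_ptr):
    # Stand-in producer: fills in the caller-allocated struct,
    # returning 0 on success per the stream convention
    schema_ptr.contents.n_children = 5
    return 0

schema = FakeSchema()                       # caller-allocated storage (the fix)
err = fill_schema(ctypes.pointer(schema))   # pass a pointer to our storage
assert err == 0
print(schema.n_children)                    # -> 5
```

Passing an uninitialized or dangling pointer instead of the address of caller-owned storage is exactly the kind of mistake that segfaults here.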
That fixed the segfault. Then there was still an error: `create_fields_from_arrow_schema` was always erroring because it was checking for the wrong error code. But with that fixed, your example script runs without any error. The strange thing is that it just doesn't write any file... Not fully sure what I am missing here.
Those annoying error return codes ... ;) We need to inspect every one very carefully while writing: the GDAL function `OGR_L_WriteArrowBatch` returns a bool (true on success), but `ArrowArrayStream.get_next` uses "0 on success, a non-zero error code otherwise." The code was doing it the other way around.
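To make the mismatch concrete, here is a minimal Python sketch of the two conventions (the helper names are made up for illustration):

```python
def check_ogr_write(ok: bool) -> None:
    # GDAL convention for OGR_L_WriteArrowBatch: a falsy return means failure
    if not ok:
        raise RuntimeError("OGR_L_WriteArrowBatch failed")

def check_stream_call(code: int) -> None:
    # Arrow C stream convention for get_next/get_schema:
    # 0 means success, any non-zero return is an errno-style error code
    if code != 0:
        raise RuntimeError(f"ArrowArrayStream call failed with code {code}")

check_ogr_write(True)   # success: no exception
check_stream_call(0)    # success: no exception
```

Swapping the two checks, as the code originally did, makes every successful stream call look like an error and every failed write look like a success.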
Now it is writing actual content to the file. Just not for FlatGeobuf (that gives some error while writing), but for eg GPKG it is working.
I also opened https://github.com/geopandas/pyogrio/pull/351 to more easily integrate this with all options for setting up the layer/dataset.
I merged in https://github.com/geopandas/pyogrio/pull/351 which simplified this a lot.
I'm still struggling with my dev environment though. In particular, I'm getting `ModuleNotFoundError: No module named 'pyogrio._ogr'` when trying to import `pyogrio.raw`. This is after `pip install '.[dev,test,geopandas]'`.
Do you happen to do that import from within the project directory? Because in that case, you need to install it in editable mode (`pip install -e .`).
The issue is that when importing (or running tests) from the root of repo, python will find the pyogrio
directory in the repo and try to import that, but the built extension module files are installed to site-packages.
This is actually a reason to prefer a "src-layout", i.e. nesting the package directory in a src/
directory, but we haven't done that here (https://packaging.python.org/en/latest/discussions/src-layout-vs-flat-layout/)
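To illustrate the difference (a hedged sketch; file names abbreviated):

```
# flat layout (current): `import pyogrio` from the repo root picks up the
# source tree, which lacks the built _ogr extension module
pyogrio/
    __init__.py
    _io.pyx

# src layout: the package is not importable from the repo root, so Python
# falls back to the installed copy in site-packages
src/
    pyogrio/
        __init__.py
        _io.pyx
```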
Ah yes, I've hit this so many times. It's also why compiled Rust projects advise using a different name for the folder containing Python code.
I added some conditional compilation statements (`IF CTE_GDAL_VERSION >= (3, 8, 0):`) around the other functions that were trying to use GDAL Arrow struct definitions, so let's see if we can get it to compile with that.
I'll try to add some tests, but the meat of the PR should be ready for review
It looks like the geometry is not being picked up correctly. E.g. trying to write GeoJSON I see:
{ "type": "Feature", "properties": { "pop_est": 3856181, "continent": "Europe", "name": "Bosnia and Herz.", "iso_a3": "BIH", "gdp_md_est": 42530.0, "geometry": "0103000000010000001700000080C2F5285C8F32403033333333534540C0FFD3A7C7AC314090BAD7EFA783454030CE3DAB204C3140982FB2B021B9454080F5993A89EA3040E0D322EE77D54540F0F63671D9743040480BF3574705464000E525605A3D3040B82A7843F22C46404070F66A03802F403840268BCB684640809EEF3032EB2F40D04E9A65EC9D4640E00AE5B87251304090BB3A3987804640D01CF6C9F1883040B877F6F4159B4640802C6AA48C003140D04E9A65EC9D464090639DD79DDC314090D385B7AB884640300037719F8D3240A870BC87718A46403F244770670133404254F4291C6E464050C8242367013340E08442041C6E464050F2CD36375E33406039B4C8766E4640D0AE5FB01B1E3340A8F5622827364640007311DF8999334040F0BF95EC0446406062105839743340D86D3480B7C8454090853DEDF0373340182B6A300DC3454030EBE2361A083340F8449E245DB7454040478FDFDBB4324020CD58349D99454080C2F5285C8F32403033333333534540" }, "geometry": null },
I was hoping that the geometry column would automatically be picked up, but it looks like no. Presumably we need this section of the upstream GDAL implementation.
With the latest changes, I require a `geometry_name` argument, and it appears to now be correctly interpreting the geometry column. But the written file is still getting corrupted. Here the GeoJSON ends abruptly:
{ "type": "Feature", "properties": { "pop_est": 1218208, "continent": "North America", "name": "Trinidad and Tobago", "iso_a3": "TTO", "gdp_md_est": 43570.0 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ -61.68, 10.76 ], [ -61.105, 10.89 ], [ -60.895, 10.855 ], [ -60.935, 10.11 ], [ -61.77, 10.0 ], [ -61.95, 10.09 ], [ -61.66, 10.365 ], [ -61.68, 10.76 ] ] ] } },
{ "type": "Feature", "properties": { "pop_est": 13026129, "continent": "Africa", "name": "S. Sudan", "iso_a3": "SSD", "gdp_md_est": 20880.0 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ 30.833852421715427, 3.509171604222463 ], [ 29.953500197069474, 4.173699042167684 ], [ 29.71599531425602, 4.600804755060153 ], [ 29.1590784034465, 4.389267279473231 ], [ 28.696677687298802, 4.455077215996937 ], [ 28.428993768026913, 4.287154649264494 ], [ 27.979977247842811, 4.408413397637375 ], [ 27.37422610851749, 5.233944403500061 ], [ 27.21340905122517, 5.550953477394557 ], [ 26.465909458123235, 5.94671743410187 ], [ 26.213418409945117, 6.546603298362072 ], [ 25.796647983511178, 6.979315904158071 ], [ 25.124130893664727, 7.500085150579437 ], [ 25.114932488716789, 7.825104071479174 ], [ 24.567369012152085, 8.229187933785468 ], [ 23.886979580860668, 8.619729712933065 ], [ 24.19406772118765, 8.728696472403897 ], [ 24.53741516360202, 8.91753756573172 ], [ 24.794925745412684, 9.810240916008695 ], [ 25.069603699343986, 10.273759963267992 ], [ 25.790633328413946, 10.411098940233728 ], [ 2%
I really don't know what's going on there.
Ideally we would also be able to infer the geometry column name from the presence of the geometry extension name, but I don't know how to port these lines.
Looking at how this is used in GDAL itself (the test from the PR you linked above), I first didn't understand how this works there without ever setting any geometry field. But I assume that it works because the arrow table emitted by GDAL itself uses the extension type, which it then recognizes on input?
Anyway, looking a bit more at the documentation (https://gdal.org/tutorials/vector_api_tut.html#writing-to-ogr-using-the-arrow-c-data-interface and https://gdal.org/api/ogrlayer_cpp.html#_CPPv4N8OGRLayer15WriteArrowBatchEPK11ArrowSchemaP10ArrowArray12CSLConstList), it seems that we need to set the `GEOMETRY_NAME` option if we're not creating the geometry field on the layer definition explicitly. The `copy_layer` example in the WriteArrowBatch docs either explicitly creates the geometry fields with `CreateGeomField` if there are multiple geometry fields, or otherwise sets the `GEOMETRY_NAME` option.
The docs indicate that it should fall back to recognize a default "wkb_geometry" named column, but that didn't seem to work for me (the created GeoJSON still had a "null" geometry, like in your output above)
Adding a small patch to set the option, now it works for any geometry column name. The GeoJSON output is for me also randomly cut off at some point creating a corrupted file. But that might be a bug on the GDAL side. Because writing to GPKG does create a file that correctly roundtrips.
> The docs indicate that it should fall back to recognize a default "wkb_geometry" named column, but that didn't seem to work for me (the created GeoJSON still had a "null" geometry, like in your output above)
Ah, that's probably because I didn't specify the column name to the `geometry_name` argument, and as a result the code here still creates a field with that name. And I suppose that if an attribute field exists with that name, it will no longer use that column to write the geometry field.
But so with this patch:
--- a/pyogrio/_io.pyx
+++ b/pyogrio/_io.pyx
@@ -1706,9 +1706,10 @@ def ogr_write_arrow(
dataset_metadata, layer_metadata,
&ogr_dataset, &ogr_layer,
)
+ cdef char **options = dict_to_options({"GEOMETRY_NAME": geometry_name})
stream_capsule = arrow_obj.__arrow_c_stream__()
- return write_arrow_stream_capsule(ogr_layer, stream_capsule, geometry_name)
+ return write_arrow_stream_capsule(ogr_layer, stream_capsule, geometry_name, options)
IF CTE_GDAL_VERSION >= (3, 8, 0):
I could successfully roundtrip to GPKG:
In [2]: gdf = pyogrio.read_dataframe(geopandas.datasets.get_path("naturalearth_lowres"))
In [3]: table = geopandas.io.arrow._geopandas_to_arrow(gdf)
In [6]: pyogrio.raw.write_arrow("test3.gpkg", table, geometry_type="MultiPolygon", geometry_name="geometry")
.. warnings ..
In [7]: pyogrio.read_dataframe("test3.gpkg")
Out[7]:
pop_est continent name iso_a3 gdp_md_est geometry
0 889953.0 Oceania Fiji FJI 5496 MULTIPOLYGON (((180.00000 -16.06713, 180.00000...
1 58005463.0 Africa Tanzania TZA 63177 POLYGON ((33.90371 -0.95000, 34.07262 -1.05982...
2 603253.0 Africa W. Sahara ESH 907 POLYGON ((-8.66559 27.65643, -8.66512 27.58948...
3 37589262.0 North America Canada CAN 1736425 MULTIPOLYGON (((-122.84000 49.00000, -122.9742...
4 328239523.0 North America United States of America USA 21433226 MULTIPOLYGON (((-122.84000 49.00000, -120.0000...
.. ... ... ... ... ... ...
172 6944975.0 Europe Serbia SRB 51475 POLYGON ((18.82982 45.90887, 18.82984 45.90888...
173 622137.0 Europe Montenegro MNE 5542 POLYGON ((20.07070 42.58863, 19.80161 42.50009...
174 1794248.0 Europe Kosovo -99 7926 POLYGON ((20.59025 41.85541, 20.52295 42.21787...
175 1394973.0 North America Trinidad and Tobago TTO 24269 POLYGON ((-61.68000 10.76000, -61.10500 10.890...
176 11062113.0 Africa S. Sudan SSD 11998 POLYGON ((30.83385 3.50917, 29.95350 4.17370, ...
[177 rows x 6 columns]
> Ideally we would also be able to infer the geometry column name from the presence of the geometry extension name, but I don't know how to port these lines.
I was going to say: we can do that on the python side where this should be simple, but of course I was thinking about a pyarrow table where we can easily check the schema. To make it generic for any object implementing the Arrow PyCapsule protocol, it indeed needs to be done similarly like the above. Will try to take a stab at this.
But so summarizing, we can probably do something along those lines:

- A `geometry_name` keyword for if you want to specify this explicitly as a user (e.g. if your column has a custom name / doesn't have geoarrow metadata set).
- In `create_fields_from_arrow_schema` (which calls `OGR_L_CreateFieldFromArrowSchema`), we try to infer if a column is a geometry column (with similar logic as the snippet from GDAL, i.e. checking the metadata). And when a column name is equal to the one passed as `geometry_name` or inferred as geometry type, we skip creating a field for it.

Not the nicest piece of code, but it seems to be working ;)
(Writing a table that was read with `pyogrio.raw.read_arrow`, and thus has an "ogc.wkb" column, works without specifying the geometry column name.)
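The inference can be sketched in plain Python (the helper and the extension-name set here are illustrative; the real implementation reads the metadata off the C-level ArrowSchema):

```python
# Extension names that mark a WKB geometry column; "ogc.wkb" and
# "geoarrow.wkb" are the names used in practice, but treat this set
# as an assumption of the sketch rather than GDAL's exact list.
GEOMETRY_EXTENSIONS = {"ogc.wkb", "geoarrow.wkb"}

def infer_geometry_name(fields):
    """Return the first field whose metadata carries a geometry extension name.

    Fields are modeled as (name, metadata-dict) pairs for this sketch.
    """
    for name, metadata in fields:
        if (metadata or {}).get("ARROW:extension:name") in GEOMETRY_EXTENSIONS:
            return name
    return None

fields = [
    ("pop_est", None),
    ("geometry", {"ARROW:extension:name": "ogc.wkb"}),
]
print(infer_geometry_name(fields))  # -> geometry
```

A column matched this way can then be skipped when creating attribute fields and passed as the `GEOMETRY_NAME` layer option instead.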
> But the written file is still getting corrupted. Here the GeoJSON ends abruptly
I think the last commit should fix this, by ensuring we properly close the GDAL dataset object.
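The pattern behind that fix can be sketched as follows (the classes are made-up stand-ins, not pyogrio internals): many GDAL drivers only finalize the file when the dataset handle is closed, so the close must happen even if consuming the stream raises; otherwise the output is truncated, like the abruptly ending GeoJSON above.

```python
class FakeDataset:
    """Stand-in for a GDAL dataset handle."""
    def __init__(self):
        self.closed = False

    def close(self):
        # In GDAL this is where e.g. GeoJSON's closing brackets get written
        self.closed = True

def write_all(dataset, batches, write_batch):
    try:
        for batch in batches:
            write_batch(batch)
    finally:
        dataset.close()  # guarantee finalization even on error

ds = FakeDataset()
write_all(ds, [1, 2, 3], lambda b: None)
assert ds.closed
```

Without the `finally`, an exception partway through the stream would leave the dataset open and the file incomplete.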
> I think the last commit should fix this, by ensuring we properly close the GDAL dataset object.
Do you have an idea why CI is failing?
The latest run seems to be mostly passing, except for linting (I think we switched to ruff since the PR was started, so you might have to merge in main to get those changes and run pre-commit) and for the Mac build. But that's a flaky segfault that has been happening lately, independent from this PR; it's something we should investigate and fix, but nothing to worry about for this PR.
Thanks for pushing on this; this is really exciting!
Pushed a few more changes:

- In the `write_arrow()` signature, changed the order of the first keywords to `data, path` instead of `path, data`. That makes it consistent with `write_dataframe` (not consistent with the "raw" write function, but in functionality it is closer to the geopandas-based version).
- `write_dataframe`: just updating for the data keyword, and for `geometry_type` and `geometry_name`.
@himikof @brendan-ward thanks a lot for the extensive review! Already addressed a few of the comments, will need to go through the bulk of them one of the coming days
- Can you please add a test when `crs` is not passed in, to verify the warning is raised?
- If possible, it would be good to have that also test (or in a separate test case) if GDAL auto-detects the CRS from metadata on the geometry column (per the `TODO` comment).
Added a test for not providing a `crs`. That directly tests the second aspect as well, because the table we are writing in the test was the result of `read_arrow`, which will have this CRS metadata in its schema.
- Can you please add a test for a driver that doesn't support write? See `test_raw_io.py::test_write_unsupported`.
- It would probably also be good to probe at close errors on write, similar to `test_raw_io.py::test_write_gdalclose_error`.
- It might be good to test other drivers similar to `test_raw_io.py::test_write_supported`, though it looks like only FGB would be particularly unique there.
- If possible, please add tests for append capability supported / unsupported, similar to `test_raw_io.py::test_write_append` / `test_raw_io.py::test_write_append_unsupported`.
Copied and adapted those tests from raw io (and fixed the `append=True` case; we were still incorrectly trying to add new fields to the existing layer). Only for GeoJSON-like formats do I get some error that I still have to report upstream.
Is the idea to add support to `write_dataframe` in a later PR? (Would be a good idea; this is already getting pretty big.)
Yes, that was my idea. I already have a branch locally that starts testing that (most errors are about the missing support for promoting the geometry type, which is something for which we could add some custom shapely code in the geopandas -> arrow conversion), but I was indeed planning to leave that for a separate PR ;)
Honestly it was mostly @jorisvandenbossche , but I'm happy to have got the ball rolling!
A very WIP implementation for writing Arrow. This is mostly to start discussion and is largely inspired/copied from https://github.com/OSGeo/gdal/pull/9133.
Notes:

- `pyarrow.Table`, `pyarrow.RecordBatchReader`, and geoarrow-rs' `GeoTable`. The `RecordBatchReader` means that it can handle an iterator of batches, so they don't all have to be materialized at once in a Python Table.

Todo:
References:

- `OGR_L_WriteArrowBatch`
Closes https://github.com/geopandas/pyogrio/issues/314.