geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License
258 stars 22 forks source link

Read Arrow: Support for new options as of GDAL 3.8 #345

Open kylebarron opened 5 months ago

kylebarron commented 5 months ago

I was just re-reading the GetArrowStream API docs and saw that there are two new options as of GDAL 3.8:

It would be great to support these. The metadata encoding would be useful for my purposes. I don't use timezone information as often, but someone probably would like that.

brendan-ward commented 5 months ago

Looks like we'd need to add a keyword on ogr_open_arrow to set the geometry encoding, maybe geometry_encoding="wkb"|"geoarrow" (not finding "ogc" as a meaningful value here compared to "wkb" which is specific).

brendan-ward commented 5 months ago

Oops, I mistakenly conflated the actual geometry encoding - which looks like it is either WKB or native (which may be GeoArrow) without a way to specifically request GeoArrow geometry encoding - with the metadata that sets extension types (which I don't entirely follow here).

Does the geoarrow.wkb extension type mean that the actual geometry encoding is still (OGC) WKB, or is it serialized GeoArrow bytes. Apologies for not being more up to speed on GeoArrow!

kylebarron commented 5 months ago

Looks like we'd need to add a keyword on ogr_open_arrow to set the geometry encoding, maybe geometry_encoding="wkb"|"geoarrow" (not finding "ogc" as a meaningful value here compared to "wkb" which is specific).

I agree this is confusing 🙂 . The key is that the switch does not change the actual encoding of the geometry. The geometries are always WKB. Rather, the switch only changes the metadata applied to the Arrow geometry column. As it is, the geometry column is always loaded with an Arrow extension name of ogc.wkb, which is "non-standard" (if you call GeoArrow a standard 😅).

What this toggle does is change the extension name from ogc.wkb to geoarrow.wkb and includes the CRS as projjson onto the arrow metadata. So for example, right now read_arrow returns a tuple with (metadata, table) = read_arrow() where the metadata object includes information about the CRS. When you set GEOMETRY_METADATA_ENCODING=GEOARROW, that metadata object is no longer needed because the CRS is stored on the table's internal metadata.

brendan-ward commented 5 months ago

Thanks for the explanation. It seems like this is therefore more of a boolean, opt-in option. metadata="ogc"|"geoarrow" doesn't seem great though it somewhat parallels the options on the GDAL side.

What about geoarrow_metadata=False|True? Where True means GEOMETRY_METADATA_ENCODING=GEOARROW and returns the stream with the additional things in the metadata. And presumably if set it means that we don't need to separately try to resolve the CRS using other calls to GDAL.

jorisvandenbossche commented 5 months ago

For the geoarrow metadata, I would personally consider making that the default here (I understand that GDAL didn't do that for backwards compatibility, but given this is quite new, I would still make the change here)

jorisvandenbossche commented 5 months ago

The TIMEZONE option was added after an issue I opened ((https://github.com/OSGeo/gdal/issues/8460)) triggered by our testing here when adding timezone read/write support in pyogrio.

The problem is that GDAL's data model stores a fixed offset to UTC per individual value, and not an actual time "zone" (like "Europe/Brussels"). While for the Arrow output, it needs to be a single timezone indicator for the full column (as it is part of the column's type). For some formats, Even added a specialized code path to preserve the timezone information. But for other formats, the best it can do is either preserve the wall time (unknown time zone) or convert to UTC.

And so my understanding is that the option allows you to choose between those. But the default is driver-dependent, so the option allows you to override the default of the driver. It's a bit complex to clearly explain ..

kylebarron commented 4 months ago

I don't work with timezones a lot so in #366 I took the simpler route and only tried to add geoarrow metadata handling