Open kylebarron opened 5 months ago
Looks like we'd need to add a keyword on ogr_open_arrow
to set the geometry encoding, maybe geometry_encoding="wkb"|"geoarrow"
(not finding "ogc" as a meaningful value here compared to "wkb" which is specific).
Oops, I mistakenly conflated the actual geometry encoding - which looks like it is either WKB or native (which may be GeoArrow) without a way to specifically request GeoArrow geometry encoding - with the metadata that sets extension types (which I don't entirely follow here).
Does the geoarrow.wkb
extension type mean that the actual geometry encoding is still (OGC) WKB, or is it serialized GeoArrow bytes. Apologies for not being more up to speed on GeoArrow!
Looks like we'd need to add a keyword on
ogr_open_arrow
to set the geometry encoding, maybegeometry_encoding="wkb"|"geoarrow"
(not finding "ogc" as a meaningful value here compared to "wkb" which is specific).
I agree this is confusing 🙂 . The key is that the switch does not change the actual encoding of the geometry. The geometries are always WKB. Rather, the switch only changes the metadata applied to the Arrow geometry column. As it is, the geometry column is always loaded with an Arrow extension name of ogc.wkb
, which is "non-standard" (if you call GeoArrow a standard 😅).
What this toggle does is change the extension name from ogc.wkb
to geoarrow.wkb
and includes the CRS as projjson onto the arrow metadata. So for example, right now read_arrow
returns a tuple with (metadata, table) = read_arrow()
where the metadata object includes information about the CRS. When you set GEOMETRY_METADATA_ENCODING=GEOARROW
, that metadata object is no longer needed because the CRS is stored on the table's internal metadata.
Thanks for the explanation. It seems like this is therefore more of a boolean, opt-in option. metadata="ogc"|"geoarrow"
doesn't seem great though it somewhat parallels the options on the GDAL side.
What about geoarrow_metadata=False|True
? Where True
means GEOMETRY_METADATA_ENCODING=GEOARROW
and returns the stream with the additional things in the metadata. And presumably if set it means that we don't need to separately try to resolve the CRS using other calls to GDAL.
For the geoarrow metadata, I would personally consider making that the default here (I understand that GDAL didn't do that for backwards compatibility, but given this is quite new, I would still make the change here)
The TIMEZONE
option was added after an issue I opened ((https://github.com/OSGeo/gdal/issues/8460)) triggered by our testing here when adding timezone read/write support in pyogrio.
The problem is that GDAL's data model stores a fixed offset to UTC per individual value, and not an actual time "zone" (like "Europe/Brussels"). While for the Arrow output, it needs to be a single timezone indicator for the full column (as it is part of the column's type). For some formats, Even added a specialized code path to preserve the timezone information. But for other formats, the best it can do is either preserve the wall time (unknown time zone) or convert to UTC.
And so my understanding is that the option allows you to choose between those. But the default is driver-dependent, so the option allows you to override the default of the driver. It's a bit complex to clearly explain ..
I don't work with timezones a lot so in #366 I took the simpler route and only tried to add geoarrow metadata handling
I was just re-reading the
GetArrowStream
API docs and saw that there are two new options as of GDAL 3.8:It would be great to support these. The metadata encoding would be useful for my purposes. I don't use timezone information as often, but someone probably would like that.