OSGeo / gdal

GDAL is an open source MIT licensed translator library for raster and vector geospatial data formats.
https://gdal.org

OGRLayer::GetArrowStream(): add a DATETIME_AS_STRING=YES/NO option #11213

Open rouault opened 1 week ago

rouault commented 1 week ago

Fixes https://github.com/geopandas/pyogrio/issues/487

Fixes #11212

coveralls commented 1 week ago

Coverage Status

coverage: 73.692% (+0.004%) from 73.688% when pulling 5468ef26046bb5e862dd615fb4721f1ecefd27bc on rouault:GetArrowStream_DATETIME_AS_STRING into 971762d900faee5996faa667080cf69e67fd6ea4 on OSGeo:master.

rouault commented 1 week ago

@theroggy @jorisvandenbossche I'm thinking that in this DATETIME_AS_STRING=YES mode, in the ArrowSchema of datetime fields exposed as string (format='u'), we should probably also set the metadata field with a hint for the DateTime semantics. Any suggestion of an appropriate value for it?
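
For illustration, here is a minimal sketch of what a consumer might do with such a string column. It assumes the values are ISO 8601 strings with an optional UTC offset; the thread does not state the exact string format, and the sample values and the `parse_datetime_strings` helper are hypothetical.

```python
from datetime import datetime

# Hypothetical values as they might appear in a string (format='u') column.
# None stands in for a null entry.
raw_values = [
    "2024-03-01T10:00:00+01:00",
    "2024-03-01T10:00:00-05:00",
    "2024-03-01T10:00:00",
    None,
]


def parse_datetime_strings(values):
    """Parse each non-null string, keeping its own UTC offset (or none)."""
    return [None if v is None else datetime.fromisoformat(v) for v in values]


parsed = parse_datetime_strings(raw_values)

# Each value keeps its own offset; timezone-naive values have no tzinfo.
offsets = [None if d is None else d.utcoffset() for d in parsed]
```

Parsing value by value like this keeps per-row offsets that a single Arrow timestamp type, which carries one timezone for the whole column, could not represent.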

jorisvandenbossche commented 1 week ago

Thanks a lot for looking into this!

> we should probably also set the metadata field with a hint for the DateTime semantics. Any suggestion of an appropriate value for it?

Would you just want to indicate that the original GDAL/OGR type was a DateTime? Or is there more information about the column that GDAL can know at that point?
For the type, maybe something like "gdal:type": "DateTime"? (Is there not yet any precedent where you store information like this in any file format?)

rouault commented 1 week ago

> Would you just want to indicate that the original GDAL/OGR type was a DateTime?

Actually, I just remembered that we already have something. https://gdal.org/en/latest/doxygen/classOGRLayer.html#a3ffa8511632cbb7cff06a908e6668f55 mentions:

Starting with GDAL 3.8, the ArrowSchema::metadata field filled by the get_schema() callback may be set with the potential following items:
    "GDAL:OGR:alternative_name": value of OGRFieldDefn::GetAlternativeNameRef()
    "GDAL:OGR:comment": value of OGRFieldDefn::GetComment()
    "GDAL:OGR:default": value of OGRFieldDefn::GetDefault()
    "GDAL:OGR:subtype": value of OGRFieldDefn::GetSubType()
    "GDAL:OGR:width": value of OGRFieldDefn::GetWidth() (serialized as a string)
    "GDAL:OGR:unique": value of OGRFieldDefn::IsUnique() (serialized as "true" or "false")
    "GDAL:OGR:domain_name": value of OGRFieldDefn::GetDomainName()

Those are only filled when they cannot be expressed with an Arrow concept. So logically, that list should be extended with "GDAL:OGR:type": "DateTime" in that situation.
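
As a rough sketch of how a consumer could pick up such a hint: the Arrow C data interface stores `ArrowSchema::metadata` as a binary blob (an int32 count of pairs, then a length-prefixed key and value for each pair, with integers in native endianness). The helper names below are hypothetical, and the `"GDAL:OGR:type"` key is only the value proposed in this thread, not a confirmed part of the API.

```python
import struct


def encode_arrow_metadata(items):
    """Serialize a dict of str -> str into the ArrowSchema::metadata layout."""
    parts = [struct.pack("=i", len(items))]  # number of key/value pairs
    for key, value in items.items():
        k = key.encode("utf-8")
        v = value.encode("utf-8")
        parts.append(struct.pack("=i", len(k)) + k)  # length-prefixed key
        parts.append(struct.pack("=i", len(v)) + v)  # length-prefixed value
    return b"".join(parts)


def decode_arrow_metadata(buf):
    """Parse an ArrowSchema::metadata blob back into a dict."""
    (count,) = struct.unpack_from("=i", buf, 0)
    pos = 4
    out = {}
    for _ in range(count):
        (klen,) = struct.unpack_from("=i", buf, pos)
        pos += 4
        key = buf[pos:pos + klen].decode("utf-8")
        pos += klen
        (vlen,) = struct.unpack_from("=i", buf, pos)
        pos += 4
        out[key] = buf[pos:pos + vlen].decode("utf-8")
        pos += vlen
    return out


def is_datetime_as_string(arrow_format, metadata):
    """True if a utf8 ('u') field carries the proposed DateTime hint."""
    return arrow_format == "u" and metadata.get("GDAL:OGR:type") == "DateTime"


# Round-trip the proposed hint through the binary layout.
blob = encode_arrow_metadata({"GDAL:OGR:type": "DateTime"})
decoded = decode_arrow_metadata(blob)
```

In practice a library such as pyarrow would expose this metadata as a mapping on the field, so the manual decoding is only needed when working directly against the C structs.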