OSGeo / gdal

GDAL is an open source MIT licensed translator library for raster and vector geospatial data formats.
https://gdal.org
Other
4.69k stars 2.48k forks source link

ArrowStream interface: support for timezones #8460

Closed jorisvandenbossche closed 9 months ago

jorisvandenbossche commented 10 months ago

Expected behavior and actual behavior.

Assume a geojson file test_datetime_tz.geojson with content:

{
"type": "FeatureCollection",
"features": [
{ "type": "Feature", "properties": { "col": "2020-01-01T09:00:00-05:00" }, "geometry": { "type": "Point", "coordinates": [ 1.0, 1.0 ] } },
{ "type": "Feature", "properties": { "col": "2020-01-01T10:00:00-05:00" }, "geometry": { "type": "Point", "coordinates": [ 2.0, 2.0 ] } }
]
}

and reading this with the ArrowStream interface:

>>> from osgeo import ogr
>>> ds = ogr.Open("test_datetime_tz.geojson")
>>> lyr = ds.GetLayer(0)
>>> stream = lyr.GetArrowStreamAsPyArrow()
>>> list(stream)[0]
<pyarrow.lib.StructArray object at 0x7fedaa11dc00>
-- is_valid: all not null
-- child 0 type: int64
  [
    0,
    1
  ]
-- child 1 type: timestamp[ms]
  [
    2020-01-01 09:00:00.000,
    2020-01-01 10:00:00.000
  ]
-- child 2 type: binary
  [
    0101000000000000000000F03F000000000000F03F,
    010100000000000000000000400000000000000040
  ]

We get tz-naive timestamp values in the "local" time (wall time), not the UTC-equivalent values (after applying the offset). While the "traditional" field based access with OGR_F_GetFieldAsDateTimeEx will give you timezone information.

It's not necessarily clear what would be the best behaviour here. GDAL doesn't actually store or know the time "zone" (like Europe/Brussels), only an offset per value (which is not necessarily constant for the full column). And thus it cannot return an Arrow timestamp type with a zone as tz. It could return a timestamp with a fixed offset (but that's only possible if all offsets are the same, which you only know when parsing the last value, and so this would require a first pass to check this), or return timestamp[ms, tz=UTC] type (this doesn't preserve the wall time, but it does preserve the actual point in time). Although I can imagine that it depends on the use case whether you would prefer getting wall-time tz-naive timestamps or tz-aware UTC timestamps.

GDAL version and provenance

GDAL 3.7.2 installed with conda forge

jorisvandenbossche commented 10 months ago

Probably not directly related, but I also noticed that the equivalent GPKG file (oogr2ogr test_datetime_tz.gpkg test_datetime_tz.geojson) gives a warning when reading with Arrow:

Warning 1: Non-conformant content for record 1 in column col, 2020-01-01T09:00:00.000-05:00, successfully parsed
jratike80 commented 10 months ago

Datetime in GeoPackage must be in UTC by the standard, I guess that's the reason for the warning.

ISO-8601 date/time string in the form YYYY-MM-DDTHH:MM:SS.SSSZ with T separator character and Z suffix for coordinated universal time (UTC) encoded in either UTF-8 or UTF-16. See TEXT.

Have a look at the DATETIME_FORMAT in https://gdal.org/drivers/vector/gpkg.html#dataset-creation-options. Could there be something re-usable?

jorisvandenbossche commented 10 months ago

OK, that's interesting. But in this case, it's GDAL itself that created the file, so if the spec requires UTC, shouldn't GDAL write that? (by converting any other offset to UTC) Or is it expected that the user knows to convert first to UTC before writing to GPKG?

rouault commented 10 months ago

But in this case, it's GDAL itself that created the file, so if the spec requires UTC, shouldn't GDAL write that? (by converting any other offset to UTC) Or is it expected that the user knows to convert first to UTC before writing to GPKG?

GDAL kind of uses an informal extension to write non-UTC date times, if the user provides them as such. If a user wants to produce a fully compliant file, he can pass DATETIME_FORMAT=UTC as a creation option and GDAL will do the conversion. Anyway that's not directly related to that issue, which I'm working on

jorisvandenbossche commented 10 months ago

@jratike80 I see your added second paragraph now. And indeed, the DATETIME_FORMAT param clearly mentions that it defaults to writing datetimes with timezone, while that is not spec-compliant. And you can use DATETIME_FORMAT="UTC" to get a spec compliant one. Thanks for pointing that out, that can resolve the warning I got for GPKG.

EDIT: and I too slow with typing, in the meantime Even answered more or less the same ;)

rouault commented 10 months ago

with the fix in #8461, I now get:

-- is_valid: all not null
-- child 0 type: int64
  [
    0,
    1
  ]
-- child 1 type: timestamp[ms, tz=-05:00]
  [
    2020-01-01 09:00:00.000,
    2020-01-01 10:00:00.000
  ]
-- child 2 type: binary
  [
    0101000000000000000000F03F000000000000F03F,
    010100000000000000000000400000000000000040
  ]

GeoJSON is the only driver that does a full scan of the file to establish the set of fields on opening, hence the timezone can be known. For GeoPackage, the timezone in the schema will be advertized as UTC, and conversions will be done if necessary For other drivers, the current behaviour will be preserved: unknown time zone in the schema, and value of the feature written unmodified EDIT: (except Arrow and Parquet), for which this comes for free

jratike80 commented 10 months ago

the DATETIME_FORMAT param clearly mentions that it defaults to writing datetimes with timezone

I would rather say that it defaults to writing datetimes as they appear in the source data. I think that if the default was automatic conversion into UTC, then it would make many users confused and unhappy because if the time is around midnight then the date can change by the same.

jorisvandenbossche commented 10 months ago

@rouault thanks for that! The PR looks good.

For other drivers, the current behaviour will be preserved: unknown time zone in the schema, and value of the feature written unmodified

In general, additional timezone support in the arrowstream output (as you did now for the mentioned formats) is something that would need to be added per driver? (although I don't know if there are many others that do support timezone in the format, Shapefile don't even support datetimes AFAIK)

EDIT: (except Arrow and Parquet), for which this comes for free

Although I assume (from reading the code, didn't try it out, so might be wrong) that a timestamp field in such a file that would eg have a timezone of "Europe/Brussels" will be set to OGR_TZFLAG_UNKNOWN in the internal data model? And thus come out as the original value as stored. But since that stored value for Arrow IPC and Parquet file formats is always UTC, it might be more correct to set the TZFLAG to UTC if the timezone is set but not a fixed offset?

jratike80 commented 10 months ago

I think that I now understand what @jorisvandenbossche means with the timezones. Obviously they are the entries in the tz database https://en.wikipedia.org/wiki/List_of_tz_database_time_zones. I had been thinking that the offsets like +02:00 mean the same as timezones but that would have been too easy.

If GDAL would support the tz timezones, doesn't it mean that it should support them properly, so that the Daylight saving times would are also handled correctly? The offset to UTC in Europe/Brussels times can be either +01:00 or +02:00 depending on the day of year. I do not know but I fear that they depend also on the year, because the rules for DST may have changed.

It may be too late to change anything, but isn't it so, that GDAL does not know anything about the tz timezones? And the best that GDAL can do is to deal with the UTC Offsets? So maybe it would be more exact to use that term in the documentation as well: "For other drivers, the current behaviour will be preserved: unknown UTC Offset in the schema"

rouault commented 10 months ago

In general, additional timezone support in the arrowstream output (as you did now for the mentioned formats) is something that would need to be added per driver?

yes, drivers can either set OGRFieldDefn::SetTZFlag(), or override manually the TIMESTAMP GetArrowStream() option. I've also declare it as public option

Although I assume (from reading the code, didn't try it out, so might be wrong) that a timestamp field in such a file that would eg have a timezone of "Europe/Brussels" will be set to OGR_TZFLAG_UNKNOWN in the internal data model? And thus come out as the original value as stored. But since that stored value for Arrow IPC and Parquet file formats is always UTC, it might be more correct to set the TZFLAG to UTC if the timezone is set but not a fixed offset?

I've just done that in a new commit

GDAL does not know anything about the tz timezones

Not more than (fixed) offsets to UTC. It has no knowledged of things like Europe/Brussels and DST.