apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0
14.34k stars, 3.48k forks

[Format]: Support an official "timestamp with time zone offset" type #44248

Open CurtHagenlocher opened 7 hours ago

CurtHagenlocher commented 7 hours ago

Describe the enhancement requested

The relational databases Snowflake, MSSQL, Oracle, Teradata, and SAP SQL Anywhere all support a data type that stores both a timestamp and a time zone offset. This differs from the existing Arrow timestamp type in two ways: each individual value in the column can have a different offset, and the value is not tied to a geopolitical time zone. This type also appears in Java as OffsetDateTime and in .NET as DateTimeOffset. Given how commonly it appears, it would be nice to have a standard way to represent it in Arrow.

This could be done as an extension type for a structure consisting of separate 8-byte timestamp and 2-byte offset values, or as a new first-class type. Intervals are a structure with some similarity to this type and were done as a first-class type, but they also predate the extension type mechanism.

Component(s)

Format

rok commented 5 hours ago

For the record, Arrow can currently store a time zone offset per array, as a string (see here). To store per-value offsets, an extension type sounds like a good idea. What temporal resolution would you propose? Minutes would fit into two bytes, I suppose.

CurtHagenlocher commented 5 hours ago

The standard seems to be support for minute-level resolution with a range of something like -14:00 to +14:00. Storing the number of minutes as an int16 seems right.
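A quick check of the arithmetic: a range of -14:00 to +14:00 at minute resolution is -840 to +840 minutes, well inside the int16 range of -32768 to 32767. The sketch below converts an offset string to minutes and back to a Python `tzinfo`. The helper names and the strict `+HH:MM` / `-HH:MM` input format are assumptions for illustration, not part of any proposal.

```python
from datetime import timedelta, timezone

INT16_MIN, INT16_MAX = -(2 ** 15), 2 ** 15 - 1
MIN_OFFSET_MINUTES, MAX_OFFSET_MINUTES = -14 * 60, 14 * 60  # -840 .. +840


def parse_offset_minutes(text: str) -> int:
    """Convert a '+HH:MM' or '-HH:MM' string to a signed minute count."""
    if len(text) != 6 or text[0] not in "+-" or text[3] != ":":
        raise ValueError(f"expected +HH:MM or -HH:MM, got {text!r}")
    sign = -1 if text[0] == "-" else 1
    total = sign * (int(text[1:3]) * 60 + int(text[4:6]))
    if not MIN_OFFSET_MINUTES <= total <= MAX_OFFSET_MINUTES:
        raise ValueError(f"offset {text!r} is outside -14:00..+14:00")
    return total


def offset_to_tzinfo(offset_minutes: int) -> timezone:
    """Build a fixed-offset tzinfo from a minute count."""
    return timezone(timedelta(minutes=offset_minutes))


# The whole range fits in a signed 16-bit integer.
assert INT16_MIN <= MIN_OFFSET_MINUTES and MAX_OFFSET_MINUTES <= INT16_MAX

india = parse_offset_minutes("+05:30")     # 330
westmost = parse_offset_minutes("-14:00")  # -840
```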

rok commented 5 hours ago

Adding an extension type would start with opening a PR against CanonicalExtensions.rst that describes the proposed type, followed by a call for discussion and a vote on the mailing list (e.g. 8-bit boolean). It might make sense to wait for more people to chime in before doing so, though.