datafusion-contrib / datafusion-orc

Implementation of the Apache ORC file format using the Apache Arrow in-memory format
Apache License 2.0

Timestamp instant support #13

Closed: Jefffrey closed 3 months ago

Jefffrey commented 8 months ago

See "Timestamp with local time zone" here: https://orc.apache.org/docs/types.html

Jefffrey commented 3 months ago

The writer timezone seems to be encoded at the stripe level. That is problematic if it means the timezone for a column can vary between stripes: Arrow encodes the timezone in the data type, so it would need to be consistent across all data. Need to investigate this more.

Jefffrey commented 3 months ago

It seems I misunderstood. According to:

The writer timezone in the stripe is used for regular Timestamp, as Timestamp instants are in UTC timezone.
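To make the distinction concrete, here is a minimal sketch of how the two ORC timestamp kinds could map onto an Arrow-style type that carries the timezone in the data type. `OrcTimestampKind`, `ArrowTimestampType` and `arrow_type_for` are hypothetical names written for this illustration, not the crate's or Arrow's actual API, and leaving the timezone unset for the regular Timestamp is an assumption.

```rust
/// Hypothetical stand-in for the two ORC timestamp kinds discussed in this issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum OrcTimestampKind {
    /// Regular Timestamp: decoded with the writer timezone stored in the stripe.
    Timestamp,
    /// Timestamp instant ("Timestamp with local time zone"): always UTC.
    TimestampInstant,
}

/// Hypothetical, simplified stand-in for an Arrow timestamp data type.
/// The timezone is part of the type, so it must be the same for every stripe.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ArrowTimestampType {
    pub timezone: Option<String>,
}

/// Picks the timezone to put in the (hypothetical) Arrow type.
/// Assumption: regular timestamps carry no timezone in the type, while
/// instants are tagged "UTC" because their values are defined in UTC.
pub fn arrow_type_for(kind: OrcTimestampKind) -> ArrowTimestampType {
    match kind {
        OrcTimestampKind::Timestamp => ArrowTimestampType { timezone: None },
        OrcTimestampKind::TimestampInstant => ArrowTimestampType {
            timezone: Some("UTC".to_string()),
        },
    }
}
```

Under these assumptions the instant type never depends on the per-stripe writer timezone, so the column type stays consistent across stripes.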

Jefffrey commented 3 months ago

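Separate note: encoding as Timestamp(Nanoseconds) severely limits the range representable in Arrow, since a signed 64-bit count of nanoseconds from the Unix epoch only covers roughly the years 1677 to 2262. Need to keep this in mind.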
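As a rough illustration of that limit, the sketch below works out how many whole days on either side of the Unix epoch fit in an `i64` count of nanoseconds, and shows a checked conversion that reports out-of-range values instead of wrapping. The function names are hypothetical and this is not the conversion the crate performs.

```rust
const NANOS_PER_SECOND: i64 = 1_000_000_000;
const SECONDS_PER_DAY: i64 = 86_400;

/// Whole days before and after the Unix epoch that an i64 nanosecond
/// count can reach (integer division truncates toward zero).
pub fn nanosecond_range_days() -> (i64, i64) {
    let nanos_per_day = NANOS_PER_SECOND * SECONDS_PER_DAY;
    (i64::MIN / nanos_per_day, i64::MAX / nanos_per_day)
}

/// Converts (seconds since epoch, nanosecond fraction) into i64 nanoseconds.
/// Returns None when the value does not fit, rather than silently wrapping.
pub fn to_epoch_nanos(seconds: i64, nanos: u32) -> Option<i64> {
    // Widen to i128 so the intermediate product cannot overflow.
    let total = i128::from(seconds) * i128::from(NANOS_PER_SECOND) + i128::from(nanos);
    i64::try_from(total).ok()
}
```

`nanosecond_range_days()` yields about 106,751 days in each direction, a little over 292 years. A timestamp such as 3000-01-01 (32,503,680,000 seconds after the epoch) makes `to_epoch_nanos` return `None`.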

Jefffrey commented 3 months ago

Added by https://github.com/datafusion-contrib/datafusion-orc/commit/18880157be312f4720f1a6f3b961a87abcfae6a7