datafusion-contrib / datafusion-orc

Implementation of Apache ORC file format use Apache Arrow in-memory format
Apache License 2.0
28 stars 8 forks source link

Fix Timestamp to align with ORC spec #76

Open Jefffrey opened 3 months ago

Jefffrey commented 3 months ago

ORC timestamp type is not straightforward, as though it apparently represents a timestamp without a timezone, its encoding & decoding is still dependent on the writer timezone (encoded in the stripe) and the reader timezone.

This is why the 1900 & 2038 integration tests are currently disabled as our implementation is incorrect:

https://github.com/datafusion-contrib/datafusion-orc/blob/3369b5d7bb110dafde64288416c86bc8c996e465/tests/integration/main.rs#L155-L165

Jefffrey commented 3 months ago

FYI I've raised a PR to update ORC spec for timestamp: https://github.com/apache/orc/pull/1867

Since that was what caused initial confusion for me, as some key implementation details were not mentioned in the spec leading to the initial inaccurate implementation

Also worth noting how arrow deals with this:

https://github.com/apache/arrow/blob/24feab091ab5a05b1cec234f51bd0223e2c41487/cpp/src/arrow/adapters/orc/adapter.cc#L178-L181

https://github.com/apache/arrow/issues/34590

Jefffrey commented 3 months ago

Note that the 2038 test is dependent on https://github.com/chronotope/chrono-tz/issues/155 as it tests timestamps 2100 and beyond, which chrono_tz currently has inaccurate DST for

Edit: https://github.com/chronotope/chrono-tz/issues/135

Jefffrey commented 3 months ago

Partially resolved by https://github.com/datafusion-contrib/datafusion-orc/commit/60288cdee57f72dec84c3fd9f6085561568aad49

Still pending chrono_tz fix for large date DST