Open bdice opened 1 month ago
Thank you for reporting, @bdice .
cc @williamhyun , @wgtmac , too.
To @bdice, according to our official Java tool, the type of column `time` is `timestamp` without timezone, isn't it?
```
$ orc-tools version
ORC 2.0.2
$ orc-tools meta ./examples/TestOrcFile.testDate1900.orc | grep Type
Processing data file examples/TestOrcFile.testDate1900.orc [length: 30941]
Type: struct<time:timestamp,date:date>
```
Please see here. Given that there is no timezone, I'm not sure if the root cause is the file.
ORC includes two different forms of timestamps from the SQL world:
- Timestamp is a date and time without a time zone, which does not change based on the time zone of the reader.
- Timestamp with local time zone is a fixed instant in time, which does change based on the time zone of the reader.
Instead, it looks like a C++ library issue, because `orc-metadata` is based on the C++ library. BTW, ORC-1481 was already fixed in Apache ORC 2.0.0. Do you mean that you hit this issue with Apache ORC 2.0+?
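To make the two timestamp forms quoted above concrete, here is a minimal Python sketch. It does not use any ORC API; the fixed-offset zones `READER_WEST` and `READER_EAST` and the helper `render_for_reader` are hypothetical names chosen for illustration, and the sample value is arbitrary.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical reader zones, modelled as fixed offsets so that no tzdata
# lookup is needed for this illustration.
READER_WEST = timezone(timedelta(hours=-8))
READER_EAST = timezone(timedelta(hours=9))

# "Timestamp": a wall-clock date and time with no zone attached.
wall_clock = datetime(1900, 5, 5, 12, 34, 56)

# "Timestamp with local time zone": a fixed instant, held here in UTC.
instant = datetime(1900, 5, 5, 12, 34, 56, tzinfo=timezone.utc)


def render_for_reader(value, reader_tz):
    """Return the ISO string a reader in `reader_tz` would display."""
    if value.tzinfo is None:
        # No zone attached: the fields are shown unchanged for every reader.
        return value.isoformat()
    # Fixed instant: convert to the reader's zone, then drop the offset.
    return value.astimezone(reader_tz).replace(tzinfo=None).isoformat()


if __name__ == "__main__":
    for label, tz in (("west", READER_WEST), ("east", READER_EAST)):
        print(label, render_for_reader(wall_clock, tz), render_for_reader(instant, tz))
```

The wall-clock value prints identically for both readers, while the instant is shifted by each reader's offset.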
It looks like a breaking change in timezone names from TZDB. I will take a look. cc @ffacs
Thank you so much, @wgtmac .
https://bugs.launchpad.net/ubuntu/+source/tzdata/+bug/2058249 explains the root cause: `tzdata` has moved timezone files like `US/Pacific` to a separate `tzdata-legacy` package, intentionally without providing symlinks, so it is a breaking change for legacy ORC files. At the same time, some downstream projects depending on the Apache ORC C++ library use ORC files from https://github.com/apache/orc/tree/main/examples for CI validation. These CI jobs start to fail once they upgrade to Ubuntu 24.04, which uses the new version of `tzdata` without `tzdata-legacy` installed.
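For anyone who wants to check whether a CI image is affected, here is a minimal Python sketch that looks for the zone files on disk. It assumes the zone data lives under `/usr/share/zoneinfo` unless the `TZDIR` environment variable points elsewhere; whether a given reader consults exactly that location is an assumption, and the helper names `zone_file_exists` and `missing_zones` are hypothetical.

```python
import os
from pathlib import Path


def zone_file_exists(name, tzdir=None):
    """Return True if a zone file for `name` exists under the zoneinfo directory.

    Assumption: zone data is under TZDIR if set, else /usr/share/zoneinfo.
    """
    root = Path(tzdir or os.environ.get("TZDIR", "/usr/share/zoneinfo"))
    # is_file() follows symlinks, so a symlinked legacy name also counts.
    return (root / name).is_file()


def missing_zones(names, tzdir=None):
    """Return the subset of `names` that has no zone file, in input order."""
    return [name for name in names if not zone_file_exists(name, tzdir)]


if __name__ == "__main__":
    missing = missing_zones(["US/Pacific", "America/Los_Angeles"])
    if missing:
        print("Missing zone files:", ", ".join(missing))
    else:
        print("All checked zone files are present.")
```

On an Ubuntu 24.04 image without `tzdata-legacy`, one would expect `US/Pacific` to be reported as missing while `America/Los_Angeles` is present, per the bug report above.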
IMO, we should not change `TestOrcFile.testDate1900.orc`, as it is a good example to check whether `tzdata-legacy` is required. One thing that I don't understand is that we have CI jobs running on Ubuntu 24.04, but they do not fail.
> IMO, we should not change `TestOrcFile.testDate1900.orc` as it is a good example to check if `tzdata-legacy` is required.
That is fine with me! I have worked around this by installing `tzdata-legacy` on Ubuntu 24.04. I can see the potential value here. I am okay with closing this issue with no action, if that is acceptable to others.
Another possible course of action would be to leave `TestOrcFile.testDate1900.orc` as-is, and update the timezone names in `TestOrcFile.testDate2038.orc` (currently also using `US/Pacific`).
@bdice I think we can keep those files, as they were created by legacy writers: `"format": "0.12", "writer version": "HIVE-8732", "software version": "ORC Java"`. We can use the latest writer to generate a new file with equivalent data but with new timezone names.
The example ORC files use a timezone of `US/Pacific`, which is no longer included in all Linux distributions. Ubuntu 24.04, for example, has moved this to a separate `tzdata-legacy` package. This can cause issues for ORC file readers on systems missing that legacy time zone data.

Should the example ORC files be updated to use a more current time zone name, like `America/Los_Angeles`?

Verifying the time zone in the stripe footers:
Additional context
- https://bugs.launchpad.net/ubuntu/+source/tzdata/+bug/2058249
- https://github.com/apache/arrow/issues/40633
- https://github.com/pandas-dev/pandas/issues/56292
- https://github.com/rapidsai/cudf/pull/16998#issuecomment-2400980607