apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[Python] ORC Reader aborts when timezone file is missing #40633

Open WillAyd opened 7 months ago

WillAyd commented 7 months ago

Describe the bug, including details regarding any error messages, version, and platform.

This is an upstream report of https://github.com/pandas-dev/pandas/issues/56292

I noticed when running the pandas test suite I was getting this error:

pandas/tests/io/test_orc.py::test_orc_reader_basic terminate called after throwing an instance of 'orc::TimezoneError'
  what():  Can't open /usr/share/zoneinfo/US/Pacific
Fatal Python error: Aborted

Current thread 0x00007eff1a912780 (most recent call first):

The workaround is to create that timezone file:

$ sudo mkdir -p /usr/share/zoneinfo/US
$ sudo ln -s /usr/share/zoneinfo/America/Los_Angeles /usr/share/zoneinfo/US/Pacific

That said, I think the error should be handled more gracefully than by aborting the process.
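Because the process is terminated by an uncaught C++ exception, a Python `try`/`except` around the read will not intercept it. A minimal pre-flight sketch, assuming zone files are looked up under `/usr/share/zoneinfo` as in the message above; the helper name `missing_zone_files` is hypothetical and is not part of pandas or pyarrow:

```python
from pathlib import Path

# Assumption: zone files live under this root, as in the error message above.
ZONEINFO_ROOT = Path("/usr/share/zoneinfo")


def missing_zone_files(names, root=ZONEINFO_ROOT):
    """Return the zone names that have no regular file under `root`."""
    root = Path(root)
    # is_file() follows symlinks, so a dangling link also counts as missing.
    return [name for name in names if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_zone_files(["US/Pacific", "America/Los_Angeles"])
    if missing:
        print("Missing zone files:", ", ".join(missing))
    else:
        print("All checked zone files are present.")
```

Running a check like this before the ORC tests would let a test suite skip or fail with a readable message instead of aborting the interpreter.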

Component(s)

Python

kou commented 7 months ago

@wgtmac will improve this. See also:

wgtmac commented 7 months ago

This seems to be related to the installed version of tz database on the test machine. I checked my laptop and the path /usr/share/zoneinfo/US/Pacific exists. Could you verify the version by checking /usr/share/doc/tzdata/version file? @WillAyd
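For a programmatic check, here is a rough sketch that reads the release from `tzdata.zi` instead. It assumes the distribution ships that file under `/usr/share/zoneinfo` and that its first line has the form `# version 2024a`; both are assumptions about the local install, and the function names are hypothetical:

```python
from pathlib import Path


def parse_tzdata_zi_version(text):
    """Extract the release from a header line such as '# version 2024a'."""
    lines = text.splitlines()
    first_line = lines[0] if lines else ""
    prefix = "# version "
    if first_line.startswith(prefix):
        return first_line[len(prefix):].strip()
    return None


def tzdata_version(zoneinfo_root="/usr/share/zoneinfo"):
    """Return the release recorded in tzdata.zi, or None if it cannot be read."""
    zi_file = Path(zoneinfo_root) / "tzdata.zi"
    try:
        text = zi_file.read_text(encoding="utf-8", errors="replace")
    except OSError:
        # The file is absent or unreadable on this system.
        return None
    return parse_tzdata_zi_version(text)


if __name__ == "__main__":
    print(tzdata_version())
```

A `None` result only means this particular file could not be read or parsed; it does not by itself say whether the zone files are installed.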

WillAyd commented 7 months ago

That file does not exist for me. This is running Pop!_OS 22.04.

kou commented 7 months ago

Could you try installing the tzdata-legacy package?

WillAyd commented 7 months ago

I don't see that package for 22.04 - I think it first appeared in 23.04?

kou commented 7 months ago

Oh, sorry. Could you install tzdata?

WillAyd commented 7 months ago

It is already installed - tzdata is already the newest version (2024a-0ubuntu0.22.04).

kou commented 7 months ago

Hmm. tzdata must install /usr/share/zoneinfo/US/Pacific: https://packages.ubuntu.com/jammy/all/tzdata/filelist

WillAyd commented 7 months ago

Ah OK - interesting indeed. That must have been deleted off of my system somehow, but I do see that in a recovery OS.

Happy to close this issue if we want to chalk it up to an unsupported system configuration.

aureliobarbosa commented 7 months ago

> Could you try installing the tzdata-legacy package?

I also saw pyarrow crash while processing ORC files because of nonexistent IANA keys. I first hit this when running the pandas test suite locally, but simply trying to read some pre-existing ORC files also crashed both python and ipython. My setup is Ubuntu Mantic, Python 3.11, and tzdata version 2024.1.

At least in my case, installing tzdata-legacy system-wide was enough to get rid of those errors.

rhshadrach commented 1 month ago

I've been debugging this issue and independently found the same solution - installing tzdata-legacy. Just stashing my error message here in case it is helpful for others.

pyarrow.lib.ArrowInvalid: Cannot locate timezone 'US/Eastern': US/Eastern not found in timezone database
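Both failures in this thread involve legacy `US/...` names whose canonical `America/...` counterparts may still be installed. A small sketch for telling those cases apart, assuming zone files live under `/usr/share/zoneinfo`; `diagnose_zone` and the two-entry alias table are hypothetical illustrations, not Arrow or tzdata APIs:

```python
from pathlib import Path

# Hand-written sample of legacy aliases and their canonical names.
# Only the two names mentioned in this thread are listed.
LEGACY_ALIASES = {
    "US/Pacific": "America/Los_Angeles",
    "US/Eastern": "America/New_York",
}


def diagnose_zone(name, root="/usr/share/zoneinfo"):
    """Classify a zone name as 'present', 'legacy alias missing', or 'missing'."""
    root = Path(root)
    if (root / name).is_file():
        return "present"
    canonical = LEGACY_ALIASES.get(name)
    # The alias is absent but its canonical zone exists: only the link is gone.
    if canonical is not None and (root / canonical).is_file():
        return "legacy alias missing"
    return "missing"


if __name__ == "__main__":
    for zone in LEGACY_ALIASES:
        print(zone, "->", diagnose_zone(zone))
```

A result of "legacy alias missing" suggests that installing `tzdata-legacy` (where the distribution provides it) or recreating the link may be enough, whereas "missing" points at the tz database itself.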