apache / orc

Apache ORC - the smallest, fastest columnar storage for Hadoop workloads
https://orc.apache.org/
Apache License 2.0
665 stars 477 forks source link

ORC-1684: [C++] Find tzdb without TZDIR when in conda-environments #1882

Closed h-vetinari closed 2 months ago

h-vetinari commented 3 months ago

What changes were proposed in this pull request?

Find tzdb without having to set TZDIR when in a conda-environment (where tzdata has a uniform location of $CONDA_PREFIX/share/zoneinfo across all platforms).

Why are the changes needed?

This is due to issues in arrow (see https://github.com/apache/arrow/issues/36026) that cannot really be fixed there, as it assumes that orc >=2.0 knows how to find the tzdb. Having to set TZDIR in all user environments is an intrusive change that should be avoided, and since the cost here is checking a single environment variable, it's hopefully not too onerous for consideration.

How was this patch tested?

CI here

Was this patch authored or co-authored using generative AI tooling?

No

CC @wgtmac

See also: #1587

h-vetinari commented 3 months ago

If desired, I can test this patch on conda-forge infrastructure (build a patched orc that's published in a separate channel, then run https://github.com/conda-forge/arrow-cpp-feedstock/pull/1086 against that).

wgtmac commented 3 months ago

Thanks for the fix! Do you have a JIRA account? I can assign this to you.

h-vetinari commented 2 months ago

Would it be possible/acceptable to create a symlink in CI? I believe we could write a test that exercises the branch for CONDA_PREFIX here similar to https://github.com/apache/orc/blob/b75dcaa060ce4e9cb5bc4600b00bfb8769efc50b/c%2B%2B/test/TestTimezone.cc#L423 but it would need a working tzdb at the end of $CONDA_PREFIX/share/zoneinfo.

h-vetinari commented 2 months ago

Ah, just found https://github.com/apache/orc/blob/b75dcaa060ce4e9cb5bc4600b00bfb8769efc50b/cmake_modules/ThirdpartyToolchain.cmake#L340 So I guess it should be there?

h-vetinari commented 2 months ago

~Interestingly, I cannot inspect any CI logs here. Unfolding any section just gives a red Error:.~

Nevermind, it worked now after 5min...

wgtmac commented 2 months ago

Ah, just found

https://github.com/apache/orc/blob/b75dcaa060ce4e9cb5bc4600b00bfb8769efc50b/cmake_modules/ThirdpartyToolchain.cmake#L340

So I guess it should be there?

That is only used in WIN32 to make test pass on Windows...

h-vetinari commented 2 months ago

That is only used in WIN32 to make test pass on Windows...

Well, then we could still test that the CONDA_PREFIX mechanics work on windows, no? See my last commit, which tries to do this.

h-vetinari commented 2 months ago

OK, so the test I added passed now:

[ RUN      ] TestTimezone.testTzdbFromCondaEnv
[       OK ] TestTimezone.testTzdbFromCondaEnv (0 ms)

but restoring TZDIR still fails:

 unknown file: error: C++ exception with description "Time zone file ������������������������������������������������������������������������������/UTC does not exist. 

Since this is copied from testMissingTZDB, I'm surprised the same error doesn't happen there, because that test is saving tzDirBackup in exactly the same way - i.e. just a pointer to an environment variable which then gets deleted. It seems to me that the restoration of that environment variable cannot succeed (without a deepcopy before deleting TZDIR), and so far that test may have only passed due to checking if (tzDirBackup != nullptr), but - potentially - never actually restored the environment variable...? 🤔

h-vetinari commented 2 months ago

Tests are passing on windows now! 🥳

h-vetinari commented 2 months ago

This is all green now and should be ready for closer review. :)

ffacs commented 2 months ago

Thank you, @h-vetinari . This patch looks good to me if all comments from @wgtmac are fixed.

h-vetinari commented 2 months ago

Thanks for the reviews! I made the requested changes.

h-vetinari commented 2 months ago

I have a last question - I see that there's some sort of caching for the tzdb going on (or at least, there's a test for it): https://github.com/apache/orc/blob/695e0f33609adae0114708c3be756bb7b77f4d79/c%2B%2B/test/TestTimezone.cc#L346-L347

Can we delete that cache somehow to rule out that this is what's causing testTzdbFromCondaEnv to pass? Or is it enough to choose some timezones that haven't been queried yet to bypass the cache?

wgtmac commented 2 months ago

I have a last question - I see that there's some sort of caching for the tzdb going on (or at least, there's a test for it):

https://github.com/apache/orc/blob/695e0f33609adae0114708c3be756bb7b77f4d79/c%2B%2B/test/TestTimezone.cc#L346-L347

Can we delete that cache somehow to rule out that this is what's causing testTzdbFromCondaEnv to pass?

We can either use a different timezone to avoid caching or add a function to invalidate the cache.

h-vetinari commented 2 months ago

Sorry for the force-pushes; I had to figure out the right dates/times (which is a bit harder because we apparently need to specify them in UTC rather than local time; in the latter it would always be around 2/3am).

h-vetinari commented 2 months ago

OK, I think I'm done now from my side. This tests all the things I could think of as relevant.

h-vetinari commented 2 months ago

Thanks a lot for the review and help here @wgtmac! 🙏

Would you consider including this in the 2.0.1 release? I'm probably going to backport this in conda-forge if not, but it would be nicer if it could be included in the release here directly; seeing that it's just about checking a single environment variable, I think the risk here is very low.

wgtmac commented 2 months ago

Yes, I have already ported it to branch-2.0 and will be released in 2.0.1.

h-vetinari commented 2 months ago

Amazing, thanks a lot! 🤩