apache / orc

Apache ORC - the smallest, fastest columnar storage for Hadoop workloads
https://orc.apache.org/
Apache License 2.0

ORC-1635: Try downloading orc-format from dlcdn.apache.org before archive.apache.org #1830

Closed: progval closed this 4 months ago

progval commented 4 months ago

What changes were proposed in this pull request?

Try downloading orc-format from dlcdn.apache.org before archive.apache.org

This replaces https://github.com/apache/orc/pull/1820 which required dlcdn to have the current version.

Why are the changes needed?

https://archive.apache.org/ discourages heavy use, and its rate limits can cause CI systems building Apache ORC to be banned.
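The ordering proposed here can be illustrated with a minimal Python sketch. It is not the CMake change in this pull request; the helper names (`candidate_urls`, `download_first_available`, `fetch_with_urllib`) are hypothetical, and the dlcdn.apache.org path layout is an assumption modeled on the archive.apache.org URL that appears in the build log later in this thread.

```python
from typing import Callable, List, Tuple
from urllib.request import urlopen

# Assumption: the CDN uses the same per-version directory layout as the archive.
DLCDN_BASE = "https://dlcdn.apache.org/orc"
ARCHIVE_BASE = "https://archive.apache.org/dist/orc"


def candidate_urls(version: str) -> List[str]:
    """Return the download URLs in the order they should be tried."""
    name = f"orc-format-{version}"
    return [
        f"{DLCDN_BASE}/{name}/{name}.tar.gz",    # CDN first
        f"{ARCHIVE_BASE}/{name}/{name}.tar.gz",  # archive only as a fallback
    ]


def download_first_available(
    urls: List[str], fetch: Callable[[str], bytes]
) -> Tuple[str, bytes]:
    """Try each URL in order and return (url, data) for the first success."""
    errors = []
    for url in urls:
        try:
            return url, fetch(url)
        except OSError as exc:  # urllib's URLError and HTTPError subclass OSError
            errors.append(f"{url}: {exc}")
    raise OSError("every download failed:\n" + "\n".join(errors))


def fetch_with_urllib(url: str, timeout: float = 30.0) -> bytes:
    """Fetch a URL with an explicit timeout so a blocked host fails quickly."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()
```

Calling `download_first_available(candidate_urls("1.0.0"), fetch_with_urllib)` would contact archive.apache.org only if the CDN request fails, which is the behavior this pull request aims for.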

How was this patch tested?

It builds from a clean repo

Was this patch authored or co-authored using generative AI tooling?

no

dongjoon-hyun commented 4 months ago

Merged to main for Apache ORC 2.1.0. Thank you, @progval.

douardda commented 4 months ago

> +1, LGTM (Pending CIs) for Apache ORC 2.1.0. Merged to main for Apache ORC 2.1.0. Thank you, @progval.

Thanks, but why not merge this to all active branches (at least 1.9, which is advertised as the current release, and 2.0)? Since this is mainly a fix for CI workloads, it makes sense to apply it to the current release version.

progval commented 4 months ago

v1.9 does not download orc-format, so the change is not applicable to that branch. But I agree it would be nice to have it on v2.0.

dongjoon-hyun commented 4 months ago

I fully understand why you (@douardda and @progval) think that way. However, this is categorized as an Improvement rather than a Bug, isn't it? Technically, dlcdn.apache.org is a caching layer without any guarantee of file existence.

From my perspective, the main branch is the most active branch (for development and receiving pull requests) and the one contributing the download traffic to the ASF site. The change is good to have there because main will always be in sync with the orc-format repository and its latest version. However, we don't think that's true for the release branches.

In general, as an open source community:

  1. The Apache ORC community officially follows Semantic Versioning.
  2. For all release branches, we don't backport Improvements, with one exception: the tools module. There was an official decision about the tools module exception on the dev mailing list.
  3. FYI, the feature freeze of Apache ORC v2.0 already happened when we cut the release branch, branch-2.0. After that, we focus on QA activity (testing, bug fixing, and polishing) instead of core changes.

In short, this improvement contribution is simply too late to be part of 2.0. It will be in 2.1.

progval commented 4 months ago

I opened the JIRA ticket as Bug instead of Improvement because I see this as fixing a regression: v1.x didn't download the file, so it builds fine from anywhere; v2.0 downloads the file from archive.apache.org, so it can't be built on CIs.

dongjoon-hyun commented 4 months ago

I believe that we already agreed that your original claim was wrong here (https://github.com/apache/orc/pull/1820#issuecomment-1964748734).

> I opened the JIRA ticket as Bug instead of Improvement because I see this as fixing a regression: v1.x didn't download the file, so it builds fine from anywhere; v2.0 downloads the file from archive.apache.org, so it can't be built on CIs.

I'm wondering if you are still claiming that "it can't be built on CIs". What are you unable to build on CIs?

progval commented 4 months ago

> I'm wondering if you are still claiming that "it can't be built on CIs".

I can now build the main branch, but not the v2.0 branch.

> What are you unable to build on CIs?

I can't build the C++ code in the v2.0 branch of this repository on my CI because I get this error:

13:54:56    -- Downloading...
13:54:56       dst='/var/lib/jenkins/workspace/DGRPH/gitlab-builds/target/debug/build/orcxx-8d1ca2e7d12cd415/orc/orc-format_ep-prefix/src/orc-format-1.0.0.tar.gz'
13:54:56       timeout='none'
13:54:56       inactivity timeout='none'
13:54:56    -- Using src='https://archive.apache.org/dist/orc/orc-format-1.0.0/orc-format-1.0.0.tar.gz'
13:54:56  
13:54:56    --- stderr
13:54:56    CMake Error at orc-format_ep-stamp/download-orc-format_ep.cmake:170 (message):
13:54:56      Each download failed!
13:54:56  
13:54:56        error: downloading 'https://archive.apache.org/dist/orc/orc-format-1.0.0/orc-format-1.0.0.tar.gz' failed
13:54:56              status_code: 28
13:54:56              status_string: "Timeout was reached"
13:54:56              log:
13:54:56              --- LOG BEGIN ---
13:54:56                Trying 65.108.204.189:443...
13:54:56        Trying [2a01:4f9:1a:a084::2]:443...
13:54:56  
13:54:56      Immediate connect fail for 2a01:4f9:1a:a084::2: Cannot assign requested
13:54:56      address
13:54:56  
13:54:56      connect to 65.108.204.189 port 443 failed: Connection timed out
13:54:56  
13:54:56      Failed to connect to archive.apache.org port 443 after 129277 ms: Couldn't
13:54:56      connect to server
13:54:56  
13:54:56      Closing connection 0

after being blocked by archive.apache.org

dongjoon-hyun commented 4 months ago

@progval, that is an issue with your network environment, which doesn't follow the ASF policy correctly.

> after being blocked by archive.apache.org

As I mentioned before, when the Apache ORC community releases Apache ORC Format 1.0.1, your claim will no longer hold, because your network environment will again block you from accessing Apache ORC Format 1.0.0. I recommend fixing your network permanently to allow the archive link.
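The scenario described in this last comment can be modeled with a small Python sketch. It is only a simulation under stated assumptions: the `resolve` helper and the version sets are hypothetical, and the premise that the CDN stops serving 1.0.0 once 1.0.1 is released comes from the comment above, not from any check against the real servers.

```python
from typing import List, Optional, Set, Tuple

# Hosts in the order the patched build is described as trying them.
MIRROR_ORDER = ("dlcdn.apache.org", "archive.apache.org")


def resolve(
    version: str, cdn_versions: Set[str], archive_versions: Set[str]
) -> Tuple[List[str], Optional[str]]:
    """Return (hosts contacted, host that served the file or None)."""
    contacted: List[str] = []
    for host, available in zip(MIRROR_ORDER, (cdn_versions, archive_versions)):
        contacted.append(host)
        if version in available:
            return contacted, host
    return contacted, None


# While 1.0.0 is the latest release, the CDN serves it directly.
before = resolve("1.0.0", {"1.0.0"}, {"1.0.0"})

# Assumed state after a 1.0.1 release: the CDN keeps only the latest version,
# so a build pinned to 1.0.0 falls through to the archive.
after = resolve("1.0.0", {"1.0.1"}, {"1.0.0", "1.0.1"})
```

In this model `before` never contacts archive.apache.org, while `after` does. That is why a build pinned to an older orc-format version would still need access to the archive host.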