apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0
14.51k stars 3.53k forks

Provide more ways to publish Binary Artifacts #40760

Open laozhoubuluo opened 7 months ago

laozhoubuluo commented 7 months ago

Describe the enhancement requested

Currently, apache.jfrog.io is the only release channel for the Binary Artifacts of Apache Arrow, so if apache.jfrog.io stops serving, there is no other way to obtain them. However, apache.jfrog.io has not been a stable service. The following issues record multiple outages in recent years.

https://github.com/apache/arrow/issues/12686
https://github.com/apache/arrow/issues/34675
https://github.com/apache/arrow/issues/40744
https://github.com/apache/arrow/issues/40759

Because this problem has occurred many times, it has seriously affected downstream software that relies on Apache Arrow. Could we add at least one other release channel for the Binary Artifacts of Apache Arrow, such as GitHub Releases or downloads.apache.org, so that apache.jfrog.io is no longer the only one? The installation instructions could also describe alternative ways to obtain the Binary Artifacts, even if they require more cumbersome steps such as installing packages manually with dpkg.

Doing this would reduce risk in the downstream software supply chain: today, a single component that cannot be installed can prevent the entire software from being installed, or even stop it from working normally.
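To make the request concrete, here is a minimal sketch of what a downstream installer could do once a second channel exists: try each channel in order and use the first one that responds. The function names and the URLs in the usage note are hypothetical; no alternative channel exists yet.

```python
from typing import Callable, Iterable, Optional
import urllib.request


def _http_get(url: str, timeout: float = 30.0) -> bytes:
    """Download a URL with the standard library and return the body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_with_fallback(
    urls: Iterable[str],
    fetch: Optional[Callable[[str], bytes]] = None,
) -> tuple[str, bytes]:
    """Return (url, content) from the first release channel that responds.

    `fetch` is injectable so the logic can be exercised without a network.
    """
    if fetch is None:
        fetch = _http_get
    errors = []
    for url in urls:
        try:
            return url, fetch(url)
        except OSError as exc:  # URLError, HTTPError and timeouts are OSError subclasses
            errors.append(f"{url}: {exc}")
    raise RuntimeError("all release channels failed:\n" + "\n".join(errors))
```

A caller would pass the primary channel first and any fallback after it, for example `fetch_with_fallback([primary_url, mirror_url])`, where both URLs are placeholders for whatever channels the project ends up publishing to.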

Component(s)

Release

assignUser commented 7 months ago

+1. I talked about this with @jbonofre yesterday, and he mentioned that while repository.apache.org currently only hosts Java binaries, the underlying Nexus software can host a number of package repos (rpm, deb, python, ...).

GitHub Releases could certainly also be a fallback for some things, but they don't offer the repo functionality we need for, at a minimum, the Linux packages.

cc @raulcd @kou

jbonofre commented 7 months ago

I think we have different options: nexus, dist, gh.

Let me investigate a bit what could be the easiest one to integrate in our build.

kou commented 7 months ago

while repository.apache.org currently only hosts Java binaries, the underlying Nexus software can host a number of package repos (rpm, deb, python, ...).

Could you share a URL for documentation on how to use repository.apache.org for RPM/deb/wheel?

I think we have different options: nexus, dist, gh.

I think that we can't use dist.apache.org because our binaries are too large for it.

I think that we can use GitHub Releases for some binaries (which don't require metadata for package repository) but we can't use GitHub Releases for others (which require metadata for package repository, e.g. RPM/deb/wheel).

assignUser commented 7 months ago

I assume we can use the API to upload binaries once a matching repository is created, but I haven't looked into it in detail or spoken with infra.

kou commented 7 months ago

Thanks.

It seems that we can use deb with Nexus Repository Manager 3 or later: https://help.sonatype.com/en/repository-manager-feature-matrix.html

It seems that https://repository.apache.org/ uses Nexus Repository Manager 2:

Nexus Repository Manager 2.14.20-02

Could you ask INFRA whether there is a plan to upgrade repository.apache.org or not?

kou commented 7 months ago

I think that we can use GitHub Releases for some binaries (which don't require metadata for package repository) but we can't use GitHub Releases for others (which require metadata for package repository, e.g. RPM/deb/wheel).

I was wrong. We can use GitHub Releases for wheel because we publish the voted wheels to https://pypi.org/.

jbonofre commented 7 months ago

No need to upgrade to Nexus 3: even with Nexus 2 we can upload any kind of file, via HTTPS for instance. No need to use the Nexus API. The Maven release plugin is "just" an HTTPS client (via Aether).

For instance, in Apache Karaf, I publish features XML, tar.gz, zip, etc.

Manually, it's possible to use mvn deploy:deploy-file, providing the artifact type, etc.
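As an illustration of the manual route, the sketch below assembles a `mvn deploy:deploy-file` command line for a non-JAR artifact. The `-D` properties are the deploy plugin's standard parameters; the helper name and every value in the usage note (file name, coordinates, repository id, URL) are hypothetical and would need to match the project's actual Maven settings.

```python
import shlex
from typing import Optional


def deploy_file_command(
    file: str,
    group_id: str,
    artifact_id: str,
    version: str,
    packaging: str,
    repository_id: str,
    url: str,
    classifier: Optional[str] = None,
) -> list[str]:
    """Build the argument list for `mvn deploy:deploy-file`."""
    cmd = [
        "mvn",
        "deploy:deploy-file",
        f"-Dfile={file}",
        f"-DgroupId={group_id}",
        f"-DartifactId={artifact_id}",
        f"-Dversion={version}",
        f"-Dpackaging={packaging}",  # the artifact type, e.g. "deb" or "rpm"
        f"-DrepositoryId={repository_id}",  # must match a <server> id in settings.xml
        f"-Durl={url}",
    ]
    if classifier:
        # Optional qualifier, e.g. to distinguish builds per distribution
        cmd.append(f"-Dclassifier={classifier}")
    return cmd


def as_shell(cmd: list[str]) -> str:
    """Render the command as a copy-pasteable shell line."""
    return shlex.join(cmd)
```

For example, `as_shell(deploy_file_command(file="example.deb", group_id="org.apache.arrow", artifact_id="example", version="0.0.0", packaging="deb", repository_id="example-repo", url="https://repo.example.invalid/deploy"))` yields a single line that could be pasted into a terminal; passing the list to `subprocess.run` would execute it directly.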

kou commented 7 months ago

Could you tell us which files are uploaded to repository.apache.org? It seems that the files listed in https://karaf.apache.org/download.html use dist.apache.org, not repository.apache.org.

jbonofre commented 7 months ago

At Karaf (like most other Apache projects) we are using both:

We do almost the same in Arrow: the source distributions are on dist (https://dist.apache.org/repos/dist/release/arrow/).

By the way, as dist.apache.org artifacts are automatically copied to archives.apache.org, dist.apache.org should only contain the latest releases (for instance, 15.0.0, 15.0.1, etc. should be deleted from dist.apache.org).

raulcd commented 7 months ago

Yes, sorry, I have to run the remove artifacts task from the post release tasks. I'll do it today.

raulcd commented 7 months ago

I've removed the old releases from dist.apache.org.

jbonofre commented 7 months ago

@raulcd Thanks! Much appreciated! And no worries at all 😄

kou commented 7 months ago

Thanks.

If we use repository.apache.org for .deb/.rpm, we would use https://repo1.maven.org/maven2/org/apache/arrow/debian/ and so on for the APT/Yum repositories, right? Hmm. Can we use an apache.org domain instead of the maven.org domain?

FYI: our upload script for Java, https://github.com/apache/arrow/blob/main/dev/release/06-java-upload.sh, uses mvn deploy:deploy-file.

jbonofre commented 7 months ago

maven.org is an alias to https://repository.apache.org/content/groups/public/ so yeah, we can use repository.apache.org name.
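For reference, a small sketch of how Maven coordinates map to a download URL under the standard Maven 2 repository layout, which is where a file deployed this way would end up. The helper name and the artifact coordinates in the usage note are hypothetical; only the base URL and the `org.apache.arrow` group come from this thread.

```python
from typing import Optional


def maven_artifact_url(
    base_url: str,
    group_id: str,
    artifact_id: str,
    version: str,
    extension: str,
    classifier: Optional[str] = None,
) -> str:
    """Map Maven coordinates to a URL using the Maven 2 repository layout.

    Layout: <base>/<group as path>/<artifact>/<version>/<artifact>-<version>[-<classifier>].<ext>
    """
    group_path = group_id.replace(".", "/")
    file_name = f"{artifact_id}-{version}"
    if classifier:
        file_name += f"-{classifier}"
    return (
        f"{base_url.rstrip('/')}/{group_path}/{artifact_id}/{version}/"
        f"{file_name}.{extension}"
    )
```

Calling it with `base_url="https://repository.apache.org/content/groups/public/"`, `group_id="org.apache.arrow"` and placeholder values for the rest gives a path that starts with `.../org/apache/arrow/`. This layout addresses individual files; APT and Yum clients additionally expect repository metadata (for example Release/Packages files or repodata/), so serving .deb/.rpm this way would still need that metadata to be published alongside.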