apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0

Create binary releases #721

Closed. andygrove closed this issue 1 month ago.

andygrove commented 3 months ago

What is the problem the feature request solves?

Comet will be easier to adopt if we publish JAR files to Maven.

Describe the potential solution

No response

Additional context

No response

parthchandra commented 2 months ago

@andygrove @viirya I'm planning to work on this. Here's what I am proposing -

The process will follow the same workflow as we currently have, with two additional steps:

  1. Build an uber jar with a select set of native binaries packaged in the jar
  2. Add a deployment step in Maven to deploy the built jars to the Apache Maven repository.

To build the uber jar

  1. Create a comet-rm Docker image (for "Comet release manager"). The comet-rm image will be based on Ubuntu 20.04 (TODO: confirm this) because the Spark 3.4 release Docker image also uses it. This base image supports the amd64 and arm64v8 (arm64) architectures, both of which we will use. The comet-rm image will package all the dependencies needed to build Comet successfully. The entrypoint for this image will fetch Comet from git and build Linux and macOS native binaries for the architecture.
  2. Create a build-release-binaries script (a sketch follows this list). This will:
    1. Launch a Docker container for each of the amd64 and arm64 architectures, which will build the binaries for both Linux and macOS. We will end up with four native libs (built in two containers).
    2. Copy the binaries built in the containers to the appropriate subdirectories of the local build directories.
    3. Run mvn package to build the jar locally.
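
A minimal sketch of what build-release-binaries could look like, assuming the comet-rm entrypoint writes the native libraries to /output inside the container; the image tag, the /output mount, and the local staging directory are assumptions, not the final design:

```bash
#!/usr/bin/env bash
# Hedged sketch of build-release-binaries. The image tag, the /output mount,
# and NATIVE_LIB_DIR are assumptions about the eventual layout.
set -euo pipefail

NATIVE_LIB_DIR="native-libs"   # hypothetical local staging directory

for arch in amd64 arm64; do
  # Run the comet-rm image for this architecture. Its entrypoint is expected
  # to fetch Comet from git and build the Linux and macOS native libraries.
  # The bind mount stands in for the "copy binaries out of the container" step.
  docker run --rm --platform "linux/${arch}" \
    -v "$(pwd)/${NATIVE_LIB_DIR}/${arch}:/output" \
    comet-rm:latest
done

# With the four native libs staged locally, package the uber jar.
mvn -DskipTests package
```

Running the arm64 container on an amd64 host (and vice versa) relies on QEMU/binfmt emulation being set up on the release manager's machine.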

Deployment to Maven
This will follow the steps outlined in the official guidance from Apache Infra (https://infra.apache.org/publishing-maven-artifacts.html). We do not plan to use the Maven release plugin, but the rest of the guidance is valid. We simply need to add the distributionManagement section to the pom file and document the steps for the release manager to set up their credentials.
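
If the pom inherits from the Apache parent POM, as the Apache Infra guidance assumes, the release manager's deploy step could be as small as the sketch below; the apache-release profile and repository ids come from the Apache parent POM, and credentials are assumed to already be configured in ~/.m2/settings.xml:

```bash
# Hedged sketch: assumes the project inherits the Apache parent POM, which
# supplies the apache-release profile and the distributionManagement entries
# pointing at repository.apache.org. Server credentials (apache.releases.https)
# are expected in ~/.m2/settings.xml per the Apache Infra guidance.
mvn clean deploy -Papache-release -DskipTests
```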

wdyt?

viirya commented 2 months ago

Launch a Docker container for each of the amd64 and arm64 architectures, which will build the binaries for both Linux and macOS. We will end up with four native libs (built in two containers).

For the macOS cross build, we may need to include the macOS SDK extracted from Xcode in the Docker image. So I guess we cannot distribute the Docker image in the open source community.

If so, we need to build the Docker image every time we are going to run the build-release-binaries script.
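
If the image cannot be redistributed because of the SDK, the release script could build it locally at release time and pass the locally extracted SDK tarball in as a build argument; the Dockerfile path, build-arg name, and SDK tarball name below are hypothetical:

```bash
# Hedged sketch: build the comet-rm image locally instead of distributing it.
# The Dockerfile path, MACOS_SDK build-arg, and SDK tarball name are assumptions.
docker build -t comet-rm:latest \
  --build-arg MACOS_SDK=MacOSX.sdk.tar.xz \
  -f dev/release/Dockerfile.comet-rm .
```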

viirya commented 2 months ago

Deployment to Maven This will follow the steps outlined in the official guidance from Apache Infra (https://infra.apache.org/publishing-maven-artifacts.html). We do not plan to use the Maven release plugin, but the rest of the guidance is valid. We simply need to add the distributionManagement section to the pom file and document the steps for the release manager to set up their credentials.

In Spark, these steps are mostly automated by a release script (see the process in https://spark.apache.org/release-process.html#preparing-for-release-candidates).

For example, in the Spark repo, dev/create-release/release-build.sh is used to publish a snapshot release to the Apache snapshots repository, a release to the Apache release repository, etc. I used to do Spark releases by running these scripts.

I took a quick look. For publishing releases, it automates the steps described in the Apache Infra doc. We can probably write a release script for Comet.
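
A Comet release script in the spirit of Spark's dev/create-release/release-build.sh could start as a thin wrapper over the Maven deploy step; the mode names and the use of the apache-release profile are assumptions modeled on the Spark workflow:

```bash
#!/usr/bin/env bash
# Hedged sketch of a Comet release script modeled on Spark's release-build.sh.
# The publish-snapshot / publish-release modes are assumptions.
set -euo pipefail

case "${1:-}" in
  publish-snapshot)
    # Snapshot versions deploy to the Apache snapshots repository via the
    # distributionManagement configuration in the pom.
    mvn clean deploy -DskipTests
    ;;
  publish-release)
    # Release artifacts are signed and staged to repository.apache.org,
    # assuming the apache-release profile from the Apache parent POM.
    mvn clean deploy -Papache-release -DskipTests
    ;;
  *)
    echo "usage: $0 publish-snapshot|publish-release" >&2
    exit 1
    ;;
esac
```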

parthchandra commented 2 months ago

Launch a Docker container for each of the amd64 and arm64 architectures, which will build the binaries for both Linux and macOS. We will end up with four native libs (built in two containers).

For the macOS cross build, we may need to include the macOS SDK extracted from Xcode in the Docker image. So I guess we cannot distribute the Docker image in the open source community.

If so, we need to build the Docker image every time we are going to run the build-release-binaries script.

Yes, we will build the Docker image as part of the release builder script (the Spark release does the same). What I am discovering is that there is no good way to extract the macOS SDK from Xcode/Developer tools in an automated way (see https://github.com/tpoechtrager/osxcross/blob/master/README.md#packaging-the-sdk). I'm assuming that the release manager could be doing the release from either macOS or Linux. Looking to see if there is a tarball with the macOS SDK that is officially available.
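
For the cross-compilation itself, the usual osxcross approach is to add the Apple targets to the Rust toolchain and point Cargo at the osxcross clang wrappers; the wrapper names below are the common osxcross defaults and are an assumption for this setup:

```bash
# Hedged sketch of cross-compiling the native library for macOS from Linux,
# assuming an osxcross toolchain is already on PATH. The o64-clang / oa64-clang
# wrapper names are the usual osxcross defaults, not verified for this image.
rustup target add x86_64-apple-darwin aarch64-apple-darwin

export CARGO_TARGET_X86_64_APPLE_DARWIN_LINKER=o64-clang
export CARGO_TARGET_AARCH64_APPLE_DARWIN_LINKER=oa64-clang

cargo build --release --target x86_64-apple-darwin
cargo build --release --target aarch64-apple-darwin
```

Any C dependencies built through build scripts would additionally need the matching CC/CXX wrappers exported (for example via the cc crate's per-target CC environment variables).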

parthchandra commented 2 months ago

Deployment to Maven This will follow the steps outlined in the official guidance from Apache Infra (https://infra.apache.org/publishing-maven-artifacts.html). We do not plan to use the Maven release plugin, but the rest of the guidance is valid. We simply need to add the distributionManagement section to the pom file and document the steps for the release manager to set up their credentials.

In Spark, these steps are mostly automated by a release script (see the process in https://spark.apache.org/release-process.html#preparing-for-release-candidates).

For example, in the Spark repo, dev/create-release/release-build.sh is used to publish a snapshot release to the Apache snapshots repository, a release to the Apache release repository, etc. I used to do Spark releases by running these scripts.

I took a quick look. For publishing releases, it automates the steps described in the Apache Infra doc. We can probably write a release script for Comet.

Yes, we could. That can happen as a next step to automate the release process.

parthchandra commented 2 months ago

Also discovered a problem with cross-compilation for which there is no solution (https://github.com/rust-lang/rust/issues/114411). When building for macOS on Linux, debug symbols do not get stripped. The build succeeds, but the final jar is about 20 MB larger.