aws / random-cut-forest-by-aws

An implementation of the Random Cut Forest data structure for sketching streaming data, with support for anomaly detection, density estimation, imputation, and more.
https://github.com/aws/random-cut-forest-by-aws
Apache License 2.0
210 stars 33 forks source link

Add a CD pipeline that releases this project to maven #303

Closed dblock closed 2 years ago

dblock commented 2 years ago

Currently this project is built on a developer machine, and occasionally pushed to maven.

  1. Add a CD process that publishes a -SNAPSHOT to a snapshot maven repo with every commit.
  2. Add a CD process that publishes a release to maven central on demand.
dblock commented 2 years ago

This will close https://github.com/aws/random-cut-forest-by-aws/issues/301

kaituo commented 2 years ago

Discussed with DB, he suggested:

AD references X.Y-SNAPSHOT in its repo during development (picks up latest code), ensuring that the changes in RCF are constantly tested and bugs caught early.

ylwu-amzn commented 2 years ago

Should we use similar way of OpenSearch to publish RCF to maven? I think it will be good if OpenSearch build workflow can cover RCF. If we shouldn't include RCF in OpenSearch build workflow, we should confirm current RCF manual publish workflow and if we have dedicated sonatype space for RCF or any other common space we can use.

dblock commented 2 years ago

RCF is not part of opensearch-project, so I don't think it should use any OpenSearch project infrastructure unless you want to move the repo into that organization

I would follow https://docs.github.com/en/actions/publishing-packages/publishing-java-packages-with-gradle

ylwu-amzn commented 2 years ago

Thanks @dblock , do you think we should follow the same practice of OpenSearch: publish pre-release RCF to sonatype repo first. Once test done, we can publish to public maven(maven central).

dblock commented 2 years ago

Thanks @dblock , do you think we should follow the same practice of OpenSearch: publish pre-release RCF to sonatype repo first. Once test done, we can publish to public maven(maven central).

Yes, that's what most projects do, right? Although on a second thought what kind of testing would one do with the pre-release? Is staging really needed?

ylwu-amzn commented 2 years ago

From AD experience, RCF added TRCF and bumped version to 2.0, then we use local jar file to test TRCF in AD and found some bugs, then RCF fixed these bugs and back to AD. We did several iteration to reach a stable version. I think we should follow the same practice of OpenSearch and publish RCF to sonatype repo first. Once the new RCF version tested in AD and other places like MLCommmons, we can publish to public maven.

dblock commented 2 years ago

@ylwu-amzn My 0.02c is that this feels like the library is using AD testing to find bugs and in the future unnecessarily holds this library back. Thus, that's probably not a good long term strategy.

The story above could have been "AD was using 2.0-SNAPSHOT, found and reported a bunch of bugs which were fixed + tests, then eventually random-cut-forest-by-aws 2.0 was released, then more bugs were found that were fixed + tests, then 2.1 was released".

ylwu-amzn commented 2 years ago

Yes, the story much like SQL and MLCommons plugin, we publish MLCommons 1.3-SNAPSHOT first, then SQL will consume this 1.3-SNAPSHOT to develop new PPL ML command and test. The difference is we have same release date for all plugins and OpenSearch core. So we can hold all -SNAPSHOT on sonatype first, once the IT passed and OpenSearch bundle test done, we can release OpenSearch and publish to public Maven.

For RCF, as this is not part of OpenSearch, maybe we can follow different release timeline and decouple the testing. RCF should test and make sure all new features work as expected first, then they can choose to publish as -SNAPSHOT or final public version. We don't need to test RCF new features in AD or MLCommons. If AD or MLCommons side found any bug, we can report to RCF and ask for a fix and new version, just like using any other external lib.

But I'm totally fine if RCF team think some feature/change is too big and we'd better test -SNAPSHOT version in AD or MLCommons first.

dblock commented 2 years ago

I think random-cut-forest-by-aws should be publishing -SNAPSHOT builds with every commit so that those projects that depend on it can do early testing. Everything else seems like overhead.

dblock commented 2 years ago

https://github.com/aws/aws-xray-sdk-java/blob/master/.github/workflows/release-build.yml

ylwu-amzn commented 2 years ago

https://github.com/aws/aws-xray-sdk-java/blob/master/.github/workflows/release-build.yml

That's a good example. @amitgalitz , I think we can use the same way. @dblock do we need security review?

amitgalitz commented 2 years ago

Thank you! Really good find! This is pretty similar to this guide. https://docs.github.com/en/actions/publishing-packages/publishing-java-packages-with-maven The main blocking point right now was related to the security and how best to store/fetch the current credentials. Also this is an action that is manually triggered to publish to maven central which addresses one of the goals. I'll make sure to see if its as simple as changing the trigger event and some other condition to also add a workflow to publish a -SNAPSHOT on merge.

sean-zheng-amazon commented 2 years ago

Some quick updates:

  1. we've done a review with security team about putting credential onto github, and got approved.
  2. there were some discussion between doing this with maven vs converting the deployment script to gradle to follow the same practice as all other opensearch plugins.
  3. we decided to implement it in maven first, meanwhile discuss with the group on the feasibility/benefit of gradle option
  4. we'll publish a PR for the maven solution soon.
sudiptoguha commented 2 years ago

PR 334 merged