apache / accumulo

Apache Accumulo
https://accumulo.apache.org
Apache License 2.0
1.07k stars 445 forks source link

Modifies pom to allow git sha in version #4930

Closed keith-turner closed 1 month ago

keith-turner commented 1 month ago

Modifies the pom to allow specifying a git sha in the version. When testing with downstream software and using a SNAPSHOT version of accumulo it can be difficult to control what SNAPSHOT you actually get. These changes allow including the git sha in the accumulo version number which removes the suprises around using SNAPSHOT versions.

With these changes can include the sha in the version with the following command.

mvn clean install -PskipQA -Dsha1="-$(git rev-parse --short HEAD)"

For more information see :

https://maven.apache.org/maven-ci-friendly.html

keith-turner commented 1 month ago

Looked into how thiese changes would work with the release pluging and found the following.

https://stackoverflow.com/a/67012045 https://medium.com/@makhaer/replace-maven-release-plugin-with-maven-ci-friendly-1496d63de9ed

Those are saying the release pluging is not needed anymore. For the accumulo release process is the release plugin still needed? Does the release plugin upload artifacts to the apache maven repos?

ddanielr commented 1 month ago

Built these changes locally and tested various build commands.

  1. mvn package -DskipTests -DskipITs produced accumulo-2.1.4-SNAPSHOT artifacts as expected

  2. mvn -Drevision=2.1.4 -Dsha1=$(git rev-parse --short HEAD) clean install -DskipTests -DskipITs produced artifacts with a 2.1.451b8cad81e-SNAPSHOT version and installed all the jars correctly into my local mvn repo.

For building a release version, an empty -Dchangelist= build arg will need to be supplied but from a dev perspective this makes building and testing specific artifacts much easier.

keith-turner commented 1 month ago

or building a release version, an empty -Dchangelist=

I tried running this and the artifacts are shorter. Adding SNAPSHOT when the hash is present does not really seem useful.

mvn clean install -PskipQA -Dsha1="-$(git rev-parse --short HEAD)" -Dchangelist=""
ddanielr commented 1 month ago

or building a release version, an empty -Dchangelist=

I tried running this and the artifacts are shorter. Adding SNAPSHOT when the hash is present does not really seem useful.

mvn clean install -PskipQA -Dsha1="-$(git rev-parse --short HEAD)" -Dchangelist=""

I need to remember to add the hyphen between the git sha and version number for better visibility.

I agree that adding SNAPSHOT doesn't seem useful for the unique artifact use case.

ctubbsii commented 1 month ago

To be clear about my review, I would have to do more testing to fully understand the potential side effects on our release process, and other things, which I have not done. However, I don't find the change compelling regardless. If others do, and my points against doing this are not convincing, I can proceed with additional testing to determine the full impact and potential side-effects of this and re-review. I just don't want to do that extra work if I don't have to (I would rather spend my time elsewhere in the project).

cshannon commented 1 month ago

I saw this and was going to comment that there is no way this change would work with the maven-release-plugin which would be a blocker but I saw @ctubbsii already pointed that out.

This might be a problem better solved from the downstream project side. For example, what about the Versions Maven Plugin? That plugin can do a lot of nice things and I use it a lot and it might be something useful here.

There is the lock-snapshots goal which could probably be used here when building a project that depends on an Accumulo snapshot. That can be run on the downstream project and that would lock the snapshot version of Accumulo that is used to a timestamp.

cshannon commented 1 month ago

I realized that for using the timestamp for snapshots that only works if the dependency you want to use has been deployed. I believe that running the mvn install goal will just copy artifacts with the base SNAPSHOT name appended. The mvn deploy goal will actually append timestamp versions so if the build has not been deployed then the lock-snapshot thing may not work.

keith-turner commented 1 month ago

Many times I have run into a problem like the following trying to test changes on a cluster with accumulo-testing prior to committing those changes upstream.

  1. Create a branch of accumulo
  2. Create a branch of accumulo-testing that depends on a snapshot version of accumulo
  3. Stand up a small test cluster
  4. Build/install accumulo and accumulo-testing on the test cluster from the branches above
  5. Pause test activity for a while
  6. Rebuild accumulo-testing, which builds a shaded jar that includes accumulo jars, and have it pull another snapshot version of accumulo other than the one under test. If this is not noticed it will silently invalidate the test w/o anyone realizing the test is not valid.

What would be really nice is to be able to do the following type of test.

  1. Create specifically named "snapshot" versions of accumulo that include the git hash or some other id that differentiates that from all other snapshot builds.
  2. Push the jars and tarball from the build to a maven repo that is only visible to a ci/cd pipeline
  3. Create a branch of accumulo-testing that will only work with that specific version according to the git hash, if it can not find it should fail.
  4. Spin up automated testing of that specific version of accumulo in a cluster and/or kubernetes and have confidence in the version of accumulo that is being tested.

Would be really nice to be able to do test like this concurrently. For example extensively testing 3 accumulo PRs that are not yet committed. Each PR concurrently running through a distributed ci/cd process that uses accumulo-testing while building shaded jars, docker images, vms etc as part of the testing. It is important to have confidence that the three concurrent distributed test runs tested the expected code and not some other random snapshot version of accumulo.

I do not know if the changes in this PR are the best way to achieve these goals, it was something I found yesterday. I do think its interesting that the maven project thought the changes were important enough to change maven itself. That makes me curious about the full implications of this maven change which I do not understand. These maven ci changes may be a foundational change to the way maven works that is a really nice improvement but can never take off because of the inertia of all the existing plugins. This project should not adopt this new practice if it breaks the plugins involved in the release process.

keith-turner commented 1 month ago

Overall, I think this approach doesn't really solve anything better than how we've already solved it with mavanagaiata and the manifest entries, but where it is useful in the artifact names, a simple patch, release:update-versions or one-liner, or simple script, would do a much better job of changing the version for that specific use case, with far fewer side effects that are likely to cause more problems.

Using release:update-versions may be a good way to solve this. One interesting difference is that running release:update-versions would dirty the workspace which would throw off mavanagaiata because the workspace is dirty. These mvn ci changes allow changing the version produced w/o dirtying the workspace.

ctubbsii commented 1 month ago

Many times I have run into a problem like the following trying to test changes on a cluster with accumulo-testing prior to committing those changes upstream.

During development/testing, I encounter this pattern as well, but I have rarely or never had a problem unintentionally pulling in a SNAPSHOT version other than the one I'm intending. I think this is because I do some things to avoid this:

  1. I never do mvn install except when I need to for this kind of scenario. I always use mvn package or mvn verify instead, which do not install SNAPSHOT builds to my local repository.
  2. When I know I'm going to be using SNAPSHOT dependencies, or when I am finished, I take care to do something like rm -rf ~/.m2/repository/org/apache/accumulo/ to remove any SNAPSHOTs that may have polluted my local repository.
  3. When I have need to mvn install for this kind of thing, I always do so from a working branch that I have made specifically for this purpose, and I always track which branch I'm on using git-prompt.sh so I always know which branch I'm on.

As you point out, the biggest risk here is just not noticing when using the wrong version. So I take these steps to make sure I notice. You could also be more rigorous than this, by taking additional steps, like:

  1. Use a separate dedicated <localRepository> for your experiment with a custom settings.xml file and specify that on the command-line with mvn -s /path/to/experiment/settings.xml
  2. Wrap your mvn command with something specific for your experiment that checks the sha1 in the manifest for any of the SNAPSHOT accumulo dependencies you want pulled from there.
  3. Use a separate git worktree for your experimental branch

What would be really nice is to be able to do the following type of test.

  1. Create specifically named "snapshot" versions of accumulo that include the git hash or some other id that differentiates that from all other snapshot builds.

You can always do this from a local working branch that installs to the local repo. You can always rebase the commits onto your target branch later, using interactive rebase, and exclude the commit that changed the versions. I've done this a few times, and it works out quite well. It's easy to remember to do that rebase, because I always review the full diff before I push anything.

  1. Push the jars and tarball from the build to a maven repo that is only visible to a ci/cd pipeline

There are a few ways you can do something like this... you can publish releases (versions without "-SNAPSHOT") to a staging repository, much like we do for release candidate voting. And then you can drop them instead of publish/release them. However, the staging repo number will change every time you close the repo, so you'd have to update your CI pipeline (I'm excluding CD... we don't do that... we have voting on releases here... no automated continuous delivery). This is because the repo is write-only until it is closed, and after it is closed, it is read-only. So you can't update the contents without creating a new staging repo.

Another way you could do this is to operate your own Maven repo for this purpose, in which you can do whatever you want.

If you want to use the existing SNAPSHOT repository (like we do with our ci-builds.apache.org Jenkins builds), you have to keep "SNAPSHOT" in the name. However, even if you add the sha1 to the version, there is no way to make sure it actually came from that commit, and it would still be possible to publish multiple snapshots with the same sha1 name, with different timestamps... which is kind of similar to the situation we have now, where we have multiple snapshots with the same version at different times, and you can't be sure which is the one you want. This PR doesn't solve that, though.

  1. Create a branch of accumulo-testing that will only work with that specific version according to the git hash, if it can not find it should fail.

We could just modify the accumulo-testing scripts/build to check the manifest of the dependencies automatically if a particular environment variable is set for the build to lock it to that commit, and fail the build or test if it fails to match. This is kind of what is done with fluo-uno, where we verify that the downloaded applications match a particular criteria (checksum or cryptographic hash, in that case).

  1. Spin up automated testing of that specific version of accumulo in a cluster and/or kubernetes and have confidence in the version of accumulo that is being tested.

Would be really nice to be able to do test like this concurrently. For example extensively testing 3 accumulo PRs that are not yet committed. Each PR concurrently running through a distributed ci/cd process that uses accumulo-testing while building shaded jars, docker images, vms etc as part of the testing. It is important to have confidence that the three concurrent distributed test runs tested the expected code and not some other random snapshot version of accumulo.

This use case sounds like using the staging repos would work best for concurrent independent tests on particular snapshots.

I do not know if the changes in this PR are the best way to achieve these goals, it was something I found yesterday. I do think its interesting that the maven project thought the changes were important enough to change maven itself.

I wouldn't read too much into the fact that it was added as an option. Maven, like many projects, tries to accommodate a wide variety of use cases from its users. I don't think that means it's a preferred, or better, option to use for a particular scenario. It may not even be the best option for any scenario. As far as I can tell from briefly looking around, Maven's own code does not take advantage of this feature in any way for itself (but I didn't spend that much time looking).

That makes me curious about the full implications of this maven change which I do not understand. These maven ci changes may be a foundational change to the way maven works that is a really nice improvement but can never take off because of the inertia of all the existing plugins. This project should not adopt this new practice if it breaks the plugins involved in the release process.

I wouldn't say it's foundational... but it's definitely new, and diverges from normal conventions. It certainly would require us to make substantial changes elsewhere if we were to use it instead of the maven-release-plugin.

Using release:update-versions may be a good way to solve this.

versions:set does something similar and is another option if release:update-versions doesn't quite work well for this.

One interesting difference is that running release:update-versions would dirty the workspace which would throw off mavanagaiata because the workspace is dirty.

You could always check in the changes to your experimental branch (as I mentioned above). But mavanagaiata will handle it fine. It will just note in the manifest that the workspace is dirty... you won't know why its dirty (what changes exist that dirtied it), though, so perhaps that's what you meant by "throw off". But with the proposed change in this PR to use git rev-parse, you wouldn't even have that much information... and wouldn't be able to tell whether the workspace was dirty for that build or not. So, the annotated "-dirty" on the hash in the manifest is at least marginally better for that. But, like I previously said, you can always check it in for a clean workspace.


One of the things I was thinking about, since much of the use case arguing for this seems to be centered around CI, and chaining projects, I'd like to point out that it is entirely possible for Jenkins to chain builds with a complex Jenkinsfile. You can write a CI workflow that doesn't rely on deployed SNAPSHOT builds at all, but explicitly checks out a particular experimental branch, builds it, installs the artifacts to a clean local repository as part of a first phase, and then builds the depending project in a second phase. You don't actually have to depend on deployed SNAPSHOTs for a reproducible CI workflow, since the workflow itself can be made to build the dependency at a particular branch/commit.

keith-turner commented 1 month ago

During development/testing, I encounter this pattern as well, but I have rarely or never had a problem unintentionally pulling in a SNAPSHOT version other than the one I'm intending. I think this is because I do some things to avoid this:

I am also very careful w/ snapshot builds. The usual way unexpected snapshots sneak in is because the accumulo-testing pom declares the ASF snapshot repos and this causes snapshot builds generated by the ASF jenkins build server to be pulled.

However it is done, it does not need to be via the changes proposed here, it would be nice to be able to easily produce artifacts with the git hash in the version if possible. Using -SNAPSHOT builds correctly in distributed testing requires that one think of all externalities and ensure those are correctly configured. If there is an externality one did not consider and it pulls in a SNAPSHOT build other than what is desired then a passing test would be a false positive. Including the git hash in the version avoids this problem for practical purposes. Since these changes do not seem to support that goal w/o breaking release plugins I am closing this PR and will pursue other avenues like changing the version in all poms prior to creating artifacts.

ctubbsii commented 1 month ago

will pursue other avenues like changing the version in all poms prior to creating artifacts.

Here's one option:

mvn versions:set -DnewVersion="2.1.4.$(git rev-parse HEAD)-SNAPSHOT" -DgenerateBackupPoms=false &&
  mvn install -PskipQA &&
  git reset --hard # or another call to versions:set to reset the version here

The manifest will show the revision with -dirty appended in the Implementation-Build field, but I think that's okay if the version bump is the only thing that dirtied the workspace. If you want to avoid that, you could commit the version change in your development branch.