AMI stack: improve version numbering

ghost commented 7 years ago

Currently, @petermr manually increments the version number of each piece of ContentMine (e.g. euclid, svg, etc.) before each push.

This creates large scope for:

human error;
resulting dependency hell.

The manual process takes PMR at least half an hour; sometimes longer. This in turn means that PMR rarely pushes more than once per day.

This in turn creates large scope for:

ContentMine developers getting out of sync with each other;
resulting dependency hell.

An additional problem is that it is not entirely clear what any given version increment means.

ghost commented 7 years ago

I think ContentMine should adopt one of the two solutions below. Note that they may be mutually exclusive.

Branched releases

Similar to: Debian or Ubuntu.
Suitable Git workflow: Git Flow.
Ramifications:
- Each ContentMine release must pass all tests (although that doesn't necessarily mean it will be bug-free).
- Each ContentMine release must be tagged using SemVer (e.g. 2.1.3).
- The API/ABI of the various pieces of ContentMine will vary only from major version to major version, as per SemVer.

Rolling release

Similar to: Arch Linux.
Suitable Git workflow: GitHub Flow.
Ramifications:
- Each ContentMine release must pass all tests (although that doesn't necessarily mean it will be bug-free).
- Each ContentMine release must be tagged with the date of the release (e.g. 2017.05.01).
- The API/ABI of the various pieces of ContentMine may vary from release to release.

tarrow commented 7 years ago

We definitely need to make this situation better.

I don't think Peter should be bumping each piece before every push but I think each piece "down the stack" needs to be bumped. i.e. if something at the top is changed then everything needs to be bumped. If something at the bottom is changed then only it needs a version bump. I guess all of this is basically saying that the updates happen because the API/ABI of the libraries has changed and this must correspond with a version bump.
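The "bump everything down the stack" rule described above can be sketched as a walk over a dependency graph. This is only an illustration: the module names appear in this thread, but the edges in `DEPENDS_ON` are assumed for the example and are not taken from the real POMs, and `modules_to_bump` is a hypothetical helper.

```python
from collections import deque

# Hypothetical dependency map: module -> modules it depends on directly.
# The edges are assumptions made for illustration only.
DEPENDS_ON = {
    "euclid": [],
    "svg": ["euclid"],
    "norma": ["svg"],
    "ami": ["norma"],
}


def modules_to_bump(changed, depends_on):
    """Return the changed module plus every direct or transitive dependent."""
    # Invert the map so we can walk from a module to the modules that use it.
    dependents = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(name)

    seen = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for user in dependents[current]:
            if user not in seen:
                seen.add(user)
                queue.append(user)
    return seen
```

Under these assumed edges, a change to euclid selects all four modules, while a change to ami selects only ami.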

We don't often (ever) find there are changes to euclid that don't change the API/ABI that something further down uses.

I'm currently not quite sure how either of these solutions solves the problem of manually bumping versions, though. Even if we did go to (e.g.) a rolling release, if a meaningful change was made to euclid then everything down the stack would still need to increment in version to today's date (assuming that the change in euclid is used further down the stack in some way).

Is the problem that this bumping of versions needs to be automated?

tarrow commented 7 years ago

So far we were following a SemVer type numbering system. Everything should still have a major version of 0 which implies anything can change at anytime but... we still need each package to point at the right version of a dependency if it does indeed depend on it.

I think I favour github flow. I personally think that having things like a hotfix branch is unnecessary for us.

That said, I don't see why using GitHub Flow necessitates versioning by date of release. Why can't you just bump the version incrementally (and stick to SemVer rules if you like)? In fact, this is more or less required if you release more than once per day.
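A small sketch of the point about releasing more than once per day: two releases on the same day get the same date tag, whereas an incremented version stays unique. The helper names `date_tag` and `next_patch` are hypothetical, and the tag format simply follows the 2017.05.01 example given earlier in this thread.

```python
from datetime import date


def date_tag(day):
    """Format a release tag such as 2017.05.01 from a date."""
    return day.strftime("%Y.%m.%d")


def next_patch(version):
    """Increment the patch part of a MAJOR.MINOR.PATCH version string."""
    major, minor, patch = (int(part) for part in version.split("."))
    return f"{major}.{minor}.{patch + 1}"


release_day = date(2017, 5, 1)

# Two releases on the same day produce identical date tags.
morning_tag = date_tag(release_day)
afternoon_tag = date_tag(release_day)

# Incrementing a version number keeps the two releases distinct.
first_version = "0.1.0"
second_version = next_patch(first_version)
```

A date-based scheme could of course append a counter or a time of day, but at that point it is an incrementing scheme again.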

It's also worth noting that maven has this concept of SNAPSHOT releases. http://stackoverflow.com/questions/5901378/what-exactly-is-a-maven-snapshot-and-why-do-we-need-it

ghost commented 7 years ago

@tarrow wrote:

It's also worth noting that maven has this concept of SNAPSHOT releases. http://stackoverflow.com/questions/5901378/what-exactly-is-a-maven-snapshot-and-why-do-we-need-it

Yes, @petermr and I are considering whether the codebase should return to using Maven snapshots :)

Trouble is, I'm brand new to Maven, and @petermr is more focused on other aspects of the codebase. Hence filing this bug, as something to fix as soon as we have enough knowledge and time to do so.

Fixes and pull requests welcome!

ghost commented 7 years ago

@tarrow wrote:

So far we were following a SemVer type numbering system. Everything should still have a major version of 0

I agree that the ContentMine components should have a major version of 0, if following SemVer. Unfortunately, they don't :(

For example:

euclid's pom.xml shows 2.1.1;
svg's pom.xml shows 1.1.0.

ghost commented 7 years ago

@tarrow wrote:

We don't often (ever) find there are changes to euclid that don't change the API/ABI that something further down uses.

This sounds like pathological coupling.

TODO: I think we ought to fix this asap.

Even if we did go to (e.g.) a rolling release, if a meaningful change was made to euclid then everything down the stack would still need to increment in version to today's date (assuming that the change in euclid is used further down the stack in some way). [...]

Fixing the pathological coupling should obviate this assumption, mitigating that concern. I hope!

Is the problem that this bumping of versions needs to be automated?

Possibly, but I doubt it. Here's why.

1. we need to fix the pathological coupling; and
2. @petermr needs to stop thinking of every push as a release, and adopt an improved Git workflow instead :) In typical Git and Maven workflows, branching/committing/merging/pushing/etc. happen frequently (e.g. many times per day); making and tagging a release happens much less frequently (e.g. once per week, or less).

Once those two factors are in place, it will probably be best for ContentMine to manually (or at most, semi-automatically with Maven's assistance) tag releases using SemVer, because:

it will only need to be done occasionally (e.g. weekly); and
an automated release tagger won't (unless it is very smart) know which part(s) of the SemVer version number to alter.

By "semi-automatically" above, I mean that the actual edits to the pom.xml files could perhaps be made by a script that would be told, e.g. "Bump the minor version number for modules x, y, and z." This would reduce the risk of typos and other human errors. In a Java project using Maven, the right tool for this sort of thing seems to be the Maven release plugin, but if you know of a better tool, do please mention it in a comment below!
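As a rough sketch of what such a script's core logic might look like, assuming plain MAJOR.MINOR.PATCH version strings: the helpers `bump` and `bump_modules` are hypothetical, the two starting versions are the ones quoted earlier in this thread, and reading from or writing to the pom.xml files is deliberately left out.

```python
def bump(version, part):
    """Bump one part of a MAJOR.MINOR.PATCH version, resetting lower parts to 0."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown version part: {part!r}")


def bump_modules(versions, modules, part):
    """Return a copy of the version map with the named modules bumped."""
    updated = dict(versions)
    for name in modules:
        updated[name] = bump(versions[name], part)
    return updated


# Versions as reported for the two pom.xml files earlier in this thread.
versions = {"euclid": "2.1.1", "svg": "1.1.0"}

# "Bump the minor version number for modules euclid and svg."
bumped = bump_modules(versions, ["euclid", "svg"], "minor")
```

The human still decides which part to bump; the script only removes the typing, which is where typos would otherwise creep in.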

ghost commented 7 years ago

@sampablokuper wrote

In a Java project using Maven, the right tool for this sort of thing seems to be the Maven release plugin

Additionally, the Maven dependency plugin provides helpful ways to analyse and troubleshoot dependencies.

ghost commented 7 years ago

Relevant reading:

SemVer support was added to Maven in March 2015.
Some other people in the world use an approach similar to @petermr's (and seem to be seeking better alternatives).
SO question asking about SemVer vs Maven version numbering.
Research showing Maven fails to enforce SemVer but noting a Maven plugin that can assist.
https://maven.apache.org/guides/mini/guide-multiple-modules.html
http://blog.osgi.org/2013/09/baselining-semantic-versioning-made-easy.html
https://stackoverflow.com/questions/17312681/maven-multi-module-project-version-management
https://stackoverflow.com/questions/5726291/updating-version-numbers-of-modules-in-a-multi-module-maven-project
https://stackoverflow.com/questions/8330093/maven-multimodule-projects-and-versioning
http://www.mojohaus.org/versions-maven-plugin/examples/update-child-modules.html
http://www.mojohaus.org/flatten-maven-plugin/examples/example-central-version.html
https://dev.c-ware.de/confluence/display/PUBLIC/Releasing+modules+of+a+multi-module+project+with+independent+version+numbers
https://danielflower.github.io/multi-module-maven-release-plugin/

ghost commented 7 years ago

My present understanding

Our goal here is to use Maven, specifically, so as to be able to choose which dependency sub-tree to clone and build, within a multi-module project.

For example, we might want (e.g. for CI purposes) to build only the svg module and its dependencies.
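To make "a module and its dependencies" concrete, here is a minimal sketch of the selection such a build would need, roughly what Maven's `-am` (also-make) option does for modules within one reactor. The edges in `DEPENDS_ON` are assumptions made for illustration, not read from the real POMs, and `build_set` is a hypothetical helper.

```python
# Hypothetical dependency map: module -> modules it depends on directly.
# The edges are assumptions made for illustration only.
DEPENDS_ON = {
    "euclid": [],
    "svg": ["euclid"],
    "norma": ["svg"],
    "ami": ["norma"],
}


def build_set(target, depends_on):
    """Return the target plus all of its direct and transitive dependencies."""
    needed = set()
    stack = [target]
    while stack:
        current = stack.pop()
        if current in needed:
            continue  # already visited; this also guards against cycles
        needed.add(current)
        stack.extend(depends_on[current])
    return needed
```

With these assumed edges, building svg would require only svg and euclid, so norma and ami would not need to be cloned or built.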

Maven has, by default, two mechanisms that can be used for this purpose:

Inheritance

Is bottom-up: the child declares its parent.
Must package the parent as pom. This seems to preclude nesting anything whose <packaging> element's content is not "pom".
Causes the child POM to inherit these (and other, but not all) elements from the parent POM:
- groupId
- version
- description
- url
- inceptionYear
- organization
- licenses
- developers
- contributors
- mailingLists
- scm
- issueManagement
- ciManagement
- properties
- dependencyManagement
- dependencies
- repositories
- pluginRepositories
- build
  - plugin executions with matching ids
  - plugin configuration
  - etc.
- reporting
- profiles

I have highlighted the elements that seem most likely to be useful in the AMI stack.

Aggregation

Is top-down: the parent declares its children.
Must package the parent as pom. This seems to preclude nesting anything whose <packaging> element's content is not "pom".

Dependencies

Maven's "dependencies" and "dependencyManagement" mechanisms are largely independent from Maven's inheritance and aggregation mechanisms.

Archetypes

To check whether my understanding of multi-module Maven projects was in line with that of experienced Maven users, I generated, on my local PC, an instance of each of the 13 Maven archetypes that describes itself as multi-module:

4 99 245 390 435 437 438 503 890 1238 1241 1324 1690

These ~~all~~ mostly follow the same structural pattern:

project
|-pom.xml          (packaged as "pom")
|
|-module1
| |-pom.xml
| |-module1-files   (packaged as "jar"/"war"/etc)
|
|-module2
| |-pom.xml
| |-module2-files   (packaged as "jar"/"war"/etc)
|
...

As I expected, given my findings above about Maven inheritance and aggregation, none of them nest anything that is packaged as a JAR or WAR, etc.

This structure might look familiar: it is also given in the Maven introduction.

If you want to see a multi-module project with nested parent POMs, here is an example.

Git submodules

The AMI stack appears to be a suitable use-case for Git submodules:

one super-project
many sub-modules
dependencies between the submodules may change, and may not even be acyclic
we should be able to clone only the submodules we need for each use-case (e.g. CI builds)

Maven SCM plug-in

Maven's SCM plug-in is, in theory, capable of interacting with Git, and of performing clone/pull/etc operations. ~~Whether it is sophisticated enough to work with Git submodules, I do not know. I am looking into this.~~ At time of writing (early June 2017), the SCM plug-in has an unresolved feature request asking for Git submodules to be supported.

Submodule support shouldn't be necessary, perhaps, as long as Maven is capable of cloning from the correct remote for each module into a suitable folder, which it should be able to.

Wildcat plugins

Many maverick Maven developers release Maven plugins outside the Apache project. Some potentially good options for the AMI use-case are:

Alternatively, one can just tell Maven to invoke Git directly: https://stackoverflow.com/a/29796988/

ghost commented 7 years ago

@petermr, I am making progress on this, but clearly could use a bit more time, and the book.

I hope to have a firm set of recommendations available for you by the end of the week. In the meantime, here are my preliminary recommendations, assuming the AMI stack is to be considered a multi-module project:

1. Use SNAPSHOTs for all the modules in the AMI stack, at least until ContentMine staff have mastered Maven's release plugin.
2. Use Maven's aggregation mechanism, in the form of a <modules> element in cm-pom, to declare each module of AMI.
3. Use Maven's inheritance mechanism, in the form of <parent> elements in the modules' POMs, to keep the latter DRY.
4. Nested parent POMs would be appropriate if any proper subset of AMI's modules is to be considered a module in its own right. Avoid nesting parent POMs otherwise.
5. Evaluate the convention for laying out the AMI stack on disk. Currently, AMI developers seem to clone each module separately, without regard to their relative paths. That has both advantages and disadvantages, but probably more of the latter. Instead, developers should probably follow the structural pattern diagrammed in the previous comment in this thread, or that given by the linked nested parent POMs example.
6. Evaluate the SCM plugin vs the other alternatives given in the previous comment in this thread, in conjunction with mvn -pl <project> -amd, to facilitate CI.

ghost commented 7 years ago

Re-opening, as issue has not been resolved and seems to have been closed erroneously. (Cf. https://github.com/ContentMine/meta/issues/5#issuecomment-308723497 .)

ghost commented 7 years ago

As people reading this may know, I moved on to other work before making the firm recommendations mentioned above. Anyhow, in the interest of tying up loose ends, here are some potentially useful pointers.

First, here is a table of reasonable ways to manage multi-module Maven projects, and their implementation requirements (short of shell scripts or other ad-hoc solutions).

| Git: repo per module or per project | Maven submodules: beneath or level with parent module | CI dependency handling requires Git submodules or subtrees | CI dependency handling requires a persistent Maven repository |
| --- | --- | --- | --- |
| Module | Beneath | 1 | 0 |
| Module | Level | 0 | 1 |
| Project | Beneath | 0 | 0 |
| Project | Level | 0 | 0 |

The third and fourth options seem to be preferred by professional Maven users. See, for instance, this book, or this example (which was also referenced in a previous comment). Unfortunately, the size of the Norma and AMI modules makes the third and fourth options unappealing for the AMI stack, as developers wishing to work only on the smaller modules in the stack would be forced to download the larger ones (even if they only stored the desired subdirectories).

Last time I checked, ContentMine used the second option, which was proving difficult to extend to the whole AMI stack. Extending it to the whole stack might be facilitated by using Travis's caching facilities. Alternatively, moving to GitLab would make GitLab CI's artifacts facilities available, which might be more amenable to ContentMine's needs than Travis's caching facilities are.

Finally, what about the first option? I'm afraid I won't have time in the immediate future to work much more on it myself (unless contracted to do so), but I have made some headway. See the "refactor-maven-as-multimodule" branches, where present, in my clones of the AMI stack modules, e.g. https://github.com/sampablokuper/cm-pom/commits/refactor-maven-as-multimodule to get a sense of direction. (Please note that I consider those branches to be fluid and may force-push to them without notice.)
