ContentMine / meta

A repository in which to file and fix meta issues (issues affecting more than one ContentMine repo or project)

AMI stack: improve version numbering #1

Open ghost opened 7 years ago

ghost commented 7 years ago

Currently, @petermr manually increments the version number of each piece of ContentMine (e.g. euclid, svg, etc.) before each push.

This creates large scope for:

The manual process takes PMR at least half an hour; sometimes longer. This in turn means that PMR rarely pushes more than once per day.

This in turn creates large scope for:

An additional problem is that it is not entirely clear what any given version increment means.

ghost commented 7 years ago

I think ContentMine should adopt one of the two solutions below. Note that they may be mutually exclusive.

Branched releases

Rolling release

tarrow commented 7 years ago

We definitely need to make this situation better.

I don't think Peter should be bumping each piece before every push, but I do think each piece "down the stack" needs to be bumped when something it depends on changes. That is, if something at the top is changed then everything needs to be bumped; if something at the bottom is changed then only it needs a version bump. I guess all of this is basically saying that the updates happen because the API/ABI of the libraries has changed, and this must correspond with a version bump.
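To make the "down the stack" rule concrete, here is a minimal sketch of how the set of modules needing a bump could be computed. The module names come from this thread, but the dependency edges between them are assumed purely for illustration, and `modules_to_bump` is a hypothetical helper, not part of any existing tool.

```python
from collections import deque

# Hypothetical dependency map: module -> modules it depends on directly.
# The edges are assumed for illustration and may not match the real AMI stack.
DEPENDS_ON = {
    "euclid": [],
    "svg": ["euclid"],
    "norma": ["svg"],
    "ami": ["norma"],
}


def modules_to_bump(changed, depends_on):
    """Return the changed module plus everything that depends on it,
    directly or transitively."""
    # Invert the map so we can walk from a module to its dependents.
    dependents = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(name)

    to_bump = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent not in to_bump:
                to_bump.add(dependent)
                queue.append(dependent)
    return to_bump


print(sorted(modules_to_bump("euclid", DEPENDS_ON)))  # ['ami', 'euclid', 'norma', 'svg']
print(sorted(modules_to_bump("ami", DEPENDS_ON)))     # ['ami']
```

A change at the top of the assumed stack (euclid) touches everything, while a change at the bottom (ami) touches only itself.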

We don't often (ever) find there are changes to euclid that don't change the API/ABI that something further down uses.

I'm currently not quite sure how either of these solutions solves the problem of manually bumping versions, though. Even if we did go to (e.g.) a rolling release, if a meaningful change was made to euclid then everything down the stack would still need to increment in version to today's date (assuming that the change in euclid is used further down the stack in some way).

Is the problem that this bumping of versions needs to be automated?

tarrow commented 7 years ago

So far we were following a SemVer-type numbering system. Everything should still have a major version of 0, which implies anything can change at any time, but we still need each package to point at the right version of a dependency if it does indeed depend on it.

I think I favour github flow. I personally think that having things like a hotfix branch is unnecessary for us.

That said, I don't see why using github flow necessitates versioning by date of release. Why can't you just bump the version incrementally (and stick to SemVer rules if you like)? In fact this is kind of needed if you release more than once per day.
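A small sketch of why incremental numbering copes with several releases per day where a purely date-based scheme does not. The `bump` and `date_version` helpers and the date format are assumptions made up for this illustration; they are not an existing ContentMine convention.

```python
from datetime import date


def bump(version, part):
    """Increment one part of a MAJOR.MINOR.PATCH string and reset the lower parts."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part}")


def date_version(day):
    """Date-based version string (one possible rolling-release scheme, assumed here)."""
    return f"{day.year}.{day.month}.{day.day}"


# Two releases on the same (arbitrary, illustrative) day.
release_day = date(2017, 6, 5)
first, second = date_version(release_day), date_version(release_day)
print(first == second)  # True: the date alone cannot tell the two releases apart

# Incremental versions stay distinct however often you release.
v1 = bump("0.1.0", "patch")
v2 = bump(v1, "patch")
print(v1, v2)  # 0.1.1 0.1.2
```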

It's also worth noting that maven has this concept of SNAPSHOT releases. http://stackoverflow.com/questions/5901378/what-exactly-is-a-maven-snapshot-and-why-do-we-need-it

ghost commented 7 years ago

@tarrow wrote:

It's also worth noting that maven has this concept of SNAPSHOT releases. http://stackoverflow.com/questions/5901378/what-exactly-is-a-maven-snapshot-and-why-do-we-need-it

Yes, @petermr and I are considering whether the codebase should return to using Maven snapshots :)

Trouble is, I'm brand new to Maven, and @petermr is more focused on other aspects of the codebase. Hence filing this bug, as something to fix as soon as we have enough knowledge and time to do so.

Fixes and pull requests welcome!

ghost commented 7 years ago

@tarrow wrote:

So far we were following a SemVer type numbering system. Everything should still have a major version of 0

I agree that the ContentMine components should have a major version of 0, if following SemVer. Unfortunately, they don't :(

For example:

ghost commented 7 years ago

@tarrow wrote:

We don't often (ever) find there are changes to euclid that don't change the API/ABI that something further down uses.

This sounds like pathological coupling.

TODO: I think we ought to fix this asap.

Even if we did go to (e.g.) a rolling release, if a meaningful change was made to euclid then everything down the stack would still need to increment in version to today's date (assuming that the change in euclid is used further down the stack in some way). [...]

Fixing the pathological coupling should obviate this assumption, mitigating that concern. I hope!

Is the problem that this bumping of versions needs to be automated?

Possibly, but I doubt it. Here's why.

  1. we need to fix the pathological coupling; and
  2. @petermr needs to stop thinking of every push as a release, and adopt an improved Git workflow instead :) In typical Git and Maven workflows, branching/committing/merging/pushing/etc happen frequently (e.g. many times per day); making and tagging a release happens much less frequently (e.g. once per week, or less).

Once those two factors are in place, it will probably be best for ContentMine to manually (or at most, semi-automatically with Maven's assistance) tag releases using SemVer, because:

By "semi-automatically" above, I mean that the actual edits to the pom.xml files could perhaps be made by a script that would be told, e.g. "Bump the minor version number for modules x, y, and z." This would reduce the risk of typos and other human errors. In a Java project using Maven, the right tool for this sort of thing seems to be the Maven release plugin, but if you know of a better tool, do please mention it in a comment below!
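As a rough illustration of the kind of script meant by "semi-automatically", the sketch below bumps the minor version in a POM's top-level `<version>` element using only the Python standard library. The sample POM, its `groupId`, and the `bump_minor_in_pom` helper are hypothetical; this is not how the Maven release plugin works internally, just one way a small script might do the edit.

```python
import xml.etree.ElementTree as ET

POM_NS = "http://maven.apache.org/POM/4.0.0"
ET.register_namespace("", POM_NS)  # serialise without "ns0:" prefixes

# Hypothetical minimal POM; groupId and version are placeholders.
SAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.example</groupId>
  <artifactId>euclid</artifactId>
  <version>0.3.2</version>
</project>"""


def bump_minor_in_pom(pom_text):
    """Return the POM text with the project's own minor version incremented."""
    root = ET.fromstring(pom_text)
    # find() only looks at direct children, so versions nested inside
    # <parent> or <dependency> elements are left alone.
    version = root.find(f"{{{POM_NS}}}version")
    if version is None or version.text is None:
        raise ValueError("POM has no top-level <version> element")
    major, minor, _patch = (int(p) for p in version.text.strip().split("."))
    version.text = f"{major}.{minor + 1}.0"
    return ET.tostring(root, encoding="unicode")


print(bump_minor_in_pom(SAMPLE_POM))
```

Note the limitations of this sketch: it assumes a plain MAJOR.MINOR.PATCH version (a `-SNAPSHOT` suffix would make it fail), and ElementTree drops XML comments and may alter formatting when it rewrites the file.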

ghost commented 7 years ago

@sampablokuper wrote

In a Java project using Maven, the right tool for this sort of thing seems to be the Maven release plugin

Additionally, the Maven dependency plugin provides helpful ways to analyse and troubleshoot dependencies.

ghost commented 7 years ago

Relevant reading:

ghost commented 7 years ago

My present understanding

Our goal here is to use Maven, specifically, so as to be able to choose which dependency sub-tree to clone and build, within a multi-module project.

For example, we might want (e.g. for CI purposes) to build only the svg module and its dependencies.
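To illustrate what "only the svg module and its dependencies" means in practice, here is a minimal sketch that computes a build order for one target module. The dependency edges are assumed for illustration and `build_order` is a hypothetical helper. Within a single multi-module (reactor) build, Maven's own `--projects` (`-pl`) and `--also-make` (`-am`) options select a module together with the modules it depends on, so a script like this would only be relevant when the modules live outside one reactor.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical dependency map: module -> modules it depends on directly.
# The edges are assumed for illustration and may not match the real AMI stack.
DEPENDS_ON = {
    "euclid": [],
    "svg": ["euclid"],
    "norma": ["svg"],
    "ami": ["norma"],
}


def build_order(target, depends_on):
    """Return the target and its transitive dependencies, dependencies first."""
    needed = set()
    stack = [target]
    while stack:
        module = stack.pop()
        if module not in needed:
            needed.add(module)
            stack.extend(depends_on[module])
    # TopologicalSorter expects node -> predecessors, which is what we have.
    subgraph = {module: depends_on[module] for module in needed}
    return list(TopologicalSorter(subgraph).static_order())


print(build_order("svg", DEPENDS_ON))  # ['euclid', 'svg']
```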

Maven has, by default, two mechanisms that can be used for this purpose:

Inheritance

I have highlighted the elements that seem most likely to be useful in the AMI stack.

Aggregation

Dependencies

Maven's "dependencies" and "dependencyManagement" mechanisms are largely independent from Maven's inheritance and aggregation mechanisms.

Archetypes

To check whether my understanding of multi-module Maven projects was in line with that of experienced Maven users, I generated, on my local PC, an instance of each of the 13 Maven archetypes that describes itself as multi-module:

4 99 245 390 435 437 438 503 890 1238 1241 1324 1690

These mostly follow the same structural pattern:

project
|-pom.xml          (packaged as "pom")
|
|-module1
| |-pom.xml
| |-module1-files   (packaged as "jar"/"war"/etc)
|
|-module2
| |-pom.xml
| |-module2-files   (packaged as "jar"/"war"/etc)
|
...

As I expected, given my findings above about Maven inheritance and aggregation, none of them nest anything that is packaged as a JAR or WAR, etc.
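For anyone wanting to check an existing project against this pattern, the following sketch reads a top-level POM and reports its packaging and the modules it aggregates. The sample POM and the `describe_pom` helper are hypothetical; the `<packaging>` and `<modules>` elements are standard POM elements.

```python
import xml.etree.ElementTree as ET

POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}

# Hypothetical aggregator POM matching the layout sketched above.
PARENT_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.example</groupId>
  <artifactId>project</artifactId>
  <version>0.1.0</version>
  <packaging>pom</packaging>
  <modules>
    <module>module1</module>
    <module>module2</module>
  </modules>
</project>"""


def describe_pom(pom_text):
    """Return (packaging, list of aggregated module paths) for a POM."""
    root = ET.fromstring(pom_text)
    # Maven treats a missing <packaging> element as "jar".
    packaging = root.findtext("m:packaging", default="jar", namespaces=POM_NS)
    modules = [m.text.strip() for m in root.findall("m:modules/m:module", POM_NS)]
    return packaging, modules


print(describe_pom(PARENT_POM))  # ('pom', ['module1', 'module2'])
```

A POM that lists modules but is not packaged as "pom" would not fit the pattern above.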

This structure might look familiar: it is also given in the Maven introduction.

If you want to see a multi-module project with nested parent POMs, here is an example.

Git submodules

The AMI stack appears to be a suitable use-case for Git submodules:

Maven SCM plug-in

Maven's SCM plug-in is, in theory, capable of interacting with Git, and of performing clone/pull/etc operations. Whether it is sophisticated enough to work with Git submodules, I do not know. I am looking into this. At time of writing (early June 2017), the SCM plug-in has an unresolved feature request asking for Git submodules to be supported.

Submodule support shouldn't be necessary, perhaps, as long as Maven is capable of cloning from the correct remote for each module into a suitable folder, which it should be able to.

Wildcat plugins

Many maverick Maven developers release Maven plugins outside the Apache project. Some potentially good options for the AMI use-case are:

Alternatively, one can just tell Maven to invoke Git directly: https://stackoverflow.com/a/29796988/

ghost commented 7 years ago

@petermr, I am making progress on this, but clearly could use a bit more time, and the book.

I hope to have a firm set of recommendations available for you by the end of the week. In the meantime, here are my preliminary recommendations, assuming the AMI stack is to be considered a multi-module project:

ghost commented 7 years ago

Re-opening, as issue has not been resolved and seems to have been closed erroneously. (Cf. https://github.com/ContentMine/meta/issues/5#issuecomment-308723497 .)

ghost commented 7 years ago

As people reading this may know, I moved on to other work before making the firm recommendations mentioned above. Anyhow, in the interest of tying up loose ends, here are some potentially useful pointers.

First, here is a table of reasonable ways to manage multi-module Maven projects, and their implementation requirements (short of shell scripts or other ad-hoc solutions).

| Option | Git: repo per module or per project | Maven submodules: beneath or level with parent module | CI dependency handling requires Git submodules or subtrees | CI dependency handling requires a persistent Maven repository |
|---|---|---|---|---|
| 1 | Module | Beneath | 1 | 0 |
| 2 | Module | Level | 0 | 1 |
| 3 | Project | Beneath | 0 | 0 |
| 4 | Project | Level | 0 | 0 |

The third and fourth options seem to be preferred by professional Maven users. See, for instance, this book, or this example (which was also referenced in a previous comment). Unfortunately, the size of the Norma and AMI modules makes the third and fourth options unappealing for the AMI stack, as developers wishing to work only on the smaller modules in the stack would be forced to download the larger ones (even if they only stored the desired subdirectories).

Last time I checked, ContentMine used the second option, which was proving difficult to extend to the whole AMI stack. Extending it to the whole stack might be facilitated by using Travis's caching facilities. Alternatively, moving to GitLab would make GitLab CI's artifacts facilities available, which might be more amenable to ContentMine's needs than Travis's caching facilities are.

Finally, what about the first option? I'm afraid I won't have time in the immediate future to work much more on it myself (unless contracted to do so), but I have made some headway. See the "refactor-maven-as-multimodule" branches, where present, in my clones of the AMI stack modules, e.g. https://github.com/sampablokuper/cm-pom/commits/refactor-maven-as-multimodule to get a sense of direction. (Please note that I consider those branches to be fluid and may force-push to them without notice.)