21cmfast / 21cmFAST

Official repository for 21cmFAST: a code for generating fast simulations of the cosmological 21cm signal
MIT License
58 stars 37 forks source link

Decision on branching/versioning #175

Closed steven-murray closed 3 years ago

steven-murray commented 3 years ago

This issue is here to help us discuss and decide (at least initially) on a way to manage version releases of the code, and, very much related, how to setup our branching.

Firstly, the only thing I'm really firm on is that we should be following semantic versioning. In brief, this means that any released version should be called vX.Y.Z, where X is a major version and is updated if any of the changes introduced are breaking changes. Y is a minor version which is updated if any new (non-breaking) features are introduced (either physical or code features), and Z is a patch version which is updated if a bug fix is introduced.

One thing to keep in mind is that whatever we decide on we can change in the future, if we find it's not suiting us (except for the semantic versioning!)

There are a few different kinds of thing we need to decide on, and they all kind of affect each other. Let me lay out some of these options for different categories, and then I'll try to present a few wholistic plans that we can perhaps vote on.

Components To Decide On

Version Storage Location

Currently, the version itself is saved as a string in the main __init__.py as __version__, and is manually updated whenever a new version is created. Most places that need to refer to the version can programmatically access that version, so that there's only one number to update, with two exceptions: (1) the CHANGELOG, which also has to be manually updated, and (2) the git tag for the version.

Another alternative is to use setuptools_scm, which enables setting the version based on the latest git tag. That way, the definitive version of the code is in the git tag itself, and the __version__ in __init__ just knows how to access that. This takes away one of the exceptions from the current method. This method also has some other benefits -- the version can be extended to contain information about how far the current code is from that version exactly, and can contain the current git hash. This aids reproducibility.

To be succinct, the two options here are:

  1. String in __init__.py
  2. Git tag via setuptools_scm.

Version Bumping Tool

As I already said, currently we manually change the string version in __init__.py to change the official version. This has a number of pros/cons, eg. it can be easy to get the version "wrong" (update the wrong number). But it's also more flexible, in that if we decide that a certain version is to be set, we can do it directly. Since the update is in the codebase itself, it also means that a whole PR must be made, reviewed, and merged before the version is updated. This could be seen as a pro or a con!

Some alternatives here are:

  1. Manual updates (current)
  2. Manual git tag updates (if using setuptools_scm)
  3. Use bump2version to automatically update either the __init__.py and/or the git tag (should also be able to configure it updating the changelog, I think). Note that this is still manual in the sense that we have to run bumpversion minor or something like that (and have to decide on which semantic part to update).
  4. Use python-semantic-release. I think it only works with the string-in-__init__ kind of version location, but it has the advantage of deciding for you which part of the version to update, and updating your changelog etc.

Another thing to decide in this category (but this is highly related to what kind of branching structure we adopt, see below), is "how automatic" we want the version updates to be. It is possible to make them essentially fully automatic, by using Github Actions and setuptools_scm. Let me list some options for automatic-ness:

  1. Not at all automatic: manually update the string version (or with bumpversion) in a particular branch, and let someone review and merge the PR, then manually deploy to PyPI. Cons: can easily forget to deploy to PyPI, takes another review/test cycle. We cannot do this directly on master since master is protected on github.
  2. Semi-automatic: use something like python-semantic-release inside a GitHub action to automatically create a new PR that updates the relevant version digit, and updates the CHANGELOG etc. based on the commit history (see caveats below).
  3. Fully automatic: use a GitHub action to detect the next version number based on the commit history, and make a git tag based on it, which, if using setuptools_scm is enough to create the new version (and can trigger PyPI deploy). Cons here (other than the caveat of relying on the commit history) is that it won't update the CHANGELOG, so all the current changes will still be under the dev heading. Maybe this is OK, and we just update the CHANGELOG on the next commit, but it's still a bit annoying. Also, unless we specify a kind of interesting branching setup, this will happen on EVERY PR to master, which means the version will bump on every PR. This may or may not be desirable.

Some notes on the commit history: the only real way to automatic version bumps (i.e. to know which digit to bump) is to go off the commit history. This requires everyone using a specific commit message template. I am actually very much in favour of this in general -- I think using a specific convention for commits makes searching and reading the commit history nicer as well. A popular convention is the AngularJS convention. The convention can be somewhat enforced on all developers by using commitizen, but I've never used it. The cons of this approach that I can see are (1) a little extra overhead for development and (2) if the conventions are not obeyed properly, then an automatic update will be both wrong and we may not be able to intervene.

Deployment Method (to PyPI)

Currently, the method for deployment is that I manually run python setup.py sdist; twine upload on my own computer whenever the version is manually bumped. It's easy to forget to do this.

It's fairly easy to setup up automatic deployment to PyPI using Github Actions, however the trick is deciding when to deploy and how to trigger that. This will depend a lot on our branching and choices for automation above. Some options:

  1. Keep the same (i.e. manual upload from one person's computer). Cons: could forget, version manager could move on to other things...
  2. Trigger upload on any git tag push to master: I think this might only work with setuptools_scm, though perhaps we could get it to work with bump2version. I guess it could work with any system so long as we also do a git tag after a version update. This would remove the second con of (1), but not the first. Also adds the ability to have the git version be inconsistent with the actual version in the __init__.py if not using setuptools_scm or some other automatic version tool.
  3. Trigger deploy on PR merge to master ONLY from some specific branch pattern (eg. branches named vX.Y.Z, or release-XXX). That would mean we'd have to stick to a certain kind of branching scheme, but I think everything else could be automated.
  4. Trigger deploy on any merge to master, but have branches set up so that only happens when we want it to (see below).

Note that usually deployment to conda-forge is automatic based on deployment to PyPI.

Changelog Update

We keep a CHANGELOG to let users know what's changed in each version. Currently, this is manually managed, and we try to make developers remember to update the changelog on every PR, adding what's changed in that PR under the dev heading. Then, when it comes time to update the version, that dev gets changed to the version number, and a new empty dev section is added. Note that since doing this means changing actual files in the repo, it must be done in a PR (either a PR dedicated to cutting a release, or just in the final PR that's going into the release). I don't see any real way around this, which is why the "fully-automated" idea of version bumping presented above doesn't quite work.

There are however some options for compiling the changelog automatically (still has to be checked in a PR) from the commit history. For example, python-semantic-release does this for you. This would mean that developers don't have to remember to update the changelog in each PR, they just have to write decent commit messages. Either way, I guess we get to have final say on the output before merging in the PR.

Release Timing Strategy

A lot of what we decide will depend on how we want to time releases. I can think of three options:

  1. Always release whenever we do any kind of update, as soon as possible (i.e. every merged PR). Pros: really simple, and users get the benefits of updated code ASAP. Cons: we could go up in versions really quickly (even major versions), which looks bad (like the code isn't mature and stable), but is also annoying for users, because they have to re-write their code any time a major version update is done. It's nicer as a user to get a whole bunch of updates that aren't backwards incompatible, then ONE big backwards incompatible change that they can deal with. In general, it's perfectly fine for minor/patch versions to be really high (even better for patch versions), but you want the major version to be pretty low.
  2. Release based on pre-determined sets of Milestones for major and minor versions (not patches). I.e., we as a group decide that "these features will make it into v3.1", and then when those features are implemented, we cut that release. In between these releases, we can release as many hotfix patches as we like. This is especially important for major releases. In this case, there's no set time of release, just whenever we get those things done.
  3. Time-based release: release each kind of update on a certain cadence (eg. minor versions every three months, and major versions once a year), and whatever is done then is merged. This has the pro of giving users a kind of definite time-scale for stability. It has the con that none of us are really full-time on this, and updates don't necessarily happen with this kind of regularity.
  4. Mix of 2 and 3. We could do something like "time of minor release is at least 3 months, but must be after this Milestone is completed". I actually think this would be rather successful, especially if patches have no timescale (i.e. fixes are put out as soon as possible).

Branching

I've already referred many times to branching, because how we manage branches is going to have a big impact on our other choices. There are two big streams of thought for branching (and lots in between): continuous-delivery (or the "GitHub Flow" model) or the git-flow model. The first is much simpler (and essentially what we're doing now), while the second is more complicated (perhaps too complicated for us). However, while the first is really useful for web-development, where no other libraries are depending on yours, and you just want to get changes out and visible as quickly as possible, the second is really useful for libraries on which other code depends. We could argue that 21cmFAST is somewhere between -- it has library-like components, and certainly other people's code will rely on it, but it's really used more as an application.

There are a few considerations for branching. First, there is the possibility to "protect" branches in GitHub, which means that code will never be merged into that branch without passing tests and other checks (possibly reviews). The protected branches can be arbitrary (and can be set based on patterns, so eg. vX.Y.Z could be protected). Currently, we protect master like this. This makes master essentially "always correct". For some of the above choices of automatic deployment etc., we could also make master always correspond to the latest deployed version on PyPI. This may or may not be useful (pros are that if someone installs directly from github, they're gauranteed to have a "release", not something in-between. But if someone is doing that, it's likely that they might be looking for something slightly more bleeding-edge).

Secondly, on GitHub, merging branches is an opportunity to insert a GitHub Action that can "do stuff", like deploying, bumping versions, etc. We can try use that to our advantage.

Thirdly, the reason to use branches at all is to develop new code in isolation, and have the ability to (1) discard the new changes before it affects anything else, and (2) control the timing of the inclusion of that code into the "official" branch. Some interesting scenarios pop up in this regard. For example, if one has finished work on a "feature" (say it's not API breaking, so inclusion should essentially bump the minor version), and merges it into "master" (but doesn't actually bump the version and deploy yet -- so users don't get the change, since we're waiting for a few new features to be added before bumping), but then we notice a bug that we really should fix straight away, then we can't merge that to master and deploy without also deploying the feature. So instead of a nice little patch update, we're getting a full minor update, before we possibly intended to. This is compounded in the case that someone is working on an API-breaking update.

Clearly, the choice of branching is going to depend on the choice of release timing strategy a lot.

Here's a few options:

  1. GitHub-Flow: protect only master, and let all features be developed branches that are merged directly to master. Pros: simple, no change from current method, all updates are quick to deploy. Cons: no way to manage releases (i.e., can't do 2-4 of the release timing strategy options).
  2. git-flow: protect master and develop. All feature branches are branched off develop, and merged back there directly. As many PRs as we want in a version can be merged, and then a vX.Y.Z branch (or something similar) is created off of develop, which is purely for doing last-minute stuff before release (eg. bumping the version, fixing the changelog). This is merged into master, and deploy actions are automatic off of that branch. Bugfixes are branched off master directly, and merged back into there. Pros: not too complex, most automation is enabled, can combine minor feature updates into one release. Cons: no way to isolate major updates (i.e. once a major update is merged to develop, the next release must be a major update, barring patches).
  3. Protect master and any branch named vX.Y. At any time, there may be multiple vX.Y branches (eg. v3.1, v3.2 and v4.0). Any feature is set to merge to which version it is milestoned to be merged to. Later versions constantly rebase onto earlier versions, and must be merged after earlier versions (of course). Patches can be directly merged to master. Pros: not too complex, most automation is enabled, enables blocking on major versions. Cons: may be a little harder to maintain than the git-flow strategy, since there's not a central location to rebase on (i.e. develop) for feature branches. Also, since the version branches are protected, one can't directly make the final bump-version kinds of changes to the branch (you'd have to create another PR that does that, to be merged into vX.Y which then gets merged to master). Noting that each merge requires testing (which takes a while), this might not be desirable.
  4. A slight extension of (1), in which while PRs are merged directly to master, they are not cut for release straight away. Instead, we let PRs accumulate until we think everything's done, and then make a vX.Y PR to do the final version-bumping stuff. Pros: really simple, automation enabled by matching to the branch name, can accumulate changes before release. Cons: can't get around fact that once a minor update is in, hotfixes can't be released without increasing the minor version. Thus it will result in larger version numbers OR slower bugfix updates.

Note that in all approaches in which one creates a specifically-named release branch, full automation of updating the version number is not possible -- at least, I don't know of a code that will automatically create a new branch/PR based on the automatically-detected next number. I guess it could still be done, especially with the new GitHub API. One could still use something like python-semantic-release to guess the next version number, and then manually make that branch.

I think I'll leave the issue there for now, and let people discuss this at a high level. Soon, I'll put forward some particular scenarios combining each of these options. Especially looking for input from @andreimesinger, @qyx268 and @BradGreig.

qyx268 commented 3 years ago

I had a bit thoughts. While one can argue on both side that increased automation can lead to more or less freedom, I, in general, prefer maximizing it...

Therefore, I vote for setuptools_scm and bumpversion. I'm not sure if we want to use python-semantic-release as for CHANGELOG, I do prefer updating it manually as git commit often contain unnecessary contents in particular for new users and we probably don't want them to update CHANGELOG as this is viewed by other users.

Perhaps there is a way to cherry pick content in git log when updating the CHANGELOG using python-semantic-release, but then it might just be like copying paste to CHANGELOG. There is anther reason that I prefer CHANGELOG to be written manually preferably by admin who approve the PR. I don't think it worth outlining too many details in it as git log should already serve this purpose and CHANGELOG should be concise. Then it is the same for PyPI and releasing, I believe this is the admin's responsibility rather than usual developer.

I think perhaps having a master_dev for developers is a better option?

steven-murray commented 3 years ago

Both @BradGreig and @qyx268 agree that using essentially the git-flow branching model, in which version updates are done specifically through a dedicated PR is the best idea. Probably using setuptools_scm is better, just because it gives a more explicit version. Trying to use bump2version is a good idea if possible.

As for CHANGELOG, we should update it manually. Which means we should always encourage devs to update the changelog in their PRs. Should be on a checklist so we remember.

Someone should check out commitizen in practice to see if it would be too much overhead. If it's not too much overhead, it will ave two benefits: 1) nice commits, 2) the ability to semi-automatically set the version update.

I will try to write up a specific "manual" for how this should work, and see if everyone agrees on details.

steven-murray commented 3 years ago

So, here's how I think it would look, based on this discussion.

Firstly, we create a develop branch, which is protected in the sense that we can't push directly to it, and all features must be made in PRs that must pass tests/coverage/other checks before merging to develop. Note that API-breaking features must NOT be merged to develop unless it has been decided that the next version is to be a major version update. They can sit and wait as PRs until such time as they are ready. To make a new minor release would then look like this:

  1. The admins decide all necessary features have been added for this particular release, and create a release/vX.Y branch off of develop.
  2. No new features may be merged to develop at this point, until the release branch is merged.
  3. Bugfixes may be directly merged into the release/vX.Y branch.
  4. The admins go through a checklist of items to do in the release branch:
    1. Ensure the changelog looks right.
    2. Run bumpversion minor, which should update the CHANGELOG version, and also make a git tag, which updates the package version via setuptools_scm.
  5. Merge the PR to master. Github Actions will then automatically deploy the code to PyPI.

Any bugfixes that are urgent can be made in fix/fix-name branches, and merged directly to master. I can probably set up a Github Action that automatically updates the version number in this case, and deploys (since we know the version to update is the patch version).

Somehow, we then also have to merge master back into develop. This is hard because develop is protected. It would be nice if it could happen automatically -- I'll look into it.

Note that since the versioning happens via git tag, if anything goes wrong and we have to manually update the version on master, we can. This will also auto-trigger deployment.

steven-murray commented 3 years ago

Just a note that using bumpversion is really out of the question, since it requires updating text in the setup.cfg at the very least. This means we have to do it before merging, which means the git tag would point to the pre-merged commit, rather than the merged one.

steven-murray commented 3 years ago

There's a few things that don't quite work in the above scheme:

  1. The git tag must be made on master directly. However, there's no way for master to know what kind of version bump to make, so the tag must be done locally and then git push --tags. This is annoying, but I can't find a way around it. There's no way for Github Actions to know where the merge is coming from (which is crazy!) so it can't figure out what kind of bump to do based on the branch name. I can't think of any other information that it could use, other than the commit messages, which we already decided weren't stable enough to let it happen automatically from them. So, manual it is.
  2. The way to get a release into both master and dev is just to make two PRs. That can be done manually, but I also may be able to make it semi-automatic.

One more consideration: if the "main" branch we work off is dev, we should probably make that the "default branch". Another option is to use master as our default branch, acting like dev, and make a production branch or something like that, which would act like master.

qyx268 commented 3 years ago

In this case, let's do master + production branch then.