Proposal: Releasing Versions of the Whole-Cell E. coli Model

CovertLab / wcEcoli

Whole Cell Model of E. coli


Tracking Versions

Version Numbers

We use semantic versioning for our version numbers, except that we drop the patch number. In broad strokes, this means that our version numbers take the form major.minor, for example 1.0. We can also specify pre-releases like 1.0-beta.1. For any version that includes breaking changes (i.e. code that someone else wrote against our public methods might no longer work), we increment the major version. For all other (i.e. backwards-compatible) changes, we increment the minor version. New major releases will generally go along with papers, while new minor releases will usually contain minor bug fixes.
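As a rough illustration of the numbering scheme, here is a minimal Python sketch that parses major.minor versions (with an optional alpha/beta suffix) and computes the next release number. The names `Version`, `parse_version`, and `bump` are hypothetical and not part of wcEcoli. Resetting the minor number to 0 on a major bump follows the semantic versioning convention; it is an assumption here.

```python
import re
from typing import NamedTuple, Optional

# Accepts "1.0", "v1.0", and pre-releases such as "1.0-beta.1".
_VERSION_RE = re.compile(r"^v?(\d+)\.(\d+)(?:-(alpha|beta)\.(\d+))?$")


class Version(NamedTuple):
    major: int
    minor: int
    stage: Optional[str] = None      # "alpha", "beta", or None for a full release
    stage_num: Optional[int] = None  # the N in "-beta.N"

    def __str__(self) -> str:
        base = f"{self.major}.{self.minor}"
        if self.stage is None:
            return base
        return f"{base}-{self.stage}.{self.stage_num}"


def parse_version(text: str) -> Version:
    """Parse a major.minor version string; raise ValueError if malformed."""
    match = _VERSION_RE.match(text)
    if match is None:
        raise ValueError(f"not a major.minor version: {text!r}")
    major, minor, stage, num = match.groups()
    return Version(int(major), int(minor), stage, int(num) if num else None)


def bump(version: Version, breaking: bool) -> Version:
    """Return the next full release: major bump if breaking, else minor bump."""
    if breaking:
        return Version(version.major + 1, 0)
    return Version(version.major, version.minor + 1)


print(bump(parse_version("1.0"), breaking=False))  # 1.1
print(bump(parse_version("1.3"), breaking=True))   # 2.0
```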

Commits and Tags

When we release the model, we usually squash all our commits into a single release commit. This keeps the commit messages in wcEcoli private. However, this does not mean that we have exactly one commit per release. For example, we might add commits to fix bugs or update documentation without doing a new release. You also might want to split the release for your paper across multiple commits. For example, if some of your data were generated using an earlier version of the model, you might want to include 2 commits: one that includes changes up to that earlier version and one for the rest of the changes. That way, you can refer to your versions by commit hashes in your paper.

Instead of tracking versions with commits, we track them with tags. These tags are named with the version number, e.g. v1.0, and are associated with releases on GitHub. It's good to include in the tag message a description of what the release is for. Then, you can specify the tag in your paper. Tracking versions with tags has a number of benefits:

Tags are more user-friendly than commit-hashes since they are human-readable.

Releases, which are built from tags, are easily accessible with GitHub's web interface.

When users clone the repository, they'll get the most up-to-date code by default, including any fixes we made since the last release.
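To make the tagging step concrete, the sketch below builds the git commands for an annotated tag with a descriptive message and for pushing that tag. It is an illustration under assumptions: the helper names `tag_release_commands` and `run_all` are hypothetical, the remote name `origin` is assumed, and the example message is made up. By default it only prints the commands (dry run) rather than executing them.

```python
import shlex
import subprocess
from typing import List


def tag_release_commands(version: str, message: str,
                         remote: str = "origin") -> List[List[str]]:
    """Build git commands that create and push an annotated tag vX.Y."""
    tag = f"v{version}"
    return [
        # -a makes an annotated tag; -m stores the description in the tag.
        ["git", "tag", "-a", tag, "-m", message],
        # Push only this tag to the remote.
        ["git", "push", remote, tag],
    ]


def run_all(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


# Dry run: prints the two commands without touching any repository.
run_all(tag_release_commands("1.0", "Snapshot for the published paper"))
```

The GitHub release itself can then be created from the pushed tag in the web interface.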

Pre-Releases

When submitting a paper for review, you might want to make code available to reviewers without making a new release. For example, you might want to address reviewer comments before you make a new release. To handle this, we create pre-releases. These are versions just like those described above, except they have alpha or beta added to the end to signal that they are not yet complete. For example, let's say you're making a big new release that will be v3.0. You could create v3.0-beta.1 and make that available to reviewers. Then, you could address their comments in v3.0-beta.2. Once the paper's accepted and you've made any last changes, you can release v3.0. When you create the releases for v3.0-beta.1 and v3.0-beta.2 on GitHub, you can mark each one as a pre-release so that GitHub labels it as such. This tells users that the code is not yet ready for them to use.
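The ordering implied by this example (v3.0-beta.1, then v3.0-beta.2, then v3.0) can be sketched as a sort key. This is only an illustration: `release_order` and `is_prerelease` are hypothetical helpers, and ranking alpha before beta is an assumption based on the usual convention.

```python
import re
from typing import Tuple

_TAG_RE = re.compile(r"v?(\d+)\.(\d+)(?:-(alpha|beta)\.(\d+))?")

# Pre-releases sort before the full release with the same major.minor.
_STAGE_RANK = {"alpha": 0, "beta": 1, None: 2}


def release_order(tag: str) -> Tuple[int, int, int, int]:
    """Sort key that places pre-releases before their final release."""
    match = _TAG_RE.fullmatch(tag)
    if match is None:
        raise ValueError(f"unrecognized version tag: {tag!r}")
    major, minor, stage, num = match.groups()
    return (int(major), int(minor), _STAGE_RANK[stage], int(num or 0))


def is_prerelease(tag: str) -> bool:
    """True for tags that should be marked as a pre-release on GitHub."""
    return release_order(tag)[2] < _STAGE_RANK[None]


tags = ["v3.0", "v3.0-beta.2", "v2.1", "v3.0-beta.1"]
print(sorted(tags, key=release_order))
# ['v2.1', 'v3.0-beta.1', 'v3.0-beta.2', 'v3.0']
```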

If you want to avoid putting your code into the WholeCellEcoliRelease repository until after review, you can create a new temporary repository just for reviewers. One easy way to do this is to clone the WholeCellEcoliRelease repository and add the temporary repository as another remote. Then you can set up your tags and push to the temporary repository. Once the paper is accepted, you can push to the WholeCellEcoliRelease repository to make your releases public.
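One possible shape for the temporary-repository workflow is sketched below as a list of git commands. Everything specific in it is an assumption: the URLs are placeholders, the remote name `reviewers`, the working directory `release-clone`, and the branch name `master` are illustrative, and the helper names are hypothetical. The functions only build the commands; they do not run them.

```python
from typing import List


def reviewer_repo_commands(release_url: str, reviewer_url: str, tag: str,
                           message: str, workdir: str = "release-clone",
                           remote: str = "reviewers",
                           branch: str = "master") -> List[List[str]]:
    """Commands to share a tagged pre-release via a temporary repository."""
    return [
        # Clone the public release repository into a working directory.
        ["git", "clone", release_url, workdir],
        # Add the temporary reviewer repository as a second remote.
        ["git", "-C", workdir, "remote", "add", remote, reviewer_url],
        # Tag the pre-release (after committing the new snapshot).
        ["git", "-C", workdir, "tag", "-a", tag, "-m", message],
        # Push the branch and its tags to the reviewer repository only.
        ["git", "-C", workdir, "push", "--tags", remote, branch],
    ]


def publish_commands(workdir: str = "release-clone",
                     branch: str = "master") -> List[List[str]]:
    """After acceptance: push the same branch and tags to the public repo."""
    return [["git", "-C", workdir, "push", "--tags", "origin", branch]]


for argv in reviewer_repo_commands(
        "https://example.com/WholeCellEcoliRelease.git",  # placeholder URL
        "https://example.com/temporary-review-repo.git",  # placeholder URL
        "v3.0-beta.1", "Pre-release for reviewers"):
    print(" ".join(argv))
```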

Well said, @U8NWXD!

Goals

The primary goal is to allow people to run the code to reproduce the published results.

The secondary goal is to allow them to read, understand, and tinker with the code.

Much has been written about program reproducibility, and it's far from a solved problem, especially with floating point math. Frankly, Python and its libraries aren't built with this in mind. Do what you can to increase code reproducibility.

What we've done to date

rsync'd files from the wcEcoli repo and committed those (not even a squash merge).
Ditto for narrow changes made to improve the how-to docs and the ability to reproduce the runtime environment.
Added plots to match iterations when writing the Science paper.
Added Docker files to improve the ability to reproduce the runtime environment.
Accepted a couple requirements.txt updates from GitHub's @dependabot to get library security patches.
- I rejected one @dependabot PR (jinja2==2.11.3) since that release is incompatible with Python 2.
- @dependabot opened another PR yesterday to get a security patch in PyYAML==5.4. We need to either confirm that this library update won't disturb this code snapshot or just reject it.
To prepare for your code release, I created the release tag v1.0 titled Science-2020-07-24, with this description:
This release snapshot Science-2020-07-24 goes with the paper Simultaneous cross-evaluation of heterogeneous E. coli datasets via mechanistic simulation published in Science, 24 July 2020. See docs/README.md for info on setting up the Python 2.7 runtime environment to run this release. (The next release will contain lots of work done since this snapshot forked off, and it runs on Python 3.8.)
- I used the GitHub web UI to make the tag and the release. It takes care of PGP signing and saves the title somewhere outside the git tag object.
There are some doc updates in the release repo that still need to get merged into the working repo.

Release Procedure

This seems like a good pattern for a new release for a new published paper:

A tag name of v + the semver major.minor version number (next: v2.0),
a release title that identifies the journal and publication date, and
a description explaining that this repo contains a snapshot for a specific published article (not a collaborative repo open to Pull Requests), with a link to that article and other key points such as the required version of Python. [We might have to edit the tag description to add the link when the article gets published.]
I would make the published article refer to the snapshot by the tag name (v1.0) and the release name (Science-2020-07-24), and not by a git commit number.
Use minor version numbers in follow-up releases for the same published paper.
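The naming pattern above can be captured in a small sketch. The `PaperRelease` class and its fields are hypothetical, and the description wording is only an example, not the text used for any actual release.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PaperRelease:
    major: int
    minor: int
    journal: str
    published: date
    python_version: str
    article_url: str = ""  # may not be known until the article is published

    @property
    def tag(self) -> str:
        """Tag name: v + major.minor, e.g. v1.0."""
        return f"v{self.major}.{self.minor}"

    @property
    def title(self) -> str:
        """Release title: journal plus ISO publication date."""
        return f"{self.journal}-{self.published.isoformat()}"

    def description(self) -> str:
        """Example release description covering the key points."""
        lines = [
            f"This release snapshot {self.title} goes with an article "
            f"published in {self.journal}.",
            "This repo is a snapshot for that article, not a collaborative "
            "repo open to Pull Requests.",
            f"Requires Python {self.python_version}.",
        ]
        if self.article_url:
            lines.append(f"Article: {self.article_url}")
        return "\n".join(lines)


release = PaperRelease(1, 0, "Science", date(2020, 7, 24), "2.7")
print(release.tag)    # v1.0
print(release.title)  # Science-2020-07-24
```

A follow-up release for the same paper would keep the major number and increment the minor one, e.g. `PaperRelease(1, 1, ...)`.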

[Any changes or additions to this procedure?]

Assumptions

Whole new releases will be rare, and they're coming from a linear branch (master) in a single repo (wcEcoli) so a linear release history will work.
- If, however, we want to release a fork such as the operon branch without first merging that into master, then we'll need to create either a branch in the WholeCellEcoliRelease repo or a separate release repo.
- If multiple papers are in the works at the same time, try to make them use the same release snapshot.

Pre-Releases

Perhaps we should put each pre-release (for reviewers) in its own branch. It can have tags if they're useful. Using GitHub releases could confuse people, but it's doable using prerelease version numbers, as you wrote.
Note that an "alpha release" is by definition delivered to testers internal to the organization, and a "beta release" is by definition for external testers. The organization in this case is the Covert lab, whose members have access to the wcEcoli repo, so there's probably no reason to put alpha releases in the release repo. "alpha" and "beta" are QA terms. A lot of the industry is confused about this.

Changelogs

Changelogs are very useful but when making a new snapshot associated with a new published article, maybe we can settle for a high level summary.
