clamsproject / clams-python

CLAMS SDK for python
http://sdk.clams.ai/
Apache License 2.0

"continuous deployment" and `app_version` in the app metadata #114

Closed keighrim closed 1 year ago

keighrim commented 1 year ago

tl;dr This conversation made me question the usefulness (or harmfulness) of the app_version in the app metadata. Scroll down to see my proposal (note: don't confuse app_version and analyzer_version).

Some background

  1. As we all understand, the CLAMS project as a whole started from the fruits (some premature then) of the LAPPSgrid project, and the basic architecture of the clams-app is based on the ideas we used in lapps-services. The app_version and analyzer_version fields in the app metadata were directly adopted from version and toolVersion in the lapps service metadata specification.
  2. LAPPSgrid was largely a Java project; more concretely, versioning, building, and deployment of services were mostly based on the Maven build system and ecosystem, where we had the concept of -SNAPSHOT versions that work as a catch-all version between proper releases. E.g. all compilations between v1.0.0 and, say, v1.1.0 will use the v1.1.0-SNAPSHOT version number.
  3. One of the core values we have been pursuing in the clams project (and its ancestor projects and research), not only as an engineering practice but also as a scientific study, is the reproducibility of pipelines. Namely, we want to record as many details as we can in the output MMIF files so that anyone who wants to re-create a pipeline is able to do so.

Now, the problem

In clams apps, 1) we don't have the concept of a "snapshot" (or "nightly" or "bleeding-edge" or whatever you call it), and 2) the app_version value in the app metadata is manually maintained by the app developer. I think this is an evil practice, as any source code pulled from the develop branch of an app will report a false app_version number (most likely from the previous stable release, if there was a merge from the stable branch) to the users.

Proposal

App developers keep using git tags for stable releases of apps as we have been doing. However, developers should not maintain the app_version as a manually hard-coded value in the source code (usually in app.py). Instead, the app_version value should be injected programmatically by either the clams-python SDK (at runtime) or the build process (at build time).

Implementation

As we are all using git for managing codebases of clams apps, and also expecting future apps to use git (and github/gitlab) as well, I think we can write a simple logic that first looks for a git tag on the current source code tree (for stable versions) and, when none is found, falls back to the commit hash. This will provide more fine-grained traces of which source code was used to produce certain annotations in the resulting MMIF. When there's not even a commit hash (i.e. the code is running in a directory that does not have a .git directory), it finally falls back to some string (a short one like unknown or a longer one like this-app-is-running-without-version-control-information-so-the-pipeline-cannot-guarantee-its-reproducibility) so that users can recognize the "un"-reproducibility of the pipeline.
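The fallback chain described above could be sketched roughly as follows. This is only an illustration under assumptions: the names detect_app_version, _git, and FALLBACK_VERSION are hypothetical and not part of clams-python, and the sketch assumes the git executable is on the PATH whenever version control information is expected to be found.

```python
import subprocess

# Assumed placeholder string; the proposal also floats a much longer one.
FALLBACK_VERSION = "unknown"


def _git(args, cwd):
    """Run a git command in cwd; return stripped stdout, or None on any failure."""
    try:
        result = subprocess.run(
            ["git", *args],
            cwd=str(cwd),
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # OSError covers a missing git binary or a missing directory;
        # CalledProcessError covers "not a git repository" and "no tag found".
        return None
    return result.stdout.strip() or None


def detect_app_version(source_dir="."):
    """Hypothetical helper: tag on HEAD, else short commit hash, else a fallback."""
    # 1. A tag pointing exactly at the current commit (a stable release).
    tag = _git(["describe", "--tags", "--exact-match"], source_dir)
    if tag:
        return tag
    # 2. No tag: fall back to the abbreviated commit hash.
    commit = _git(["rev-parse", "--short", "HEAD"], source_dir)
    if commit:
        return commit
    # 3. No version control information at all.
    return FALLBACK_VERSION
```

Calling detect_app_version() from the app's source directory would then yield a tag name on a tagged release commit, a short hash on any other commit, and the fallback string when git or the .git directory is missing.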

Looking forward to hearing from others.

keighrim commented 1 year ago

Talked about this yesterday with @marcverhagen and @kelleyl, and we all agreed on the problem and the solution. We also agreed that the version injection should happen at runtime, specifically in __init__() of the ClamsApp abstract base class. Marc suggested using the git describe command to generate version strings.
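A minimal sketch of what runtime injection via git describe might look like. The class VersionedApp is a hypothetical stand-in, not the actual ClamsApp class, and describe_version, the process method, and the "unknown" fallback are all assumptions made for illustration.

```python
import subprocess
from abc import ABC, abstractmethod


def describe_version(source_dir=".", fallback="unknown"):
    """Hypothetical helper: return `git describe` output, or fallback on failure."""
    try:
        out = subprocess.run(
            # --tags: consider lightweight tags too
            # --always: print the abbreviated commit hash when no tag is reachable
            # --dirty: append "-dirty" when the working tree has local changes
            ["git", "describe", "--tags", "--always", "--dirty"],
            cwd=str(source_dir),
            capture_output=True,
            text=True,
            check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return fallback
    return out or fallback


class VersionedApp(ABC):
    """Hypothetical stand-in for an abstract app base class."""

    def __init__(self, source_dir="."):
        # The version is computed once at construction time, not hard-coded.
        self.app_version = describe_version(source_dir)

    @abstractmethod
    def process(self, data):
        """Placeholder for whatever the concrete app actually does."""
```

With this shape, a subclass never sets app_version itself; it inherits whatever git reports for the source tree it runs from.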

mrharpo commented 1 year ago

git describe

This will work for local development environments, but might not be best for production images.

It would require:

  1. git to be installed on all app images
    • Definitely possible, but is a substantial requirement, especially for a single function.
  2. at least a shallow clone of the .git information copied into the image.
    • Again, very possible, but it seems like an anti-pattern.

Options

Inject as environment variable

APP_VERSION=x.y.z

Then in something like version.py or config.py:

from os import environ

# Read the version injected into the environment (e.g. APP_VERSION=x.y.z)
APP_VERSION = environ.get('APP_VERSION')
# Defaults to None if not set
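If a missing variable should be visible in the output rather than silently becoming None, the lookup could be wrapped with an explicit fallback. This is a sketch only: the function name app_version_from_env and the "unknown" default are assumptions, chosen to match the fallback string floated earlier in the thread.

```python
import os


def app_version_from_env(env=None, var_name="APP_VERSION", default="unknown"):
    """Hypothetical helper: read the version from the environment with a fallback."""
    # Accept any mapping so the lookup can be exercised without touching os.environ.
    env = os.environ if env is None else env
    value = env.get(var_name, "").strip()
    # An unset or blank variable is treated as "no version information".
    return value or default
```

For example, app_version_from_env({"APP_VERSION": "1.2.3"}) returns "1.2.3", while app_version_from_env({}) returns "unknown".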

Copy into container at build time

Automated versioning

There are automation tools to help with this, such as versioneer.

(I don't have any experience with this tool, just found it with a quick google search)

Other?

Very open to other options, but automation or environment variables have my vote so far.