bioboxes / rfc

Request for comments on interchangeable bioinformatics containers
http://bioboxes.org
MIT License

Improved versioning #212

Open abremges opened 7 years ago

abremges commented 7 years ago

Triggered by https://github.com/bioboxes/camiarquikr/issues/1

To me, it seems that there is no (easy) way to pull a biobox for a specific tool version? I propose to somehow implement this. I'm aware that you can specify the biobox version via its commit hash, but there seems to be no obvious mapping of tool version <-> biobox version. Also, if the biobox is automatically built from the master, then we have no real control over which tool version is packaged (unless we build manually).

What am I missing?

abremges commented 7 years ago

On a related note, I propose that bioboxes should print both version numbers (tool and container build) upon execution.
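
A minimal sketch of what such a startup banner might look like, assuming the image build sets two environment variables. The names `BIOBOX_TOOL_VERSION` and `BIOBOX_BUILD` are hypothetical and are not defined by the bioboxes specification:

```python
import os


def version_banner(env=None):
    """Return one line naming the packaged tool version and the container build."""
    env = os.environ if env is None else env
    # Both variable names are assumptions made for this sketch only.
    tool = env.get("BIOBOX_TOOL_VERSION", "unknown")
    build = env.get("BIOBOX_BUILD", "unknown")
    return f"tool version: {tool}; container build: {build}"


if __name__ == "__main__":
    # An entrypoint could print this before running the requested task.
    print(version_banner())
```

Falling back to "unknown" keeps the banner usable for images that were built without either variable.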

fungs commented 7 years ago

IMO tool versions could translate to docker tags but that's quite specific to Docker and makes it hard to support other backends in the future.

To support meta-info, it would be quite easy to implement specific command line arguments. With the template container I created a while ago, for instance, that would mean storing a string in a standard text file. This Debian-based template is much more modular and structured than what most current bioboxes are built upon, but unfortunately it has not been used lately. It is also easier to set up, as all YAML data is available to the programs via global shell variables.

https://github.com/bioboxes/bbx-base

fungs commented 7 years ago

However, I would prefer not to have to call a container for metadata, and instead deliver it via the biobox specification and link that to the biobox.

michaelbarton commented 7 years ago

> On a related note, I propose that bioboxes should print both version numbers (tool and container build) upon execution.

I think this could be addressed via bioboxes/rfc#213.

michaelbarton commented 7 years ago

> IMO tool versions could translate to docker tags but that's quite specific to Docker and makes it hard to support other backends in the future.

I agree. In the short term, I imagine the majority of use is via Docker. We did, however, discuss supporting Singularity versions of bioboxes at the CAMI workshop. Perhaps you could suggest an example schema for supporting bioboxes across multiple containerisation platforms?

michaelbarton commented 7 years ago

> To me, it seems that there is no (easy) way to pull a biobox for a specific tool version? I propose to somehow implement this. I'm aware that you can specify the biobox version via its commit hash, but there seems to be no obvious mapping of tool version <-> biobox version. Also, if the biobox is automatically built from the master, then we have no real control over which tool version is packaged (unless we build manually).

We had a similar discussion at the JGI. The problem we identified with using Docker tags is that they are not immutable, and you cannot necessarily rely on third-party developers sticking to semantic versioning when releasing and tagging new images.

As an example, imagine that in CAMI you benchmark a Docker image tagged v1.0 and report the results. In the meantime the developer finds a bug or makes an improvement, pushes a new version, then tags that as v1.0 of the image. Anyone doing docker pull on the image using the tag v1.0 would get the new version, rather than the one benchmarked in CAMI.

Using the Docker image SHA256 digest means that there is no ambiguity, and you get exactly the same image that was previously benchmarked or used by a collaborator. In nucleotides we instead use version numbers almost as the human-readable names of an image, but use the SHA256 digest when benchmarking.
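
To illustrate the difference between a tag and a digest reference, here is a rough sketch rather than how nucleotides actually implements it. It assumes a hypothetical lookup table from human-readable version to the digest recorded at benchmark time; the image name `example/assembler` and the all-zero digest are placeholders:

```python
import re

# A digest-pinned reference has the form name@sha256:<64 lowercase hex characters>.
_DIGEST_REF = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def is_pinned(reference):
    """Return True if the image reference is pinned by SHA256 digest."""
    return _DIGEST_REF.match(reference) is not None


# Hypothetical table: version label -> digest recorded when the image was benchmarked.
BENCHMARKED = {
    "v1.0": "sha256:" + "0" * 64,  # placeholder digest, not a real image
}


def pinned_reference(name, version, table=BENCHMARKED):
    """Translate a human-readable version into an unambiguous digest reference."""
    return f"{name}@{table[version]}"


if __name__ == "__main__":
    print(is_pinned("example/assembler:v1.0"))  # a tag can be moved later
    print(pinned_reference("example/assembler", "v1.0"))
```

The version label stays useful for people, while anything that must be reproducible goes through the digest reference.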

michaelbarton commented 7 years ago

> This Debian-based template is much more modular and structured than what most current bioboxes are built upon, but unfortunately it has not been used lately. It is also easier to set up as all yaml data is available via global shell variables to the programs.

My experience is that the software tools that are more generally adopted are those that have more documentation and examples. We heard previously that creating a biobox image is not easy for those not familiar with the bioboxes specification, e.g. bioboxes/rfc#131 and bioboxes/rfc#165. Perhaps if you have some time, you could create a tutorial for the bioboxes website on how to create a biobox image using your bbx-base image? This could lead to this image being used more as a base. I would certainly be happier to use a different base image if it made creating bioboxes easier.

One concern I have, however, is that flags such as --shell or --all-the-bioboxes-docker-options-and-mounts, or the use of environment variables for paths, are not part of the bbx specification. The bbx-base image is the only image that supports these. I think we should discuss their pros and cons, and add them to the RFC if having all bioboxes support them would improve the user experience.

fungs commented 7 years ago

True, documentation increases usage, but before that we should come to an agreement that all images created by the bioboxes team adopt a common approach and base image.

The mentioned options are a convenience to make implementing new images easier. The bioboxes specs do not cover some parts, such as the internal folder structure, and I've tried to come up with a unique bioboxes namespace in UNIX style. The options help to coordinate these kinds of things. I don't think they have to be part of the bioboxes spec; they are just the sugar you get when using the template image. They need not be part of the spec because they are only relevant to how the container works internally (variables, for instance); the caller does not need to know about this and simply calls a black box with bioboxes syntax.

michaelbarton commented 7 years ago

On 05/19, Johannes Dröge wrote:

> True, documentation increases usage, but before that we should come to an agreement that all images created by the bioboxes team adopt a common approach and base image.

I agree.

> This Debian-based template is much more modular and structured than what most current bioboxes are built upon, but unfortunately it has not been used lately.

My original comment was to address your above point.

> I don't think they have to be part of the bioboxes spec; they are just the sugar you get when using the template image. They need not be part of the spec because they are only relevant to how the container works internally (variables, for instance); the caller does not need to know about this and simply calls a black box with bioboxes syntax.

My concern is that if these command line options are supported by some bioboxes but not others, it could be a source of confusion.

abremges commented 7 years ago

Maybe I'm missing the point, but is there any reason why https://github.com/bioboxes/biobox-minimal-base won't work as such a common starting point?

fungs commented 7 years ago

@abremges: Good point. The code for the two is partially redundant. I did not know about the second one, but it seems we have several base images. That is what I meant to address with my comment.

The former is two years old and has

The latter is younger and has

So I guess it would be very easy to implement the second on top of the first, or to unite them into a single image.

@michaelbarton: concerning the command line parameters, these are completely optional and only require that a task name does not start with two dashes. However, the provided options are very useful when you want to find out where parameters are mapped internally, and for debugging your container during development. I think that providing them has more benefits than removing them just because they are not part of the specification. As I said, they should not be specified because they are very specific to the implementation of the base image.
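
The two-dash rule can be sketched in a few lines. This is only an illustration of the constraint, not the actual bbx-base code (which is not written in Python); the function name is hypothetical and `default` is used purely as an example task name:

```python
def classify_argument(arg):
    """Classify the first container argument under the two-dash rule.

    Anything starting with two dashes is treated as a base-image
    convenience option; everything else is treated as a task name.
    """
    return "option" if arg.startswith("--") else "task"


if __name__ == "__main__":
    for arg in ("--shell", "default"):
        print(arg, "->", classify_argument(arg))
```

As long as no task name begins with two dashes, the two kinds of argument cannot collide, so a caller using only bioboxes syntax never triggers an option by accident.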