EuBIC / ReproducibleMSGuidelines

This project aims at defining a set of guidelines for reproducible mass spectrometry-based experiments.
https://eubic.github.io/ReproducibleMSGuidelines
Creative Commons Attribution 4.0 International

New item: Use of Docker images #15

Open jgriss opened 5 years ago

jgriss commented 5 years ago

Category: Workflow Software (or new section?)

Name: If containers are used in the analysis, they should be referenced through stable version numbers
Category: "bronze"
Description: If containers, such as Docker or Singularity containers, are used in the analysis they should be referenced through stable version numbers. This explicitly relates to not using the ":latest" tag for Docker images as these are bound to change upon new releases of the software.
Fields: "all"

Name: If containers are used in the analysis, these should be available in a public repository
Category: "silver"
Description: If containers, such as Docker or Singularity containers, are used in the analysis they should be available through a public repository, such as Docker Hub.
Fields: "all"

Reason

The use of containers is becoming increasingly common and is increasingly supported by workflow systems, such as Nextflow.
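The bronze item can be checked mechanically. Below is a minimal sketch, assuming image references follow the usual `[registry[:port]/]name[:tag][@digest]` shape; the function names and the example image names are hypothetical and only serve as illustration. A reference counts as pinned if it carries a digest or an explicit tag other than "latest".

```python
from typing import Optional, Tuple


def split_image_reference(ref: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Split 'name[:tag][@digest]' into (name, tag, digest)."""
    # A digest, if present, follows '@' (e.g. '@sha256:...').
    name, _, digest = ref.partition("@")
    # A colon only marks a tag if it appears after the last '/',
    # otherwise it belongs to a registry port such as 'localhost:5000'.
    last_segment = name.rsplit("/", 1)[-1]
    tag = None
    if ":" in last_segment:
        name, _, tag = name.rpartition(":")
    return name, tag, (digest or None)


def is_pinned(ref: str) -> bool:
    """True if the reference has a digest or an explicit tag other than 'latest'."""
    _, tag, digest = split_image_reference(ref)
    if digest:
        return True
    # A missing tag is resolved to 'latest' by Docker, so it is not pinned.
    return tag is not None and tag != "latest"


if __name__ == "__main__":
    # Hypothetical image names, for illustration only.
    for ref in ("example.org/lab/search-tool:1.4.2",
                "example.org/lab/search-tool:latest",
                "example.org/lab/search-tool"):
        print(ref, "->", "pinned" if is_pinned(ref) else "NOT pinned")
```

Such a check could be run over the image names listed in a workflow configuration before publication; it only inspects the reference string and does not verify that the image exists.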

ypriverol commented 5 years ago

I think it is better to reference a community standard such as biocontainers.pro for storing the containers. BioContainers defines guidelines for container creation and also provides an architecture to deploy and find bioinformatics containers.

In the same way that we suggested the data should be in a ProteomeXchange repository, the container should be in BioContainers.

jgriss commented 5 years ago

This is a very good point! But I guess we should still have an item that says "if you use containers, reference them following these guidelines"?

bittremieux commented 5 years ago

I agree with @jgriss that the version number should be made explicit, rather than ":latest".

Storage in BioContainers should probably be silver/gold level, whereas just having the container available somewhere should suffice as well.

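Maybe something like this:

  • Bronze: container file publicly available on a third-party resource, can also just be on GitHub
  • Silver: container uploaded to an official image registry, i.e. Docker Hub, ...
  • Gold: container available via BioContainers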

ypriverol commented 5 years ago

> I agree with @jgriss that the version number should be made explicit, rather than ":latest".

The BioContainers guidelines have not allowed the "latest" tag for the past two years. All containers must carry a proper version.

> Storage in BioContainers should probably be silver/gold level, whereas just having the container available somewhere should suffice as well.

Keeping the container only in your own namespace will create more issues: if the namespace disappears, what can you do with the version of the Docker file?

> Maybe something like this:
>
>   • Bronze: container file publicly available on a third-party resource, can also just be on GitHub
>   • Silver: container uploaded to an official image registry, i.e. Docker Hub, ...
>   • Gold: container available via BioContainers

I think we remove a lot of complexity by saying that if a container is used, it should be deposited in BioContainers. Done.

jgriss commented 5 years ago

@ypriverol I agree that we remove complexity but we also exclude quite a few use-cases.

BioContainers are great for packaging single tools. What if a group uses one container to run their whole workflow (an example that the Nextflow guides show quite a lot) and also puts their custom scripts into that container? The container would never be suitable for BioContainers.

The same is, by the way, also true for IsoProt.

Therefore, I prefer @bittremieux's suggestion.

ypriverol commented 5 years ago

> @ypriverol I agree that we remove complexity but we also exclude quite a few use-cases.
>
> BioContainers are great for packaging single tools. What if a group uses one container to run their whole workflow (an example that the Nextflow guides show quite a lot) and also puts their custom scripts into that container? The container would never be suitable for BioContainers.

The container is fine for BioContainers as long as it fits the guidelines: version, description, title, etc. You can put the custom scripts in the container or in a conda package. Multi-tool containers have also been supported for a year now.

> The same is, by the way, also true for IsoProt.

This is probably an example of why we need to put the container into BioContainers. My point is: if the container is in your namespace and you delete your namespace, the container will be gone, even if you have the version.

If we are creating guidelines for reproducible research, and the bioinformatics community already has two communities, conda and BioContainers, that define guidelines on how to build and deploy containers, what is the point of avoiding them? You can have your own container during the development process; however, when you are preparing your publication, your containers should be released and properly annotated using the BioContainers guidelines and namespace.

This is like saying that proteomics data can be made public on a university FTP server, when we have progressed a lot with ProteomeXchange.

> Therefore, I prefer @bittremieux's suggestion.

jgriss commented 5 years ago

Hi guys,

Based on this discussion and an offline one with @ypriverol I updated my proposal. Since we are the first guideline to target Docker containers as well, I believe that this should be highlighted.

Category: Containers

Name: If containers are used in the analysis, they should be referenced following, for example, the BioContainers guidelines
Category: "bronze"
Description: If containers, such as Docker or Singularity containers, are used in the analysis they should be referenced through stable version numbers. This explicitly relates to not using the ":latest" tag for Docker images as these are bound to change upon new releases of the software. Detailed suggestions can be found in the BioContainers documentation.
Fields: "all"

Name: If containers are used in the analysis, these should be available in a public repository using non-personal namespaces
Category: "silver"
Description: If containers, such as Docker or Singularity containers, are used in the analysis they should be available through a public repository, such as Docker Hub. The image should not be published under a "private", user-based namespace but under some kind of institutional namespace where long-term availability is ensured. This addresses the risk that, if a private namespace is used and the person changes careers, the namespace might be deleted and the images thus lost.
Fields: "all"

Name: Containers should be available in dedicated repositories such as BioContainers
Category: "gold"
Description: Dedicated namespaces for bioinformatics tools ensure minimum standards of the containers and their long-term availability. Additionally, they have mechanisms in place to also support a wider range of platforms, such as BioConda.
Fields: "all"
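As a rough illustration of the gold item, the sketch below tests whether an image reference points into a BioContainers namespace. The prefixes are an assumption about where BioContainers images are published (the `biocontainers` organisation on quay.io and on Docker Hub); the function name and the example image names are hypothetical. The check only looks at the reference string and says nothing about whether the image exists or meets the BioContainers guidelines.

```python
# Assumed BioContainers namespaces; adjust if images are published elsewhere.
BIOCONTAINERS_PREFIXES = (
    "quay.io/biocontainers/",
    "docker.io/biocontainers/",
    "biocontainers/",
)


def in_biocontainers_namespace(image_ref: str) -> bool:
    """True if the reference starts with one of the assumed BioContainers prefixes."""
    # str.startswith accepts a tuple and matches any of its entries.
    return image_ref.lower().startswith(BIOCONTAINERS_PREFIXES)


if __name__ == "__main__":
    # Hypothetical image names, for illustration only.
    for ref in ("quay.io/biocontainers/sometool:1.0--0",
                "someuser/whole-workflow:2.1"):
        print(ref, "->", in_biocontainers_namespace(ref))
```

A reference in a personal namespace, like the second example, would at best reach the bronze level under the proposal above.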

jgriss commented 5 years ago

Hi guys,

I've now added the proposed new items to the document. If you agree I'll close this issue and we will continue the discussion (if needed) on the different items.