BioContainers / specs

BioContainers specifications
http://biocontainers.pro
Apache License 2.0

reformatting dockerfile template #68

Closed prvst closed 7 years ago

prvst commented 7 years ago

I'm going to propose a new template for the Dockerfile; I'm planning to clean it up and simplify it. The changes will include removing the cmd commands from the header, adding a METADATA tag for the header fields, and removing all those '#' signs.

prvst commented 7 years ago

Here is my proposal for a new template:

  1. No more '#' signs everywhere.
  2. The top of the file (the "header") should consist of only three pieces of information: the base image, the metadata and the maintainer (no update dates; those are recorded on GitHub).
  3. No excessive comments on obvious things like adding files or changing users; comment only on the important things.

I think Dockerfiles should not be used as "documentation". The important information is added to labels, and anyone can then read them by running docker inspect [OPTIONS] CONTAINER|IMAGE|TASK [CONTAINER|IMAGE|TASK...] (check the Docker documentation).

# Base Image
FROM biodckr/biodocker

# Metadata
LABEL version=1
LABEL software=Comet
LABEL software.version=2015020
LABEL description="basic local alignment search tool"
LABEL website="http://comet-ms.sourceforge.net/"
LABEL tags=proteomics
LABEL base.image="biodckr/biodocker"

# Maintainer
MAINTAINER Felipe da Veiga Leprevost <felipe@leprevost.com.br>

USER root

RUN apt-get clean all && \
    apt-get update -y && \
    apt-get upgrade -y && \
    apt-get install -y wget && \
    apt-get clean && \
    apt-get purge && \
    rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

ADD comet.2015020.linux.exe /home/biodocker/bin/

RUN chmod +x /home/biodocker/bin/comet.2015020.linux.exe

USER biodocker

WORKDIR /data

CMD ["comet.2015020.linux.exe"]
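
A minimal sketch of how the labels in a template like the one above could be read back programmatically. It assumes the JSON printed by docker inspect for an image, where labels sit under Config.Labels; the function name and the sample document are hypothetical and only illustrate the shape of the data.

```python
import json


def labels_from_inspect(inspect_json: str) -> dict:
    """Return the labels of the first object in `docker inspect` JSON output."""
    objects = json.loads(inspect_json)
    if not objects:
        return {}
    config = objects[0].get("Config") or {}
    # Labels is null when the image carries no labels at all.
    return config.get("Labels") or {}


# Hand-written sample in the shape `docker inspect <image>` prints (trimmed).
SAMPLE = """
[{"Id": "sha256:0000",
  "Config": {"Labels": {"software": "Comet",
                        "software.version": "2015020",
                        "tags": "proteomics"}}}]
"""

if __name__ == "__main__":
    for key, value in sorted(labels_from_inspect(SAMPLE).items()):
        print(f"{key}={value}")
```

In practice the input string would come from running docker inspect on a real image rather than from a hard-coded sample.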

@BioContainers/contributors: please let me know what you think.

ypriverol commented 7 years ago

Hi @prvst: I like this new metadata structure. A couple of ideas to be considered:

prvst commented 7 years ago

@BioContainers/contributors ; Updated:

# Base Image
FROM biodckr/biodocker

# Metadata
LABEL version=1
LABEL software=Comet
LABEL software.version=2015020
LABEL description="basic local alignment search tool"
LABEL website="http://comet-ms.sourceforge.net/"
LABEL documentation="http://comet-ms.sourceforge.net/parameters/parameters_201601/"
LABEL license="https://www.apache.org/licenses/LICENSE-2.0"
LABEL tags=proteomics
LABEL base.image="biodckr/biodocker"

# Maintainer
MAINTAINER Felipe da Veiga Leprevost <felipe@leprevost.com.br>

USER root

RUN apt-get clean all && \
    apt-get update -y && \
    apt-get upgrade -y && \
    apt-get install -y wget && \
    apt-get clean && \
    apt-get purge && \
    rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

ADD comet.2015020.linux.exe /home/biodocker/bin/

RUN chmod +x /home/biodocker/bin/comet.2015020.linux.exe

USER biodocker

WORKDIR /data

CMD ["comet.2015020.linux.exe"]

bgruening commented 7 years ago

I like this a lot for Dockerfile-based containers. We could even consider getting some Travis testing done to check the metadata and ensure consistency.

Nevertheless, for mulled-based containers this metadata does not make much sense. It's either encoded in a different way, not needed, or should be encoded further upstream, namely in the conda package.

Therefore, I would vote to make this spec mandatory for Dockerfile based containers.

bgruening commented 7 years ago

@prvst I have written a little bit about metadata for the involucro based containers here: https://github.com/BioContainers/specs/issues/69

ypriverol commented 7 years ago

@prvst As @bgruening said, this metadata can be encoded in different ways. Some of the fields can be optional, but some of them should be mandatory, like:

BTW, why do we have two versions here, version and software.version? As far as I remember, we agreed to have one container per software version, making both easy to identify. If not, we should reopen the discussion about versioning.

sauloal commented 7 years ago

@bgruening I understand that involucro is a wrapper around Docker, so I guess there's a way to 'convert' the variables present in the yml file into the metadata of the image, correct?

Adding the yml inside the image does not suffice, since it would require downloading, mounting and reading the file in order to get the information.

sauloal commented 7 years ago

@ypriverol @prvst I think the metadata has to be better described. For example, there can be multiple programs in a package, and there can be multiple websites.

I propose that we use the dot notation more efficiently.

e.g.

# Metadata
LABEL version=1
LABEL softwares=Comet,Planet
LABEL softwares.Comet.version=2015020
LABEL softwares.Comet.description="basic local alignment search tool"
LABEL softwares.Comet.website="http://comet-ms.sourceforge.net/"
LABEL softwares.Planet.version=2015020
LABEL softwares.Planet.description="basic local alignment search tool"
LABEL softwares.Planet.website="http://Planet-ms.sourceforge.net/"
LABEL tags=proteomics
LABEL base_image="biodckr/biodocker"
LABEL maintainer=Felipe

Please notice the comma-separated list of names, and that base_image no longer has a period, since periods would only be used to denote 'belongs to'.

Also, the maintainer could be added here so that all the data is in the metadata and the Dockerfile does not need to be 'parsed'.

Plural names would denote comma-separated lists or (period-)nested lists.
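
A minimal sketch of how a consumer could read this dot notation, assuming the labels have already been loaded into a plain dict of strings. The function name software_entries is hypothetical; the key layout (softwares as a comma-separated list, softwares.NAME.FIELD for per-program fields) follows the example above.

```python
def software_entries(labels: dict) -> dict:
    """Group `softwares.<name>.<field>` labels by the names listed in `softwares`."""
    names = [n.strip() for n in labels.get("softwares", "").split(",") if n.strip()]
    entries = {name: {} for name in names}
    for key, value in labels.items():
        parts = key.split(".")
        # Only three-part keys that belong to a declared program are collected.
        if len(parts) == 3 and parts[0] == "softwares" and parts[1] in entries:
            entries[parts[1]][parts[2]] = value
    return entries


EXAMPLE_LABELS = {
    "version": "1",
    "softwares": "Comet,Planet",
    "softwares.Comet.version": "2015020",
    "softwares.Comet.website": "http://comet-ms.sourceforge.net/",
    "softwares.Planet.version": "2015020",
    "tags": "proteomics",
    "base_image": "biodckr/biodocker",
}

if __name__ == "__main__":
    for name, fields in software_entries(EXAMPLE_LABELS).items():
        print(name, fields)
```

Labels for a program that is not listed in softwares are ignored here, which is one possible policy rather than part of the proposal.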

sauloal commented 7 years ago

@ypriverol if I remember correctly, version refers to the Dockerfile version. If a piece of software is updated but the Dockerfile is not (for example, apt-get-based programs or programs using the -latest suffix), we can update the program without changing the Dockerfile at all (except for the metadata).

bgruening commented 7 years ago

@sauloal it's not directly a wrapper around Docker; it copies layers at a very low level using the internal Go Docker API. I need to look at the API again to see if there is a way to set metadata. We can easily put it into the repository description on Quay.io if that is sufficient. I think most of the above attributes are not needed for automatically generated containers, and the rest we can add to the repository. What do you think about my bio.tools idea of using a standardized way to annotate these images?

sauloal commented 7 years ago

@bgruening . That's right.

If you are using an API, please try to find a way to add the YAML info to the container.

I don't think you should worry too much about conforming to our 'Docker-based container annotation' as long as you add a 'source: SOMETHING'. There's no point in forcing all the packages made in Anaconda to meet our demands, but if you add their information (which is already quite good), we will have the information from all 1000+ packages already available.

sauloal commented 7 years ago

@bgruening also, the importance of the metadata is that a scraper website can keep the database updated automatically instead of us having to do it manually.

sauloal commented 7 years ago

@bgruening I LOVE the idea of bio.tools with an ontology of software. Ontologies are fantastic for finding and discovering. Totally in favour.

BUT

not compulsory, though. Ontologies are verbose, complex and time-consuming to fill in. This could easily become a barrier to entry.

bgruening commented 7 years ago

Not if we do it automatically :) I imagine something like bio.tools cares about annotations and we just link our Containers to it. So querying goes over bio.tools.

sauloal commented 7 years ago

Oh. then perfect. we can do a lot with that. imagine a website with a tree-like view of all the programs we have :D

bgruening commented 7 years ago

Exactly! And small icons indicating Docker, Conda, DebianMed as installation method and all of them share one annotation.

sauloal commented 7 years ago

PERFECT. But for that to work I would rather not have a database but a scraper searching all the images under our umbrella and extracting the metadata. No sync needed.

thriqon commented 7 years ago

As the author of involucro I have to admit that there is no way to set labels yet. I'll consider this thread a feature request and open an appropriate issue with involucro soon. Support for labels can then be added to mulled easily (with the information the repo provides), probably to auto-mulled as well...

thriqon commented 7 years ago

I have to revise my previous comment. There is a way to set labels already available in involucro, even though it's well hidden behind the withConfig escape hatch. I'll provide you with docs tomorrow and start development on this for mulled later this week.

sauloal commented 7 years ago

Great!!!! Could you then re-build the current images with the metadata?

bgruening commented 7 years ago

@thriqon you can set labels during the build step with "-l, --label value  Set meta data on a container (default [])", I guess; I just looked it up.

@sauloal rebuilding 1800 images? No ;)

bgruening commented 7 years ago

@sauloal which kind of metadata do you want to see as label?

ypriverol commented 7 years ago

@thriqon you highlighted a very important point. I think whatever we discuss here as a community can be used to improve involucro. Thanks for your support and this idea.

@sauloal You just highlighted a very good point. When multiple software packages are built into one heavy container, I don't know the best way to record all the LABEL information about the software. In my view, a piece of software can contain multiple tools within the same framework, but we need to TAG the big framework and not the individual packages. For example, OpenMS, Galaxy, TPP, etc. all contain multiple tools, but we should TAG only the big framework. The problem is that if you open the scope to TAG multiple software components, people will start asking about the versions and the other tags. This is too complex.

ypriverol commented 7 years ago

@bgruening I guess @saulo is talking about being able to push the tags that we are defining here through involucro to the mulled containers.

sauloal commented 7 years ago

I was thinking of all/most of the tags already available in the YAML. You could even ALSO add the whole YAML as a single tag by encoding the file as a string. I have seen examples of JSON being added in a single tag.
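
A minimal sketch of the single-label idea, assuming the recipe metadata is already available as a dict and that the image is built with plain docker build, which accepts --label key=value. The label key biocontainers.metadata and both function names are hypothetical; how involucro or mulled would pass the label is not covered here.

```python
import json

# Hypothetical label key, chosen only for this illustration.
METADATA_KEY = "biocontainers.metadata"


def label_build_args(metadata: dict, key: str = METADATA_KEY) -> list:
    """Return `docker build` arguments that store the whole record in one label."""
    payload = json.dumps(metadata, sort_keys=True, separators=(",", ":"))
    # Returned as an argv list so no shell quoting of the JSON is needed.
    return ["--label", f"{key}={payload}"]


def decode_metadata_label(labels: dict, key: str = METADATA_KEY) -> dict:
    """Recover the record from the label dict of an inspected image."""
    return json.loads(labels[key])


if __name__ == "__main__":
    record = {"name": "bedtools", "tags": ["genomics"]}
    flag, pair = label_build_args(record)
    print(flag, pair)
    stored_key, _, stored_value = pair.partition("=")
    print(decode_metadata_label({stored_key: stored_value}))
```

The trade-off is that a single JSON label keeps the record intact but cannot be filtered by individual field without decoding it first.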

bgruening commented 7 years ago

@sauloal OK, that's something I can think about. Your only use case is to make it searchable by some scripts, right?

sauloal commented 7 years ago

@bgruening 1800. that's peanuts :D but seriously, if we want to be able to query them in the future, they will need to be re-made. and if my guess is correct, rebuilding with the tag will not create any new layers.

sauloal commented 7 years ago

@bgruening yes.

ypriverol commented 7 years ago

@sauloal @bgruening the whole idea of having tags is being able to search, find and also understand the containers. The more concrete metadata we add to make the containers reliable and easy to understand, the better.

bgruening commented 7 years ago

@sauloal that's not true ;) I can simply put it into the repository description. This is way simpler and has the benefit of being available to users visiting the Quay.io website as well. Take for example this repo: https://quay.io/organization/biocontainers/bedtools (would this, plus more info, be sufficient?). mulled-search, our command-line utility, already makes use of this information. It's more or less similar to docker search.

Another advantage is that I can add more information to the repo at any time without rebuilding the images.

ypriverol commented 7 years ago

We need to agree on how we will handle multiple software packages in the same container. My vote is for a single version and a single software name per container. But I also understand @sauloal's idea of tagging multiple software packages in the same container. What about this:

# Metadata
LABEL version=1
LABEL software=Comet
LABEL software.version=2015020
LABEL description="basic local alignment search tool"
LABEL website="http://comet-ms.sourceforge.net/"
LABEL website="http://another.url.net/"
LABEL documentation="http://comet-ms.sourceforge.net/parameters/parameters_201601/"
LABEL license="https://www.apache.org/licenses/LICENSE-2.0"
LABEL tags=proteomics
LABEL base.image="biodckr/biodocker"
LABEL included.software="comet-convert, mzXMLValidator, .."

In my view, some LABELs MUST always be present (software, version, description), while others can be optional (documentation). Also, some of the labels MUST be unique (software, version, description), whereas others can hold more than one value (comma-separated, as suggested by @sauloal).
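
A minimal sketch of a consistency check along these lines, of the kind that could run in CI. It assumes single-line LABEL key=value instructions and treats software, version and description as mandatory, as listed above; the function names are hypothetical. Note that Docker keeps only the most recently applied value when a label key is repeated, so two website lines leave just the second URL in the image, which is why the check reports duplicates.

```python
import shlex

REQUIRED = ("software", "version", "description")


def parse_label_lines(dockerfile_text: str) -> list:
    """Return (key, value) pairs from single-line `LABEL key=value` instructions."""
    pairs = []
    for line in dockerfile_text.splitlines():
        line = line.strip()
        if not line.upper().startswith("LABEL "):
            continue
        # shlex strips the quotes around values that contain spaces.
        for token in shlex.split(line[6:]):
            key, sep, value = token.partition("=")
            if sep:
                pairs.append((key, value))
    return pairs


def check_labels(dockerfile_text: str, required=REQUIRED) -> list:
    """List duplicate keys and missing mandatory labels."""
    seen = {}
    problems = []
    for key, value in parse_label_lines(dockerfile_text):
        if key in seen:
            problems.append(
                f"duplicate label {key!r}: {seen[key]!r} is overridden by {value!r}"
            )
        seen[key] = value
    for key in required:
        if not seen.get(key):
            problems.append(f"missing required label {key!r}")
    return problems


if __name__ == "__main__":
    sample = "\n".join([
        "LABEL version=1",
        "LABEL software=Comet",
        'LABEL website="http://comet-ms.sourceforge.net/"',
        'LABEL website="http://another.url.net/"',
    ])
    for problem in check_labels(sample):
        print(problem)
```

Multi-valued fields would therefore need the comma-separated encoding rather than repeated LABEL lines.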

sauloal commented 7 years ago

@bgruening the Quay.io method would work for Quay.io, but not for Docker Hub or for whichever registry we use next. The idea of the metadata is that it is an intrinsic property of the container itself. If you use the exact same information for another proprietary meta-information system, great. But this way, anyone who downloads the image can check its metadata without any help (using docker inspect, for example).

sauloal commented 7 years ago

@ypriverol i think a better compromise would be to use singular for single programs and plurals for 'heavy' ones

Single

# Metadata
LABEL version=1
LABEL software.name=Comet
LABEL software.version=2015020
LABEL software.description="basic local alignment search tool"
LABEL software.website="http://comet-ms.sourceforge.net/"
LABEL software.website="http://another.url.net/"
LABEL software.documentation="http://comet-ms.sourceforge.net/parameters/parameters_201601/"
LABEL software.license="https://www.apache.org/licenses/LICENSE-2.0"
LABEL tags=proteomics
LABEL base_image="biodckr/biodocker"

Heavy

# Metadata
LABEL version=1
LABEL softwares.Comet.version=2015020
LABEL softwares.Comet.description="basic local alignment search tool"
LABEL softwares.Comet.website="http://comet-ms.sourceforge.net/"
LABEL softwares.Comet.website="http://another.url.net/"
LABEL softwares.Comet.documentation="http://comet-ms.sourceforge.net/parameters/parameters_201601/"
LABEL softwares.Comet.license="https://www.apache.org/licenses/LICENSE-2.0"
LABEL softwares.Planet.version=2015020
LABEL softwares.Planet.description="basic local alignment search tool"
LABEL softwares.Planet.website="http://comet-ms.sourceforge.net/"
LABEL softwares.Planet.website="http://another.url.net/"
LABEL softwares.Planet.documentation="http://comet-ms.sourceforge.net/parameters/parameters_201601/"
LABEL softwares.Planet.license="https://www.apache.org/licenses/LICENSE-2.0"
LABEL tags=proteomics
LABEL base_image="biodckr/biodocker"

ypriverol commented 7 years ago

@sauloal this heavy approach is difficult to parse, standardize and understand.

sauloal commented 7 years ago

@ypriverol in my opinion, this is the easiest way to create a 'dictionary'. the other option in my view would be to have one line with all fields:

softwares=name:Comet;version:1.2... etc
softwares=name:Planet;version:2.2... etc
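
A minimal sketch of parsing such a one-line record, assuming fields are separated by ';' and each field is split at its first ':' so that URL values survive. The function name parse_packed and the sample value are hypothetical.

```python
def parse_packed(value: str) -> dict:
    """Parse `key:value;key:value` into a dict, splitting each field at the first colon."""
    record = {}
    for field in value.split(";"):
        if not field.strip():
            continue
        key, sep, val = field.partition(":")
        if not sep:
            raise ValueError(f"field without ':' separator: {field!r}")
        record[key.strip()] = val.strip()
    return record


if __name__ == "__main__":
    packed = "name:Comet;version:2015020;website:http://comet-ms.sourceforge.net/"
    print(parse_packed(packed))
```

A value that itself contains ';' would break this encoding, which is one reason the dotted keys may be easier to handle.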

thriqon commented 7 years ago

@sauloal, it will create new layers iff the contents of the image layer generated by the second run are not exactly equal, i.e. always. Mulled doesn't use the default docker builder.

ypriverol commented 7 years ago

Hi @sauloal, why not first settle the definition for single-software containers, reach a final definition, and then open a new issue where we can discuss in more detail how to encode multiple software packages in the metadata? @bgruening @prvst what do you think?

prvst commented 7 years ago

I agree with @ypriverol , we need to define one thing at a time

ypriverol commented 7 years ago

I have been taking a look at other initiatives like BioConda to define the metadata. I like this option:

# Metadata
LABEL version=1
LABEL software=Comet
LABEL software.version=2015020
LABEL description="basic local alignment search tool"
LABEL website="http://comet-ms.sourceforge.net/"
LABEL website="http://another.url.net/"
LABEL documentation="http://comet-ms.sourceforge.net/parameters/parameters_201601/"
LABEL license="https://www.apache.org/licenses/LICENSE-2.0"
LABEL tags=proteomics
LABEL base.image="biodckr/biodocker"
LABEL command="comet"
LABEL command="validator"

We cannot reproduce all the complexity of a software framework. So I suggest a more practical approach where we define all the commands that can be used in the tool, as tools.

prvst commented 7 years ago

I don't much like the idea of having commands defined or exemplified as metadata; some software has a lot of options, and that can pollute the header with text.

prvst commented 7 years ago

OK, we will stay with that version for the first iteration, adjustments will be done with time, so I will close the issue in order to push the matter forward.