BioContainers / specs

BioContainers specifications
http://biocontainers.pro
Apache License 2.0

Introducing layer-donning based containers for biocontainers #69

Closed bgruening closed 7 years ago

bgruening commented 7 years ago

Introducing layer-donning based containers for biocontainers

TL;DR: The technology behind this new approach to creating Docker containers is called layer-donning and was implemented by Jonas Weber as part of his master's thesis. The code is Open Source and lives at https://github.com/involucro/involucro

I would like to propose involucro-based containers as a new mechanism for creating BioContainers. This mechanism can create containers without any Dockerfile, reusing already existing recipes from other package managers like brew, conda or alpine. The trick is simple: you have a build container with conda or brew installed, and you have a runtime container (defined by BioContainers), currently busybox. You run conda install my_pkg in your build container and copy the newly created layer to the runtime environment - done. You end up with a container that is as small as possible - usually a few MB. Have a look at the bedtools containers: https://quay.io/repository/biocontainers/bedtools. Some slides can be found here: https://galaxy.slides.com/bgruening/docker-layer-donning/live#/ (not really good ones, I need to work on this).
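A minimal involucro invfile sketching the two-step flow above might look like this (a sketch only - image names, paths and the exact argument forms are illustrative; see the involucro README for the actual verbs):

```lua
-- Step 1: install the package inside a build container that ships conda.
inv.task('build')
   .using('continuumio/miniconda')
   .run('conda', 'install', '--yes', '-p', '/usr/local', 'bedtools')

-- Step 2: don the freshly created layer onto a minimal runtime image.
inv.task('package')
   .wrap('/usr/local')          -- directory whose contents become the new layer
   .inImage('busybox')          -- the small runtime base defined by BioContainers
   .at('/usr/local')            -- mount point of the layer in the resulting image
   .as('quay.io/biocontainers/bedtools')
```

Running `involucro build package` would then produce the small runtime image without any Dockerfile being involved.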

Impact for biocontainers

TL;DR related to https://github.com/BioContainers/specs/issues/60

Metadata is currently not saved inside the container. We do not compile a container, we copy layers, so there is no compile-and-save step for metadata. Nevertheless, to be compliant with our BioContainers specification we could put the conda YAML file, which already contains all the metadata, into the container.

However, I would like to see something more deeply integrated with other projects like bio.tools. For example, consider the tool locarna in bioconda. It's an ordinary package with a container, but in this repository I have added this file: https://github.com/bioconda/bioconda-recipes/blob/master/recipes/locarna/biotools.yaml. It defines all the fields needed by the bio.tools ontology. This means bio.tools could query the bioconda recipes and know: hey, there is a conda package and there is a Docker container. We do not need to do anything and can stay out of ontologies.

What we have so far

At the moment we have 1693 containers for ~1500 packages. This can easily be extended if I get more computational resources. Currently, every 24h all packages from bioconda are converted into containers via Travis, very transparently, using a repository called auto-mulled.

Galaxy and the Galaxy tool SDK have gained involucro and mulled support, so we have tooling to create such containers locally on demand - also in Galaxy. Moreover, Galaxy (in the dev version) can use these containers without any extra step of annotating tools. I hope this can be counted as proof of concept that this approach is working.

How I think we can proceed with the integration

We should review all the containers we currently deploy using Dockerfiles to avoid namespace clashes with the current involucro/bioconda-based containers. In the future, for best practices and as part of the architecture specifications (http://github.com/BioContainers/specs) in BioContainers, we should first review whether the Docker container can be built using bioconda packages; if not, we fall back to the current approach of manual Dockerfile creation. This will improve our build, testing and deployment system.

I have thought for a long time about whether this should be a separate project, because the technologies are so different. In the end, I think we all have the same aim - we want tools to be accessible to everyone, no matter which technology we are using. Here I propose a solution that is the easiest way to achieve this goal.

Let me know what you think. I appreciate any feedback and questions! Bjoern

xref: https://github.com/bioconda/bioconda-recipes/issues/2297 ping @jmchilton, @joncison

sauloal commented 7 years ago

+100 :D

ypriverol commented 7 years ago

@bgruening I really would like to see this in production in BioContainers. In my opinion, we should only evaluate how to encapsulate the metadata into each conda-based container.

We just need to carefully evaluate the namespace clashes, especially for containers that can be pushed to the BioContainers registries and are BioContainers-compliant, but are hosted in their own GitHub repository with their own Dockerfile definition.

bgruening commented 7 years ago

Let me know what you have in mind. The Travis cron job is running and pushes regularly to biocontainers at the moment.

ypriverol commented 7 years ago

I have been thinking about the workflow in https://github.com/BioContainers/specs#24-biocontainers-architecture to avoid namespace clashes, etc. My proposal is the following:

If a container is requested from the BioContainers community, we first evaluate the possibility of using mulled as the first-class option. If it is possible, then we generate the container using mulled. If the container can't be generated that way, then we use our common approach of a Dockerfile definition.

However, if a contributing project (let's say OpenMS; the OpenMS team @timosachsenberg @hroest) produces a container following the BioContainers specification, then we should be able to retire the previous mulled/biocontainers one and delegate the responsibility to the contributing project, avoiding namespace clashes/conflicts. This would make BioContainers a more federated system, centralised only by our registries (Docker Hub, quay.io) and the GitHub specification organisation.
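The decision flow described in the two paragraphs above could be sketched as a small policy helper (a sketch of the proposal only; the function and value names are hypothetical, not BioContainers code):

```python
def choose_build_strategy(has_bioconda_recipe: bool,
                          upstream_maintained: bool = False) -> str:
    """Pick how a requested BioContainer should be built.

    Mulled/involucro is the first-class option, a hand-written
    Dockerfile is the fallback, and an upstream project that
    maintains its own compliant container takes over its namespace
    entirely (the federated case discussed in this thread).
    """
    if upstream_maintained:
        return "upstream"
    return "mulled" if has_bioconda_recipe else "dockerfile"
```

For example, a tool with a bioconda recipe and no upstream-maintained container would resolve to the mulled path.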

jmchilton commented 7 years ago

@ypriverol This would be disappointing from a Galaxy perspective, and I think more broadly from a reproducibility and remix-ability perspective. If there is a conda-based recipe (and I know there is work in progress on this at least), then a consumer should know that it is being used. In Galaxy, that means my users can be confident they are running the same application inside or outside a container, and all users of fairly modern Linux OSes can run the same binaries - inside Docker or out. With blackbox Dockerfiles we don't know how the software is built, and we don't know whether the same recipe would work outside Docker on a different OS. If you are selling this infrastructure to developers, hopefully it helps that many more people and platforms could use it if it were packaged through Conda. Why build just for Docker, when with Conda you could build for any potential container technology, as well as platforms such as bare-metal laptops and supercomputers?

ypriverol commented 7 years ago

Hi @jmchilton, thanks for your comments. I understand your points, and we should then find the best way to avoid the namespace problem. In the near future we will have more contributors who would like to implement the BioContainers specification and push to the BioContainers registry using their own well-maintained Docker container. In my view, we will have more of those contributors, and even if we guide them or recommend that they build conda-based containers, they will probably prefer to avoid the conda-based system. Still, we need to find a compromise.

sauloal commented 7 years ago

@jmchilton @ypriverol @bgruening

Yesterday I spent a very long time talking to @bgruening. Although he agrees that metadata is necessary, his point is that the metadata of mulled packages is the mulled git repository, the conda repository and the quay.io repository.

@bgruening's point is that an API to query the three of them would give all the data necessary and, most importantly, the metadata would not be set in stone.

His proposal is that the mulled and biocontainers CLI would download the image AND the metadata. This would allow higher flexibility and, most importantly, higher consistency between different image formats, which might not all support metadata.
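One way to picture that proposal: the CLI fetches metadata records from each source and merges them into a single view. This is a minimal sketch under assumed field names (nothing here is actual mulled or BioContainers tooling):

```python
def merge_metadata(*sources: dict) -> dict:
    """Combine metadata dictionaries fetched alongside an image,
    e.g. from the bioconda recipe, the mulled git repository and
    the quay.io registry. Later sources take precedence, so the
    combined record can evolve without rebuilding the image."""
    merged: dict = {}
    for source in sources:
        merged.update(source)
    return merged

# Hypothetical example records; the field names are illustrative only.
recipe = {"name": "bedtools", "version": "2.26.0", "license": "GPL-2.0"}
registry = {"name": "bedtools", "registry": "quay.io/biocontainers"}
combined = merge_metadata(recipe, registry)
```

Because the merged record is assembled at query time, updating the metadata in any one source never requires touching the container layers themselves - which is exactly the "not set in stone" property described above.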

ypriverol commented 7 years ago

@sauloal Here we are not only discussing the metadata and the automatic deployment of containers, but also the best way to incorporate mulled/conda packages into #biocontainers. We should have a clear roadmap in the specification for all of our contributors. Here I would like to discuss the impact of this big change on BioContainers, the use cases and issues we should take into consideration, and the roadmap for contributors.

prvst commented 7 years ago

I feel that we are putting different aspects of the big picture into a single thread. We now have different ideas on the table, and we need to put all the pieces together in order to make all the projects agree with each other. We need to organize things better; maybe we need another conference. Until then, please be careful not to hijack the threads with different topics.

ypriverol commented 7 years ago

I think, @prvst, we should discuss here @bgruening's proposal of mulled as layer-donning for BioContainers. If this integration introduces new challenges, we should highlight them here and discuss them, including the metadata, etc. We can open more specific issues once we have a clear idea about all the challenges of the integration.

BTW, I think we should have a meeting to discuss some of these points. What about this Friday 14/10 (3:00 PM UK time)?

prvst commented 7 years ago

@bgruening I think the idea of using involucro is promising. Do you know who is maintaining that? What are the risks of adopting this strategy?

ypriverol commented 7 years ago

@jmchilton @bgruening How do you think we can handle this point and solve this issue?

Namespace clashes (highlighted by @bgruening): the OpenMS team @timosachsenberg @hroest would probably like to maintain their Dockerfile following the BioContainers specification and be able to push to our registries. However, we have our mulled/conda package for OpenMS. What should we do? How do we organize this? We have to define a clear path for including the layer-donning component. We have other projects planning to submit their containers as full providers, so we should lay out a clear roadmap to deal with these namespace clashes.

thriqon commented 7 years ago

@prvst, involucro is developed and maintained by me.

prvst commented 7 years ago

@thriqon oh, that's good. It's better when the maintainer is inside the project.

bgruening commented 7 years ago

Just wanted to add: it's a full Open Source project with all its goodies, and I would expect BioContainers to help maintain it if needed and if this is the future of containerization :)

@ypriverol we could give hand-crafted containers a tag-prefix so that workflow systems do not choose them by accident?
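Such a prefix would let workflow systems filter tags mechanically. A tiny sketch of the idea (the prefix value and function name are hypothetical; no convention was agreed in this thread):

```python
# Hypothetical marker for hand-crafted containers; the thread only
# floats the idea of a tag prefix, without fixing its spelling.
HANDCRAFTED_PREFIX = "custom-"

def auto_selectable(tags: list[str]) -> list[str]:
    """Return only the tags a workflow system may pick automatically,
    skipping hand-crafted containers marked with the prefix."""
    return [t for t in tags if not t.startswith(HANDCRAFTED_PREFIX)]
```

A workflow engine resolving a tool to a container would then simply never consider the prefixed tags unless a user asks for one explicitly.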

For the metadata discussion - what @sauloal summarized :) I agree that metadata is important, but I would like to cover this with tooling and a nice biocontainers CLI. This way we gain much more flexibility at zero cost and can extend our system dynamically without rebuilding containers, e.g. embedding bio.tools into your search results and extending metadata.

thriqon commented 7 years ago

For the sake of completeness: Involucro can label images if the wrap step is set up with this: .withConfig({labels = { software = "planet", version = 2 }}).
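Wired into a full wrap step, that could look like the following (a sketch only - image names and values are illustrative, and the exact ordering of the chained calls may differ):

```lua
inv.task('package')
   .wrap('/usr/local')
   .inImage('busybox')
   .at('/usr/local')
   .withConfig({labels = { software = "bedtools", version = "2.26.0" }})
   .as('quay.io/biocontainers/bedtools')
```

The labels end up in the image configuration, so registries and `docker inspect` can surface them without any metadata file inside the container filesystem.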

ypriverol commented 7 years ago

@thriqon This is possibly a dumb question, but I would like to have your response. Can involucro generate a Dockerfile, including the metadata, for each container that we push automatically to quay.io, so that it can also be pushed to our Docker Hub?

thriqon commented 7 years ago

In theory it is certainly possible to generate a Dockerfile containing some metadata with Involucro, but it is not possible to emulate the functionality offered by Involucro with a single Dockerfile. However, generating the rich set of metadata proposed here is not implementable in Mulled without considerable implementation overhead.

ypriverol commented 7 years ago

Thanks @thriqon for your explanation.

ypriverol commented 7 years ago

After working with this technology for a while, creating my own conda recipes and deploying them, I think we should support this strategy for most of our containers. I just released a new version of the registry-ui where users can look for containers in both registries, Docker Hub and quay.io. Thanks to @bgruening for his work here. We are now showing Dockerfiles or YAML files for both cases. I guess our community would be more than happy to maintain and contribute to involucro. Let us know, @bgruening and @thriqon, what the next step is to move this forward.

bgruening commented 7 years ago

Awesome news! I see the following next steps:

Overall, the feedback I get is very enthusiastic and most things are working; we should smooth out all the rough edges and make a very solid product out of it.

I already gave a workshop last week during the ELIXIR Rome meeting, and it might be good to have these kinds of events more often - for example, a dedicated contribution fest for Proteomics, Metabolomics and so on. The bioconda people would be delighted to join, I guess.

thriqon commented 7 years ago

As the developer of Involucro, I have to say I'd be glad to have additional people on board! There is a lot more that can be done, but since I'm no longer paid to work on it full-time, it has gotten rather sleepy...

If anyone is interested in contributing, let me know and I'll be happy to give you an introduction.

joncison commented 7 years ago

Good work! As for enhancing the metadata and playing nicely with bio.tools, we're very close now to releasing the candidate stable schema (https://github.com/bio-tools/biotoolsschema) for bio.tools. Once that's out, we can settle on a bio.tools-compatible YAML and a technical handshake for metadata sharing (especially import) with BioContainers et al.

bgruening commented 7 years ago

Just as a final remark before I close this issue, under https://quay.io/organization/biocontainers we have now 1924 different tools and libraries containerized. Happy new year!

prvst commented 7 years ago

starting the year by resolving an issue, nice! Happy new year!