Formalize container best practice (esp. for complex tools)

jmchilton commented 7 years ago

tl;dr - Should it be a best practice to (1) register combinations of requirements for complex tools and publish all needed combinations to a container registry or (2) should Galaxy just build complex containers as it needs to for such tools.

I think there is probably broad consensus that the "mulled" approach to building containers should be part of a best practice for using containers with Galaxy. From an operations perspective this produces both tiny containers that are very easy and quick to deploy and manage, from a reproducibility and support perspective this allows the same (best-practice Conda) binaries to work on bare metal or inside of a container, and from a developer perspective this will ideally become much more transparent than a Dockerfile-based approach.

The follow up recommendation is less clear in my opinion. We currently have thousands of containers for individual requirements that can be used with tools that work with BioConda and only have a single requirement tag. For tools that contain multiple requirement tags - which I contend are not a corner case but a very mainstream and typical use case - we could recommend two different things as a best practice.

(1) We could register every combination of requirements used in some repository and publish the required combinations to quay.io (I added the ability to do this with mulled when I ported it to galaxy-lib - xref https://github.com/galaxyproject/galaxy-lib/pull/15). There is a planemo issue to add tooling to make this fairly automatic https://github.com/galaxyproject/planemo/issues/646. I think it is important to give the developers the ability to build and test these things - but we could even make some of this automated by the tool shed as well.
(2) We could skip registering these tools and just rely on Galaxy's ability to build these on-demand. I've created an issue to make this a bit more robust and clear here https://github.com/galaxyproject/galaxy/issues/3673.

Put another way - should Galaxy (1) fetch the containers it needs or (2) build them.

Pros of (1) are:

From a Galaxy and admin perspective tools with multiple requirements are not handled differently than tools with single requirements.
I feel better about the reproducibility of this approach.
I feel better about the ability to exactly test the ultimate environments.
As hinted at in https://github.com/galaxyproject/galaxy/issues/3673 - this approach would work better with different deployment scenarios where nodes fetch there own containers by various mechanisms.
Provides more surface area to provide value added features - such as singularity containers.

Pros of (2) are:

This is more flexible adapts to new tools, requirements, channels, etc... on the fly.
Requires less upfront work by the tool author (or perhaps the tool shed).
No need to manage a large assortment of existing containers. If changes to the approach are needed we can just push a Galaxy patch and not need to update or rebuild containers. (Though I'm not sure we will ever need to rebuild anything if we get the testing right upfront).

Ping @bgruening, @mvdbeek, @jxtx.

bgruening commented 7 years ago

I actually see both approaches living in parallel. I think we should advertise building these containers upfront as best practice, but if they are not available we build them.

Somewhere on my ToDo list is to extend https://github.com/BioContainers/mulled and create a small tiny website to assemble conda packages and create mixed-mulled containers. The names should be normalised and hashed in a unique way. The aim is to get from a random assembled requirements.txt file (which the same packages) the same container back.

We could also think about to integrate this in the travis testing from IUC. So that IUC creates these containers on PR-merge.

I think there is a benefit in generating them outside of Galaxy, for the reasons you mentioned, but also because I want to generate more ... like Singularity images - and with this everyone can profit - and we get in turn more care and funding for BioConda.

jxtx commented 7 years ago

I actually see both approaches living in parallel. I think we should advertise building these containers upfront as best practice, but if they are not available we build them.

Do we register and push a container to an external repository when we build one?

I much prefer option 1 for reproducibility. I can see 2 being important for development, but I wouldn't like to see production Galaxy instances using this approach.

mvdbeek commented 7 years ago

I think we'll need both. We can't be sure that a dependency (in a container) really works (almost) everywhere until we have tested it in a bare bones container, so ideally the iuc tool tests would build and run (maybe also push on merge) the container. planemo could have a --local_container option for this.

For production instances we should probably not default to building locally. In addition to the reproducibility problem @jxtx mentioned, I think on busy sites building many containers at once could kill docker, in a way that is probably worse than activating many conda environments in parallel. Of course we could do the container building in a separate job on the cluster nodes, but then you'd have to build this at least once per worker or introduce some smarter logic to distribute the container.

Also if you build locally you will not know upfront if the built container will work, so what would you do if the container doesn't work? Rebuild until it does? That seems wasteful and is already a minus point for conda_auto_install.

jmchilton commented 7 years ago

Thanks all - I don't agree with every nuance - but in large part I agree with most of this. I appreciate yinz taking the time to respond. My goal for the next few days of development was to establish that we can state having an existing container is considered best practice. I'll take that and work on it. Hopefully we will have a process in place by the GCC.

I will however say in defense of (2) as long as it is cached it is no worse for reproducibility than allowing each site to install the binary dependencies locally once - as we do now and have always done. I get that (1) is much better than we've traditionally done so we should do it.

In response to conda_auto_install being problematic - that is just implemented very naively IMO. If that we implemented intelligently at all - it would be a lot less crappy and would be a perfectly fine idea.

mvdbeek commented 7 years ago

In response to conda_auto_install being problematic - that is just implemented very naively IMO. If that we implemented intelligently at all - it would be a lot less crappy and would be a perfectly fine idea.

I agree, I think this is a good idea for certain scenarios. I was just mentioning this as an example for the extra work that would be involved in managing the container lifecycle.

galaxy-iuc / standards

Formalize container best practice (esp. for complex tools) #37