galaxy-iuc / standards

Documentation for standards and best practices from the Galaxy IUC
http://galaxy-iuc-standards.readthedocs.io/en/latest/

Formalize container best practice (esp. for complex tools) #37

Open jmchilton opened 7 years ago

jmchilton commented 7 years ago

tl;dr - Should it be a best practice to (1) register combinations of requirements for complex tools and publish all needed combinations to a container registry, or (2) should Galaxy just build complex containers as it needs them for such tools?

I think there is probably broad consensus that the "mulled" approach to building containers should be part of a best practice for using containers with Galaxy. From an operations perspective this produces tiny containers that are very easy and quick to deploy and manage; from a reproducibility and support perspective it allows the same (best-practice Conda) binaries to work on bare metal or inside of a container; and from a developer perspective it will ideally become much more transparent than a Dockerfile-based approach.

The follow-up recommendation is less clear in my opinion. We currently have thousands of containers for individual requirements that can be used with tools that work with BioConda and only have a single requirement tag. For tools that contain multiple requirement tags - which I contend are not a corner case but a very mainstream and typical use case - we could recommend two different things as a best practice.
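To make the multi-requirement case concrete, here is a minimal sketch (the tool id, package names, and versions are invented for illustration, not taken from any real wrapper) of a tool with several requirement tags, and of extracting the combination of packages that a single container would have to provide:

```python
# Illustrative only: a made-up tool wrapper with several requirement tags,
# all of which must be satisfied by one container.
import xml.etree.ElementTree as ET

TOOL_XML = """
<tool id="example_caller" name="Example caller" version="0.1.0">
    <requirements>
        <requirement type="package" version="1.9">samtools</requirement>
        <requirement type="package" version="1.9">bcftools</requirement>
        <requirement type="package" version="2.29.2">bedtools</requirement>
    </requirements>
</tool>
"""

def requirements(tool_xml):
    """Return the (name, version) pairs a container for this tool must provide."""
    root = ET.fromstring(tool_xml)
    return [
        (req.text.strip(), req.get("version"))
        for req in root.findall("./requirements/requirement")
        if req.get("type") == "package"
    ]

print(requirements(TOOL_XML))
# [('samtools', '1.9'), ('bcftools', '1.9'), ('bedtools', '2.29.2')]
```

Whichever option we pick, it is this combined list - not any single package - that the container has to satisfy.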

Put another way - should Galaxy (1) fetch the containers it needs or (2) build them?

Pros of (1) are:

Pros of (2) are:

Ping @bgruening, @mvdbeek, @jxtx.

bgruening commented 7 years ago

I actually see both approaches living in parallel. I think we should advertise building these containers upfront as best practice, but if they are not available we build them.
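A rough sketch of that fallback logic, assuming quay.io/biocontainers as the registry and a plain `docker pull` as the availability check (the build step is left as a placeholder; none of this is meant as the actual Galaxy implementation):

```python
# Sketch of "use the published container if it exists, otherwise build it".
# Registry prefix and build command are assumptions for illustration.
import subprocess

REGISTRY_PREFIX = "quay.io/biocontainers"  # assumed; any registry would do

def image_available(image):
    """Return True if the image can be pulled from the registry."""
    result = subprocess.run(
        ["docker", "pull", image],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def resolve_container(image, build_command):
    """Prefer the pre-built image; fall back to building it locally."""
    if image_available(image):
        return image
    subprocess.run(build_command, check=True)  # e.g. a mulled/involucro build step
    return image

# Hypothetical usage:
# resolve_container(f"{REGISTRY_PREFIX}/samtools:1.9",
#                   ["echo", "build the container here"])
```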

Somewhere on my ToDo list is to extend https://github.com/BioContainers/mulled and create a small website to assemble Conda packages and create mixed-mulled containers. The names should be normalised and hashed in a unique way. The aim is to get the same container back from a randomly assembled requirements.txt file (with the same packages).
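Something along these lines, presumably - a small sketch of deriving an order-independent name from a set of package requirements, so that the same packages always map to the same container (the normalisation and hashing details below are illustrative, not the exact scheme mulled uses):

```python
# Sketch of deterministic container naming: the same set of packages, in any
# order, should always map to the same image name.
import hashlib

def mulled_style_name(requirements):
    """requirements: iterable of (package_name, version) tuples."""
    # Normalise: lowercase names and sort, so the ordering in the input
    # requirements file does not change the result.
    normalised = sorted((name.lower(), version) for name, version in requirements)
    spec = ",".join(f"{name}={version}" for name, version in normalised)
    digest = hashlib.sha256(spec.encode("utf-8")).hexdigest()
    return f"mulled-{digest[:32]}"

# The same packages in a different order yield the same name:
a = mulled_style_name([("samtools", "1.9"), ("bcftools", "1.9")])
b = mulled_style_name([("bcftools", "1.9"), ("samtools", "1.9")])
assert a == b
print(a)
```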

We could also think about integrating this into the IUC Travis testing, so that the IUC creates these containers on PR merge.

I think there is a benefit in generating them outside of Galaxy, for the reasons you mentioned, but also because I want to generate more image types, like Singularity images - with this everyone can profit, and in turn we get more care and funding for BioConda.

jxtx commented 7 years ago

> I actually see both approaches living in parallel. I think we should advertise building these containers upfront as best practice, but if they are not available we build them.

Do we register and push a container to an external repository when we build one?

I much prefer option 1 for reproducibility. I can see 2 being important for development, but I wouldn't like to see production Galaxy instances using this approach.

mvdbeek commented 7 years ago

I think we'll need both. We can't be sure that a dependency (in a container) really works (almost) everywhere until we have tested it in a bare-bones container, so ideally the IUC tool tests would build and run (and maybe also push on merge) the container. planemo could have a --local_container option for this.

For production instances we should probably not default to building locally. In addition to the reproducibility problem @jxtx mentioned, I think that on busy sites building many containers at once could kill Docker, in a way that is probably worse than activating many Conda environments in parallel. Of course we could do the container building in a separate job on the cluster nodes, but then you'd have to build it at least once per worker or introduce some smarter logic to distribute the container.

Also, if you build locally you will not know upfront whether the built container will work, so what would you do if it doesn't? Rebuild until it does? That seems wasteful and is already a minus point for conda_auto_install.

jmchilton commented 7 years ago

Thanks all - I don't agree with every nuance, but in large part I agree with most of this. I appreciate yinz taking the time to respond. My goal for the next few days of development is to establish that we can state that having an existing container is considered best practice. I'll take that and work on it. Hopefully we will have a process in place by the GCC.

I will, however, say in defense of (2) that as long as the container is cached, it is no worse for reproducibility than allowing each site to install the binary dependencies locally once - as we do now and have always done. I get that (1) is much better than what we've traditionally done, so we should do it.

In response to conda_auto_install being problematic - it is just implemented very naively IMO. If it were implemented intelligently at all, it would be a lot less crappy and would be a perfectly fine idea.

mvdbeek commented 7 years ago

> In response to conda_auto_install being problematic - it is just implemented very naively IMO. If it were implemented intelligently at all, it would be a lot less crappy and would be a perfectly fine idea.

I agree, I think this is a good idea for certain scenarios. I was just mentioning it as an example of the extra work that would be involved in managing the container lifecycle.