BioContainers / mulled

Mulled - Automatized Containerized Software Repository
Make BioContainers Nextflow compatible #148

Closed rspreafico closed 5 years ago

rspreafico commented 7 years ago

All Biocontainers ship with non-GNU implementations of basic tools (BusyBox from 2014). Tools such as sleep, date, and ls in these containers do not accept the standard GNU arguments and options, which breaks pipelines in which Biocontainers could in principle serve as a 1:1 replacement for existing tool images. In practice, the breakage is severe enough to discourage adoption of Biocontainers. For an example, see here:

https://github.com/nextflow-io/nextflow/issues/321

This is one example of many. Would it be possible to ship Biocontainers with a more modern, still lightweight base image?

bgruening commented 7 years ago

@rspreafico coreutils is nearly 10MB in size: https://anaconda.org/bioconda/coreutils/files. We chose a standard busybox image to keep the containers as small as possible. Do you run your entire pipeline in the busybox container or in the nextflow container? We could add coreutils as a dependency of the nextflow container. Imho this would be more correct, as it seems nextflow needs GNU coreutils.

rspreafico commented 7 years ago

Typically each process in a Nextflow pipeline leverages a different tool, thereby starting a distinct container. A Nextflow pipeline is a series of processes, so many different containers, possibly Biocontainers, get used. A Nextflow process runs both the bioinformatic tool and the basic shell commands needed, e.g., to pipe results or to list files, and these shell commands are executed from within the tool's container. Many arguments to these basic shell commands, from ls to cut, are not recognized by the BusyBox implementation, which breaks pipelines when one attempts to replace one's own Docker images (based on, e.g., vanilla Ubuntu) with the corresponding Biocontainer. Nextflow had a specific issue with sleep and date, which has been fixed in an upcoming release, but that is only part of the story. Most of the problem lies with pipelines that people have already written assuming that standard GNU tools are available alongside the bioinformatic tool: those won't work when Biocontainers are used.

So adding coreutils seems like a good option, but it would have to be added to the base image, which in turn is used by all tools. Nextflow is not executed from a container, so there is no need to add coreutils to the Nextflow container, but Nextflow will start many containers that would benefit from having coreutils.

bgruening commented 7 years ago

@rspreafico I understand, but this is a rabbit hole we have tried to avoid. What counts as coreutils, and which version? Does the new --sandbox feature in sed count as standard? Even bash is not a standard anymore :( Ubuntu was using (is it still?) dash as its shell, and OS-X has its own versions, incompatible with GNU coreutils. Unfortunately, there seems to be no standard, as usual :( All this led us to the decision that we did not want to put any dependencies in the containers, and that tools, pipelines etc. should define them on their own as needed. The POSIX-compliant features and a pure bash should be available in busybox, like pipes or ls, but just that very basic set.

Other pipelines, for example, do this post-processing and metadata extraction outside of the container and/or use a dedicated metadata extraction container, precisely because there is no fixed set of assumed dependencies.

That said, do you have a list of things that do not work out of the box now? Maybe we can decide this on a case-by-case basis? An additional 10MB for each container seems pretty heavy to me.

Thanks @rspreafico for your feedback!

rspreafico commented 7 years ago

Thank you @bgruening for your thorough explanation, and I understand your concern. Yes, I could create ad-hoc processes in the several affected pipelines to split bash jobs from a specific tool job. It would come with a time cost that would deter users from using Biocontainers, and it would make pipelines less efficient: Nextflow and similar tools use I/O for passing info through processes, which is far less efficient than piping. So one would want to minimize that. I have several issues with BusyBox's minimal implementation of common tools. For example, ls does not accept the -m option, which I used to easily feed multiple files to tools such as Bowtie2. You are right that there are many derivatives of tools, making standards... not so standard. However, many options, such as OSX (or BSD for that matter), are automatically "excluded" from the contest when one deals with Linux-based containers. It seems to me that the GNU tools as captured by coreutils would likely be the overwhelmingly winning implementation if there were a vote, and probably the most reasonable choice. The 10 MB cost in the base image would be offset by availability in all containers, meaning that you'd pay the price only once for a specific (but reusable) layer. However, I'd understand the reasons if this is not a route that the project developers are willing to take.

bgruening commented 7 years ago

Hi @rspreafico, thanks for your answer!

> Thank you @bgruening for your thorough explanation, and I understand your concern. Yes, I could create ad-hoc processes in the several affected pipelines to split bash jobs from a specific tool job. It would come with a time cost that would deter users from using Biocontainers, and it would make pipelines less efficient: Nextflow and similar tools use I/O for passing info through processes, which is far less efficient than piping.

True! We need to find a middle ground here. Using containers in such a way is not efficient in general, but you gain isolation and reproducibility. So I guess we need to find a good way that satisfies both.

> So one would want to minimize that. I have several issues with BusyBox's minimal implementation of common tools.

Can we collect these somewhere and address them one by one?

> For example, ls does not accept the -m option, which I used to easily feed multiple files to tools such as Bowtie2.

-m? Do you mean the -m option from bowtie?

> However, many options, such as OSX (or BSD for that matter) are automatically "excluded" from the contest when one deals with Linux-based containers. Seems to me that the GNU tools as captured by coreutils would likely be the overwhelmingly winning implementation if there was a vote, and probably the most reasonable choice.

Exactly, this needs to be discussed. I'm not so sure about it, to be honest :( (sed is not in gnu-coreutils, for example). And even if we do this, when do we update these tools? Do we update old containers with this new version of gnu_coreutils? If not, we have inconsistent containers. If we do, we have a namespace problem and a maintenance problem to solve.

(What you could do in the meantime is to mount your gnu_coreutils into the container as well, together with your data.)

rspreafico commented 6 years ago

Hi @bgruening,

a couple of examples of commands I had to rewrite to adapt pipelines to Biocontainers: I had to drop the -m option from ls (handy for adding commas to a list of files) and the -V option from sort (handy for a karyotypic rather than lexicographic sort of chromosomes). There are more examples, but I was able to work around the limitations of BusyBox.
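For readers facing the same two rewrites, here is a minimal sketch of replacements that avoid `ls -m` and `sort -V`. The function names `join_by_comma` and `karyotypic_sort` are hypothetical and do not come from any of the pipelines discussed here. The sketch assumes that the container's `awk` and `sort` support the options used (`sort -k` with the `n` modifier may depend on how BusyBox was built). Note that GNU `ls -m` separates entries with a comma followed by a space, whereas this version emits a bare comma.

```shell
#!/bin/sh
# Join all arguments with commas using only POSIX shell features.
# Stand-in for building a comma-separated file list without `ls -m`.
join_by_comma() {
    # "$*" joins the positional parameters with the first character of IFS.
    # The subshell keeps the IFS change local to this function.
    ( IFS=,; printf '%s\n' "$*" )
}

# Sort chromosome names (one per line on stdin) karyotypically without
# `sort -V`: numbered chromosomes first in numeric order, then the rest
# (for example chrX, chrY, chrM) in lexicographic order.
karyotypic_sort() {
    # Decorate each name with two sort keys, sort on them, then strip the keys.
    awk '{
        n = $0
        sub(/^chr/, "", n)
        if (n ~ /^[0-9]+$/) print 0, n, $0
        else                print 1, 0, $0
    }' |
        sort -k1,1n -k2,2n -k3,3 |
        awk '{ print $3 }'
}
```

A call such as `join_by_comma *.fq` expands the glob in the shell itself, so it does not depend on any `ls` option. The decorate-and-sort approach assumes chromosome names contain no whitespace.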

Related to this, and probably more importantly, I am trying to have Biocontainers play nicely with a very popular tool in bioinformatics, Nextflow (link). On the Nextflow end, developers have already made some fixes to allow running Biocontainers (see for example this issue). There remain compatibility issues that seem better addressed on the Biocontainers end than on the Nextflow end (link); specifically, it would suffice to have the procps package available in Biocontainers.

Is this something you would consider? This would allow easy integration between Nextflow and Biocontainers, which would be a powerful combo to set up reproducible pipelines in record time. Both projects are pretty popular, and if they could play well together it would be empowering.

Thank you.

fstrozzi commented 6 years ago

Hi @bgruening @rspreafico, I will second this proposal. Having at least the procps package should be enough, and it will also allow collecting information and metrics from the running tool inside the container. This will in turn be beneficial for performance comparisons across different utilities and workloads in the same isolated container environment, which I think is also within the scope of BioContainers.

Thank you

bgruening commented 6 years ago

@rspreafico @fstrozzi I'm not strictly opposed to this, but I would like to keep the containers as minimal as possible. Is there any reason why these metrics should be collected from within the container? You could also access them through the Docker cgroups, as described here: https://docs.docker.com/engine/admin/runmetrics/
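As a rough illustration of collecting a metric from the host rather than from inside the container, the sketch below reads a container's resident memory from its cgroup pseudo-files. It assumes cgroup v1 with the cgroupfs driver, where memory statistics live under `/sys/fs/cgroup/memory/docker/<long id>/`; other hosts (systemd driver, cgroup v2) use a different layout. The function names are hypothetical.

```shell
#!/bin/sh
# Print the path of the memory statistics pseudo-file for a container.
# Assumption: cgroup v1 with the cgroupfs driver; the layout differs elsewhere.
cgroup_v1_memory_stat_path() {
    # $1: full (long) container ID
    printf '/sys/fs/cgroup/memory/docker/%s/memory.stat\n' "$1"
}

# Print the resident set size, in bytes, that the kernel reports for a container.
container_rss_bytes() {
    # $1: container name or short ID; docker inspect resolves it to the long ID
    id=$(docker inspect --format '{{.Id}}' "$1") || return 1
    awk '$1 == "rss" { print $2 }' "$(cgroup_v1_memory_stat_path "$id")"
}
```

This runs on the Docker host, so nothing has to be installed in the container image.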

rspreafico commented 6 years ago

Hi @bgruening, thanks for your reply. I relayed it to @pditommaso, Nextflow's lead developer. There is a sister issue thread going on in Nextflow's GitHub here. Please do feel free to liaise directly with @pditommaso to find common ground for seamlessly integrating these two great projects.

pditommaso commented 6 years ago

That could be possible; however, it still won't solve the problem for Singularity and Shifter containers, which do not use cgroups at all. How big is the procps package? I would be surprised if it's more than 0.5 MB.

rspreafico commented 6 years ago

I checked: the package is around 200 kB. @bgruening is this something that you would consider adding to the base image?

bgruening commented 6 years ago

@rspreafico can you create a PR for https://github.com/bgruening/docker-busybox-bash? I still think it's better to use cgroups for Docker and to put this logic into the workflow manager rather than the Docker container.

rspreafico commented 6 years ago

@bgruening Thanks for being open to this. I do agree that it would be more elegant for the workflow manager to take advantage of the facilities offered by Docker, but I also agree with @pditommaso that the lack of such facilities in Singularity forces us to adopt this solution.

I looked at the base BusyBox image and will be happy to create a PR. I noticed it builds on the progrium BusyBox, which has a deprecation warning: "This image will probably soon be deprecated in favor of our even smaller Alpine Linux based image. Alpine is a minimal Linux distro designed with containers in mind, based on Busybox, with a real, modern package system".

The Alpine Linux image they refer to, still based on BusyBox, is here. It looks very actively maintained (commits from 6 days ago, vs 3-4 years ago), the authors state the image is even smaller than the progrium BusyBox, it's based on the popular Alpine Linux, and it has apk. So should I take this PR as an opportunity to also update to Alpine's BusyBox? I checked that I can install both bash and procps from within the container with apk add --no-cache procps bash, so the Dockerfile should be really simple.

Finally, would you like to keep the :latest tag for this PR, or should I tag a specific version? The latest Alpine Linux is 3.6.
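To make the proposal concrete, here is a minimal sketch of what such a Dockerfile could look like, written out by a small shell helper. The `apk add --no-cache procps bash` command and the 3.6 release are taken from the comment above. The base image name `alpine` is an assumption (the thread does not name the image it links to), and both the helper name `write_base_dockerfile` and the image tag in the usage comment are placeholders.

```shell
#!/bin/sh
# Write a minimal Dockerfile for an Alpine-based base image with bash and procps.
write_base_dockerfile() {
    # $1: target directory
    # $2: base image reference (assumption: defaults to alpine:3.6)
    dir=$1
    image=${2:-alpine:3.6}
    mkdir -p "$dir" || return 1
    cat > "$dir/Dockerfile" <<EOF
FROM $image
RUN apk add --no-cache procps bash
EOF
}

# Example usage (requires Docker; the image tag is a placeholder):
#   write_base_dockerfile ./base
#   docker build -t example/busybox-bash ./base
```

Pinning a specific release instead of `:latest` keeps rebuilds of the base image reproducible, at the cost of having to bump the tag by hand.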