rspreafico closed this issue 5 years ago
@rspreafico coreutils is nearly 10 MB in size: https://anaconda.org/bioconda/coreutils/files. We chose a standard BusyBox image to keep it as small as possible. Do you run your entire pipeline in the BusyBox container or in the Nextflow container? We could add coreutils as a dependency of the Nextflow container. IMHO this would be more correct, as it seems Nextflow needs GNU coreutils.
Typically each process in a Nextflow pipeline leverages a different tool, thereby starting a distinct container. A Nextflow pipeline is a series of processes, so many different containers, possibly Biocontainers, get used. A Nextflow process runs both the bioinformatic tool and the basic shell commands needed, e.g., to pipe results or to list files; these shell commands are executed from within the tool's container. Many arguments to these basic shell commands, from ls to cut, are not recognized by the BusyBox implementation, breaking pipelines when one attempts to replace one's own Docker images, based on e.g. vanilla Ubuntu, with the corresponding Biocontainer. Nextflow had a specific issue with sleep and date, which has been fixed in an upcoming release, but that is only part of the story. Most of the problem is with pipelines people have already written assuming the availability of standard basic GNU tools along with the bioinformatic tool: those won't work when Biocontainers are used.
So adding coreutils seems a good option, but that requires adding coreutils to the base image, in turn used by all tools. Nextflow is not executed from a container, so there is no need to add coreutils to the Nextflow container, but Nextflow will start many containers that will benefit from having coreutils.
@rspreafico I understand, but this is a rabbit hole we have tried to avoid. What counts as coreutils, and which version? Does the new --sandbox feature in sed count as standard?
Even bash is not a standard anymore :( Ubuntu was using (is it still?) dash as its shell, and OS X has its own versions, incompatible with GNU coreutils. Unfortunately, there seems to be no standard, as usual :(
All this led us to the decision that we did not want to put any dependencies in the containers; tools, pipelines, etc. should define them on their own as needed. The POSIX-compliant features and a plain bash should be available in BusyBox, like pipes or ls, but just that very basic set.
Other pipelines, for example, do this post-processing and metadata extraction outside of the container, and/or use their own metadata extraction container, precisely because there is no fixed set of assumed dependencies.
That said, do you have a list of things that do not work out of the box now? Maybe we can decide this on a case-by-case basis? An additional 10 MB for each container seems pretty heavy to me.
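One way to collect such a list is to probe each invocation inside the container and record which ones fail. The following is only a minimal sketch: the probe helper is hypothetical, and the three invocations are illustrative examples, not a confirmed list of what Nextflow or any given pipeline requires.

```shell
# probe: run a command with its input and output discarded and report
# whether it exited successfully. Hypothetical helper for illustration only.
probe() {
    if "$@" </dev/null >/dev/null 2>&1; then
        printf 'OK    %s\n' "$*"
    else
        printf 'FAIL  %s\n' "$*"
    fi
}

# Example invocations (assumptions; replace with what your pipeline uses).
probe ls -m .
probe sort -V
probe ps -o pid,pcpu,rss
```

Running this script with the container's own shell gives one OK or FAIL line per invocation, which could be pasted into the thread as a starting point.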
Thanks @rspreafico for your feedback!
Thank you @bgruening for your thorough explanation, and I understand your concern. Yes, I could create ad-hoc processes in the several affected pipelines to split bash jobs from a specific tool job. It would come with a time cost that would deter users from using Biocontainers, and it would make pipelines less efficient: Nextflow and similar tools use I/O for passing info through processes, which is far less efficient than piping. So one would want to minimize that. I have several issues with BusyBox's minimal implementation of common tools. For example, ls does not accept the -m option, which I used to easily feed multiple files to tools such as Bowtie2. You are right that there are many derivatives of tools making standards... not so standard. However, many options, such as OS X (or BSD for that matter), are automatically "excluded" from the contest when one deals with Linux-based containers. It seems to me that the GNU tools as captured by coreutils would likely be the overwhelmingly winning implementation if there were a vote, and probably the most reasonable choice. The 10 MB cost in the base image would be offset by availability in all containers, meaning that you'd pay the price only once for a specific (but reusable) layer. However, I'd understand the reasons if this is not a route that the project developers are willing to take.
Hi @rspreafico, thanks for your answer!
> Thank you @bgruening for your thorough explanation, and I understand your concern. Yes, I could create ad-hoc processes in the several affected pipelines to split bash jobs from a specific tool job. It would come with a time cost that would deter users from using Biocontainers, and it would make pipelines less efficient: Nextflow and similar tools use I/O for passing info through processes, which is far less efficient than piping.
True! We need to find a middle ground here. Using containers in such a way is not efficient in general, but you gain isolation and reproducibility. So I guess we need to find a good way that satisfies both.
> So one would want to minimize that. I have several issues with BusyBox's minimal implementation of common tools.
Can we collect these somewhere and address them one by one?
> For example, ls does not accept the -m option, which I used to easily feed multiple files to tools such as Bowtie2.
Is -m the m option from Bowtie?
> However, many options, such as OSX (or BSD for that matter) are automatically "excluded" from the contest when one deals with Linux-based containers. Seems to me that the GNU tools as captured by coreutils would likely be the overwhelmingly winning implementation if there was a vote, and probably the most reasonable choice.
Exactly, this needs to be discussed. I'm not so sure about it, to be honest :( (sed is not in GNU coreutils.) And even if we do this, when do we update these tools? Do we update old containers with the new version of GNU coreutils? If not, we have inconsistent containers. If we do, we have a namespace problem and a maintenance problem to solve.
(What you could do in the meantime is to mount your gnu_coreutils into the container as well, together with your data.)
Hi @bgruening,
a couple of examples of commands I had to rewrite to adapt pipelines to Biocontainers: I had to drop the -m option from ls (handy for adding commas to a list of files) and the -V option from sort (handy for karyotypic rather than lexicographic sorting of chromosomes). There are more examples, but I was able to work around the limitations of BusyBox.
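For illustration, workarounds of this kind can be written with only POSIX shell, awk, and a plain sort. This is a rough sketch under stated assumptions, not the commands actually used in the affected pipelines: join_by_comma and karyo_sort are hypothetical helper names, join_by_comma emits bare commas (GNU ls -m separates entries with a comma and a space), and karyo_sort assumes one chromosome name per line, with no whitespace and an optional chr prefix, so it is not a general replacement for sort -V.

```shell
# join_by_comma: print all arguments on one line, separated by commas.
join_by_comma() {
    first=1
    for f in "$@"; do
        if [ "$first" -eq 1 ]; then
            printf '%s' "$f"
            first=0
        else
            printf ',%s' "$f"
        fi
    done
    printf '\n'
}

# karyo_sort: read chromosome names on stdin and sort numeric names by
# value, followed by non-numeric names (X, Y, M, ...) alphabetically.
# Each line is decorated with a sort key, sorted, then undecorated.
karyo_sort() {
    awk '{
        n = $0
        sub(/^chr/, "", n)
        if (n ~ /^[0-9]+$/)
            printf "0 %09d %s\n", n, $0   # numeric: zero-padded key
        else
            printf "1 %s %s\n", n, $0     # non-numeric: sorts after numbers
    }' | LC_ALL=C sort | awk '{ print $3 }'
}
```

Usage would look like join_by_comma *.fq to build a comma-separated file list, or cut -f1 regions.bed | karyo_sort to order chromosome names (the file names here are placeholders).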
Related to this, and probably more importantly, I am trying to have Biocontainers play nicely with a very popular tool in bioinformatics, Nextflow (link). On the Nextflow end, developers have already made some fixes to allow running Biocontainers (see for example this issue). There remain compatibility issues that seem better addressed on the Biocontainers end than on the Nextflow end (link); specifically, it would suffice to have the procps package available in Biocontainers.
Is this something you would consider? This would allow easy integration between Nextflow and Biocontainers, which would be a powerful combo to set up reproducible pipelines in record time. Both projects are pretty popular, and if they could play well together it would be empowering.
Thank you.
Hi @bgruening @rspreafico
I will second this proposal. Having at least the procps package should be enough, and it will also allow collecting information and metrics from the running tool inside the container. This in turn will be beneficial for performance comparisons across different utilities and workloads in the same isolated container environment, something I think is also in the scope of BioContainers.
Thank you
@rspreafico @fstrozzi I'm not strictly opposed to this, but I would like to keep the containers as minimal as possible. Is there any reason why these metrics should be collected from within the container? You could also access them through the Docker cgroups, as described here: https://docs.docker.com/engine/admin/runmetrics/
Hi @bgruening, thanks for your reply. I relayed it to @pditommaso, Nextflow's lead developer. There is a sister issue thread on Nextflow's GitHub here. Please do feel free to liaise directly with @pditommaso to find common ground for seamlessly integrating these two great projects.
That could be possible; however, it still won't solve the problem for Singularity and Shifter containers, which do not use cgroups at all. How big is the procps package? I would be surprised if it's more than 0.5 MB.
I checked: the package is around 200 kB. @bgruening is this something that you would consider adding to the base image?
@rspreafico can you create a PR for https://github.com/bgruening/docker-busybox-bash? I still think it's better to use cgroups for Docker and put this logic into the workflow manager and not the Docker container.
@bgruening Thanks for being open to this. I do agree that it would be more elegant for the workflow manager to take advantage of the facilities offered by Docker, but I also agree with @pditommaso that the lack of such facilities in Singularity forces us to adopt this solution.
I looked at the base BusyBox image, and I will be happy to create a PR. I noticed it builds on the progrium BusyBox, which has a deprecation warning: "This image will probably soon be deprecated in favor of our even smaller Alpine Linux based image. Alpine is a minimal Linux distro designed with containers in mind, based on Busybox, with a real, modern package system".
The Alpine Linux image they refer to, still based on BusyBox, is here. It looks very actively maintained (commits from 6 days ago, vs 3-4 years), the authors state the image is even smaller than the progrium BusyBox, it's based on the popular Alpine Linux, and it has apk. So should I take the opportunity of this PR to also update to Alpine's BusyBox? I checked that I can install both bash and procps from within the container with apk add --no-cache procps bash, so the Dockerfile should be really simple.
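As a rough idea of how simple, the snippet below writes out one possible Dockerfile. It is a sketch, not the contents of the eventual PR: the alpine:3.6 tag follows the version mentioned in this comment, the apk command is the one tested above, and the CMD line and the image name in the build comment are assumptions.

```shell
# Write a hypothetical Dockerfile for an Alpine-based base image.
cat > Dockerfile <<'EOF'
FROM alpine:3.6
# Install bash and procps without keeping the apk index cache in the layer.
RUN apk add --no-cache procps bash
# Assumption: default to bash, as the current busybox-bash image name suggests.
CMD ["/bin/bash"]
EOF

# To build it (image name is a placeholder):
#   docker build -t busybox-bash-procps .
```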
Finally, would you like to keep the :latest tag for this PR, or should I tag a specific version? The latest Alpine Linux is 3.6.
All Biocontainers ship with non-GNU implementations of basic tools (BusyBox from 2014). Tools such as sleep, date, and ls in these containers do not accept standard arguments and options, breaking pipelines for which Biocontainers could in principle be used as a 1:1 replacement for existing tools. In practice, these non-GNU tools break existing pipelines to an extent that discourages adoption of Biocontainers. For an example, see here:
https://github.com/nextflow-io/nextflow/issues/321
This is one example of many. Would it be possible to ship Biocontainers with a more modern, still lightweight base image?