BioContainers / containers

Bioinformatics containers
http://biocontainers.pro
Apache License 2.0

Contributing lighter images? #558

Closed. jfouret closed this issue 6 months ago.

jfouret commented 7 months ago

Hi,

Thank you for the BioContainers initiative. I am a regular user of biocontainers images, and I sometimes find images to be too heavy, so I have taken to rebuilding them in a local private registry.

I would like to contribute to biocontainers by sharing lighter images; however, I am not sure whether this is of interest to the community.

Of note, the large majority of images available in this project are well optimized and light, but some others, mostly for less-used software, are not.

I did not find this topic discussed in other issues or in the documentation (sorry if I missed something).

For example, for the macse software, the image size can be reduced by more than 2-fold:

REPOSITORY                              TAG                        IMAGE ID       CREATED         SIZE
ghcr.io/nexomis/macse                   v2.07.0                    853dde3577de   2 days ago      205MB
public.ecr.aws/biocontainers/macse      2.07--hdfd78af_0           bca6c95909e5   10 months ago   531MB
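
As a sketch of the kind of optimization involved (this is not the actual recipe behind ghcr.io/nexomis/macse; the base image, package choice, and jar path are illustrative assumptions), a Java tool such as macse can often be repackaged on a slim base with only a headless runtime:

FROM debian:stable-slim
# Install only the headless Java runtime, skip recommended packages,
# and clean the apt cache in the same layer
RUN apt-get update \
 && apt-get install -y --no-install-recommends default-jre-headless \
 && rm -rf /var/lib/apt/lists/*
# Assumes the MACSE jar was downloaded next to the Dockerfile beforehand
COPY macse_v2.07.jar /opt/macse.jar
ENTRYPOINT ["java", "-jar", "/opt/macse.jar"]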

This is just an example to help me understand the policy for contributing.

I think it is not a trivial question: if you accept new container images for the same software version where the only change is size optimization, it could increase the number of images to store and thus the costs.

Thanks,

mboudet commented 7 months ago

Hmm. The issue here is that we do not remove 'old' images, so adding lighter ones would increase the storage burden on our side. I guess it would depend on how many images we are talking about?

Nonetheless, if you have a reliable process for reducing image sizes, we could integrate it into the documentation and use it going forward. (I have been meaning to rewrite the docs for some time now...)

jfouret commented 7 months ago

Hi,

In my last job I had a lot of these; now I am rewriting some, and for now I only have this one for macse. Basically, I have made a habit of trying to build a new image whenever its size exceeds 100 MB on biocontainers or elsewhere. I use elastic cloud computing and workflow technologies heavily, so pull time is a concern, since images are not stored locally on ephemeral instances.

When hosting images, the cost of storage matters, but the cost of pulling (sending data out to the internet) can also be significant. I was therefore expecting a criterion like a minimum 2-fold gain or a 50 MB reduction in size (to be adjusted; it's just an example). Such criteria would apply when pushing a new image version whose sole purpose is reducing image size (without a new software version). It also depends on how often the images are pulled.
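
As a sketch of how such a criterion could be checked (the image names reuse the macse example above; the thresholds and the check itself are only illustrative, not an existing biocontainers rule), assuming both images are present locally:

# Accept a size-only rebuild if it saves at least 50 MB or halves the image
old=$(docker image inspect --format '{{.Size}}' public.ecr.aws/biocontainers/macse:2.07--hdfd78af_0)
new=$(docker image inspect --format '{{.Size}}' ghcr.io/nexomis/macse:v2.07.0)
if [ $((old - new)) -ge $((50 * 1024 * 1024)) ] || [ $((new * 2)) -le "$old" ]; then
  echo "size gain sufficient: $(( (old - new) / 1024 / 1024 )) MB saved"
fi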

To underline that pulling costs also matter when sharing images: on ECR, storage costs $0.10 per GB-month and data transfer out (pulls) costs $0.09 per GB (https://aws.amazon.com/ecr/pricing/). Rationalizing those costs might facilitate the long-term maintenance of these container images.
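
As a rough illustration with the macse numbers above (the monthly pull volume is a made-up assumption):

531 MB - 205 MB = 326 MB ≈ 0.33 GB saved per pull
Transfer: 0.33 GB × 1,000 pulls/month × $0.09/GB ≈ $29/month saved
Storage:  0.33 GB × $0.10/GB-month ≈ $0.03/month saved

For frequently pulled images, the transfer savings therefore dominate the storage savings.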

Of note, image size also matters in other contexts (slow internet connections, costly data fees in some countries).

Regarding guidelines, very good and inspiring principles are already applied in more than 95% of the Dockerfiles I have looked at. There are already some points about optimization in the docs (https://biocontainers-edu.readthedocs.io/en/latest/best_practices.html#optimization). Here are some points I would emphasize, and why (a Dockerfile sketch combining several of them follows this list):

  1. Basic container guidelines that are already cited:

    • Use light base images: ubuntu, debian:stable-slim, or alpine.
    • Empty caches and remove temporary files in the same layer (same RUN), and remove software not used at runtime, such as wget (not always done, but the gain in size is marginal).
    • Do not install recommended packages with apt (installing them is the default: cf. https://ubuntu.com/blog/we-reduced-our-docker-images-by-60-with-no-install-recommends).
    • Use multi-stage builds to keep build tools out of the final image.
    • Since the biocontainers base image is "frozen" (outdated compared to typical Docker base images), an option like apt-get --no-upgrade might prevent a lot of package upgrades, though upgrades may already be disabled by default.
    • When using apt-get, pip, or any package installer, pin the package version, at least for end-user tools.
  2. Sharing optimized base layers. To reduce storage and pull costs across the whole biocontainers infrastructure, it is a good idea to maintain a pool of frozen base images. Enforcing this makes all images share the same base layers, sparing both storage and pulls. Around 95% of Dockerfiles already follow this strategy with the biocontainers/biocontainers base image.

  3. Include a base image specific to each runtime used in the community (Java, Perl, Python, R). For scripts or Java jars, it would be a good idea to use a dedicated base image (e.g. biocontainers/biocontainers:jre8-xxxx) and enforce its use in Dockerfiles. For example, openjdk-jre-headless is installed separately in some container images (cf. fastqc), whereas the layers providing this runtime could be shared.

  4. Manage software with many dependencies via a base image. If a piece of software has many dependencies (e.g. softwareX), it might be a good idea to build a base image with all of softwareX's dependencies (image baseX). Then, when adding new versions of softwareX, the same base image can be reused, sharing many layers and saving storage (when updating the softwareX version, it is important not to update the dependencies). This allows incrementing versions at a lower cost.

  5. Ban heavy databases. There is a recurring issue with some bioinformatics software: including databases in the Docker image. While this makes a tool more ready to use, it is problematic in a context where image size matters. An example is prokka, where the databases are included. It would be a good idea to enforce a policy that databases are not included in containers (except for light ones, such as databases of short-read sequencing adapters). Other strategies exist for providing databases to end users, such as this one: https://benlangmead.github.io/aws-indexes/k2.

  6. Limit automatic builds of bioconda packages when images exceed a size threshold. I do not fully understand the whole biocontainers infrastructure, but I am under the impression that many images are built automatically from bioconda. While this makes many tools rapidly available, it has a storage cost and therefore affects the long-term availability of all biocontainers images. Conda recipes are not written with image size in mind at all, although some, such as miniprot, are well optimized and yield reasonably small images.
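
To make points 1 and 2 concrete, here is a minimal multi-stage Dockerfile sketch; the build commands and the libgomp1 runtime dependency are placeholders, and in the biocontainers context the final stage would start from the shared frozen base image rather than debian:stable-slim:

FROM debian:stable-slim AS build
# Build stage: compilers and download tools never reach the final image
RUN apt-get update \
 && apt-get install -y --no-install-recommends build-essential ca-certificates \
 && rm -rf /var/lib/apt/lists/*
COPY . /src
RUN make -C /src && make -C /src install prefix=/opt/tool

FROM debian:stable-slim
# Final stage: runtime dependencies only, installed and cleaned in one layer;
# pin versions (package=version) where reproducibility matters
RUN apt-get update \
 && apt-get install -y --no-install-recommends libgomp1 \
 && rm -rf /var/lib/apt/lists/*
COPY --from=build /opt/tool /opt/tool
ENV PATH=/opt/tool/bin:$PATH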

Hoping these thoughts are welcome,

Best,

mboudet commented 6 months ago

Thanks for your input! This is good advice, and I'll try to take it into account for the next PRs this repo receives. I should probably update the GitHub action to report the image size during the tests, to avoid flying blind.
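
A minimal sketch of such a step (not an existing workflow in this repo; $IMAGE stands for the freshly built tag):

# Print the built image size in MB in the job log
docker image inspect --format '{{.Size}}' "$IMAGE" \
  | awk '{printf "Image size: %.1f MB\n", $1 / 1024 / 1024}'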

Regarding bioconda images, we do not manage those. Their Docker images are automatically built from the conda packages and are managed in https://github.com/bioconda/bioconda-recipes

jfouret commented 6 months ago

OK. Just to be sure before closing:

  1. If there is a bioconda recipe, then there is no need for a Dockerfile here?
  2. In case of a problem with the size of a container coming from bioconda, should one open an issue on the bioconda repository?

More specifically, about the macse image and, I think, many Java-based images (including fastqc): the conda recipe seems to declare a dependency on the whole openjdk package, which includes the dev tools, rather than on the runtime environment alone. As shown below, there is a huge difference between the runtime only (~65 MB) and the whole openjdk (~415 MB), which provides the Java compiler in addition to the Java runtime. However, it seems there is no properly maintained recipe for the Java runtime alone.

java runtime only

(base) root@bef448039206:/# conda install java-1.7.0-openjdk-headless-cos7-x86_64
...SKIPPED...
                                           Total:        64.6 MB

whole openjdk

(base) root@bef448039206:/# conda install openjdk 
...SKIPPED...
                                           Total:       415.0 MB
(base) root@bef448039206:/# ls /opt/conda/bin/ | grep java
java
javac
javadoc
javap

mboudet commented 6 months ago

Yes, this repository is usually used for tools not included in conda.

It might be better to contact the bioconda repo directly, yes; I'm not really aware of their build process. I believe there are different dependencies for the different stages (build & run), but I don't know more.