Biocontainer images are HUGE

epruesse commented 5 years ago

I just ran

mulled-build build circexplorer2=2.3.6--py_0 --involucro-path `which involucro` --channels conda-forge,https://53580-42372094-gh.circle-artifacts.com/0/tmp/artifacts/packages,bioconda,defaults

to check how big those containers are. Turns out they are massive. It took several minutes and failed during cleanup with some permission error (couldn't delete libasan), but I did get an image: 480MB in the top layer, 544MB in the build/dist folder it left behind.

Out of 544MB, about 270MB can be freed by:

deleting *.a files
deleting __pycache__ directories
stripping binaries (remove debug symbols)

Of the remaining 270MB, we have

pandas 38MB (unstripped 50)
numpy 15MB (unstripped 20)
scipy 54MB (unstripped 65)
openblas 30 MB
pysam 7MB
Bunch of standard libs, libasan, libtsan, sqlite, libstdc++, libtk, libtcl, tk, tcl, libcrypto, libpython.

So with some sensible choice of base image and some post-build cleanup of the image, the top layers could probably be compacted by ~90% total.
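A rough way to reproduce the first two numbers for any environment is to walk the directory tree and tally what the cleanup would remove. This is a minimal sketch, not part of mulled-build: the function name `reclaimable_bytes` and the category labels are made up for illustration, and it does not estimate what stripping debug symbols would save, since that requires actually running `strip`.

```python
import os


def reclaimable_bytes(root):
    """Tally bytes under root held by static libraries and __pycache__ dirs.

    Hypothetical helper for illustration; the category names are assumptions.
    """
    totals = {"static_libs": 0, "pycache": 0, "other": 0}
    for dirpath, _dirnames, filenames in os.walk(root):
        in_pycache = "__pycache__" in dirpath.split(os.sep)
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # do not count symlink targets twice
            size = os.path.getsize(path)
            if in_pycache:
                totals["pycache"] += size
            elif name.endswith(".a"):
                totals["static_libs"] += size
            else:
                totals["other"] += size
    return totals
```

Calling `reclaimable_bytes("build/dist")` on the folder left behind by the build would give a per-category breakdown to compare against the figures above.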

mbargull commented 5 years ago

Yeah, I've been meaning to ponder about this for a long time. My take thus far is:

deleting *.a files

Yes! Pretty sure almost no application will need them. In the unlikely case that one does, we should offer an escape hatch to only selectively remove those files.

deleting __pycache__ directories

No, we shouldn't do that. By doing that you can easily add a second or more of startup time to a container. Yes, the .pyc compilation vs. downloading a larger container image could cancel each other out, but the latter is a one-time cost whereas compilation would happen every time you run a Python application from a fresh container of the same image.
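To put numbers on that trade-off for a given package, one can time the byte-compilation of its modules and compare it with the size of the resulting .pyc files. The sketch below uses only the standard-library `py_compile` module; the helper name `compile_cost` is hypothetical, and a real measurement would sum this over every module an application imports at startup.

```python
import os
import py_compile
import tempfile
import time


def compile_cost(source_path):
    """Byte-compile one file; return (seconds taken, size of the .pyc in bytes)."""
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "module.pyc")
        start = time.perf_counter()
        # doraise=True turns syntax errors into exceptions instead of log output
        py_compile.compile(source_path, cfile=target, doraise=True)
        elapsed = time.perf_counter() - start
        return elapsed, os.path.getsize(target)


if __name__ == "__main__":
    seconds, size = compile_cost(os.__file__)  # the stdlib's own os.py as a sample
    print(f"compiled in {seconds:.4f}s, .pyc is {size} bytes")
```

The seconds are paid on every fresh container start if `__pycache__` is deleted, while the bytes are paid once per image download.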

stripping binaries (remove debug symbols)

Yes! In most cases, the packaged applications are expected to run without crashing and thus without the need for debugging symbols. However, it would still be good to somehow offer debuggable applications. What I have in mind is to offer smaller prebuilt containers for general use, but make "debug versions" available on demand. Here, the "debug versions" don't have to be provided in the same convenient manner as the main, smaller ones. Meaning, for each container we could offer build scripts (= Dockerfiles, likely, and pinned-down list of packages/conda version etc. to make them reproducible) that can skip the stripping and other steps.

[...] pandas [...] numpy [...] scipy [...] openblas [...] pysam [...] Bunch of standard libs [...]

So with some sensible choice of base image

Ha! And that's the tricky one! Given how container layers work, this is not trivial at all. If you have ideas on that, I'd be interested to hear them! Otherwise we could look out for work others have done on this. The last thing I stumbled upon at the time was https://grahamc.com/blog/nix-and-layered-docker-images, which is about the same problem but with Nix. I assume it's a bit messier with Conda since Nix has more well-defined inputs and outputs... I haven't been able to dig into this yet, though (and probably won't any time soon, either :/ ).

bgruening commented 5 years ago

deleting *.a files

Yes. I think conda-forge has discussed this as well.

deleting __pycache__ directories

I'm also on the fence here.

stripping binaries (remove debug symbols)

We could make this optional by adding an extra flag, couldn't we? And maybe enable it by default?
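As a sketch of what such an opt-out flag could look like: the tool name `image-cleanup` and both options below are hypothetical and are not existing mulled-build flags. `argparse.BooleanOptionalAction` needs Python 3.9 or later and generates the matching `--no-...` variants automatically.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical post-build cleanup step."""
    parser = argparse.ArgumentParser(prog="image-cleanup")  # made-up tool name
    parser.add_argument(
        "--strip",
        action=argparse.BooleanOptionalAction,
        default=True,  # enabled by default, disable with --no-strip
        help="strip debug symbols from binaries",
    )
    parser.add_argument(
        "--remove-static-libs",
        action=argparse.BooleanOptionalAction,
        default=True,  # escape hatch: --no-remove-static-libs keeps *.a files
        help="delete *.a files",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["--no-strip"])
    print(args.strip, args.remove_static_libs)
```

With defaults set to on, the common case gets the small image and a debug build only has to pass `--no-strip`.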

I have a more general question. Why are we separating containers and packages here? I think it has its own value to say that the conda package is actually what is in this container and not shipping two different things. In the same way, I think it has a lot of value that we try to keep the conda env small. So I would discourage letting both worlds diverge too much.

So with some sensible choice of base image

I think this will open a new can of worms. You would need different base images for different languages, and we would need to update them very often, probably with every pinning update. We have chosen a very minimal image exactly to avoid this. Another point is that we do not have conda in those images, so adding a second layer without conda will be tricky. And then we are back to the problem that the conda env and the container might be too different.

mbargull commented 5 years ago

We could make this optional by adding an extra flag, couldn't we? And maybe enable it by default?

It's a recurring topic on which people have divergent opinions. Some people from Anaconda and conda-forge raised concerns about reduced debuggability and such. Some discussions are at https://github.com/conda-forge/conda-forge.github.io/issues/520 and https://github.com/conda-forge/conda-forge.github.io/issues/544. I'd be in favor of splitting packages, offering debug versions and such. However, this concerns not only packages from bioconda but also conda-forge (defaults), and thus would be good to coordinate across the whole community. (If we had those split/debug packages, some technicalities/usability questions in conda itself should also be considered, i.e., what is the best/most consistent/convenient way to say "give me the equivalent of conda create -nenv pkgs ... but choose debug variants of all packages" [spoiler alert: features: debug is not what we want].)

Why are we separating containers and packages here?

Just because with containers -- as opposed to Conda packages themselves (and also environments, due to potentially shared inodes etc.) -- we have to store/download their dependencies exceedingly redundantly.
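A back-of-the-envelope illustration of that redundancy, using the stripped package sizes reported earlier in this thread. The three tools and their dependency sets are invented for the example, and the "shared" figure stands in for a package cache in which each package is stored once.

```python
# Stripped sizes in MB as reported earlier in the thread.
SIZES_MB = {"pandas": 38, "numpy": 15, "scipy": 54, "openblas": 30, "pysam": 7}

# Hypothetical tools and dependency sets, for illustration only.
ENVS = {
    "tool_a": {"numpy", "openblas", "pysam"},
    "tool_b": {"numpy", "openblas", "scipy"},
    "tool_c": {"numpy", "openblas", "pandas"},
}


def storage_mb(envs, sizes, shared):
    """Total MB when packages are stored once (shared) or once per env."""
    if shared:
        unique = set().union(*envs.values())
        return sum(sizes[pkg] for pkg in unique)
    return sum(sizes[pkg] for pkgs in envs.values() for pkg in pkgs)


print(storage_mb(ENVS, SIZES_MB, shared=False))  # 234: one full copy per container
print(storage_mb(ENVS, SIZES_MB, shared=True))   # 144: each package stored once
```

The gap grows with every additional container that pulls in the same numpy/openblas stack.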

it has its own value [...] not shipping two different things. [...] So I would discourage letting both worlds diverge too much.

Fair point! If storage and such is no big concern for us or the users, then I agree, consistency is desirable of course.

I think this will open a new can of worms.

Can't argue with that :wink:.

epruesse commented 5 years ago

FYI: particularly large example is antismash

epruesse commented 5 years ago

W.r.t. stripping and .a and so on:

Personally, I don't see why we should not follow the approach all normal Linux distros follow: Everything needed only for building software (.a, .h, ...) goes into a -dev or -devel package (Debian/Red Hat, respectively). Everything needed only to debug a problem goes into a -dbg package. Neither has a place in the containers. If you need to debug or build, go and install the packages. The container is for reproducible production use, and any building of software will require combining packages in a way that docker layering alone cannot do in any case.

Barring -dev, -dbg, and -doc packages that separate out things which are just bloat for many use cases but too valuable to simply toss, I'd say that at least in the containers we can drop those things.

No-one will try to build software in one of those containers. They lack gcc. If you start installing things, you might as well reinstall the package. So no need for .a and .h files.
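The distro-style split could be approximated by sorting a package's file list into buckets. The rules below are assumptions chosen for illustration (real Debian and Red Hat packaging policies are more detailed), and `classify` is a hypothetical helper, not something conda or mulled-build provides.

```python
def classify(path):
    """Assign a package-relative POSIX path to runtime, dev, dbg, or doc.

    The matching rules are simplified assumptions, not any distro's policy.
    """
    parts = path.split("/")
    dirs, name = parts[:-1], parts[-1]
    if name.endswith((".a", ".h", ".hpp", ".pc")) or "include" in dirs:
        return "dev"  # needed only to build against the package
    if name.endswith(".debug"):
        return "dbg"  # detached debug symbols
    if "man" in dirs or "doc" in dirs:
        return "doc"
    return "runtime"


files = ["lib/libz.a", "include/zlib.h", "lib/libz.so.1", "share/man/man3/zlib.3"]
for f in files:
    print(f, "->", classify(f))
```

Only the files landing in the `runtime` bucket would then need to go into the container.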

mbargull commented 5 years ago

FYI: particularly large example is antismash

Particularly large, yes, but not very representative since it's one of those "special cases" where the bulk of its size is due to multi-gigabyte database downloads in the post-link script. The usual "dead freight" of static libraries, debugging symbols, source files, etc., is only a smaller part for that container. I'm not saying it necessarily makes sense to put that data into the container, of course. It probably makes sense for that particular package to be split into:

```yaml
package:
  name: antismash  # alt: antismash-whatever-all-the-things
...
requirements:
  run:
    - antismash-core version  # or something other/better than "core" #alt: antismash
    - antismash-data version
...
```

```yaml
package:
  name: antismash-core  # alt: antismash
...
requirements:
  host:
    - ...
  run:
    - ...
...
```

```yaml
package:
  name: antismash-data
...
requirements:
  host:
    - antismash version
  run: []
...
# + post-link script
```

(Assuming their database download only adds files and does not modify/clobber existing files.)

(The `-data` package could then optionally be versioned differently, i.e., less often updated if the data changes more infrequently.)

epruesse commented 5 years ago

At least it does the download in post-install, not on the fly (creating a mess on a cluster trying to run it multiple times on first run... quast does that IIRC).

bioconda / bioconda-utils

Biocontainer images are HUGE #511