jakirkham opened this issue 6 years ago
In general, I think this whole thing is mainly applicable to packages producing binary executables and libraries. Dynamically compiled or interpreted software doesn't fit the model very well (until we can ship the `.pyc` without the `.py`).
The separation is usually done primarily by "use case". Do I want to install the package as a User (I want to run the things provided), as a Package (I need to link to the things provided), as a Maintainer (I want to build packages), as a Tester (I want to debug the package), or as a Developer (I want to write dependent code)? The typical split by install path looks like this (see the recipe sketch below):

- `-dev`: `$PREFIX/include/`, `$PREFIX/lib/**.{a,la}`
- `-doc`: `$PREFIX/share/doc/`, `$PREFIX/share/man/`
- `-dbg`: `$PREFIX/lib/debug/`
- main package: everything else under `$PREFIX/`

To reduce storage needs for distribution, this one is also common:

- `-data`: `$PREFIX/share/`

For packages containing major libraries:

- `libXXX`: `$PREFIX/lib/**.{so,dylib}`, `$PREFIX/lib/**.so.NNN`, `$PREFIX/lib/**.NNN.dylib`
- `XXX`: `$PREFIX/bin`
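A minimal, untested sketch of how these groupings could map onto `conda-build` 3 multi-output recipes, using a hypothetical package `foo` (all names and globs are illustrative; source and build sections are omitted):

```yaml
package:
  name: foo-split
  version: 1.0.0

outputs:
  - name: libfoo               # runtime: shared libraries only
    files:
      - lib/libfoo*.so*        # [linux]
      - lib/libfoo*.dylib      # [osx]
  - name: foo-dev              # development: headers plus static/libtool archives
    files:
      - include/
      - lib/*.a
      - lib/*.la
    requirements:
      run:
        - {{ pin_subpackage('libfoo', exact=True) }}
  - name: foo-doc              # documentation and man pages
    files:
      - share/doc/
      - share/man/
  - name: foo-dbg              # separated debug information
    files:
      - lib/debug/
  - name: foo                  # user-facing tools and everything else
    files:
      - bin/
    requirements:
      run:
        - {{ pin_subpackage('libfoo', exact=True) }}
```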
These are pretty much what the mainstream distros commonly use. Given their level of experience, I think it's a good start to just go with them. There are plenty of reasons; I'll make a case if you want. A good place to start might be the `debmake` manual, as that is the build tool for much of the `dpkg`-based distros.
One additional split case is common if part of a package's functionality requires a large set of dependencies to be installed, e.g. in the case of optional GUI features.
With regards to automation from `conda-smithy`: the more, the better. I don't know how easy this is to implement, as much of it is in the hands of `conda-build`, and exceptions would have to be accommodated somehow. Technically, if the package installs its files according to the FHS, the `-dev`, `-doc` and `-dbg` parts are easy enough to auto-extract, leaving the rest for the primary package. The `-dbg` part in particular should be extracted fully automatically, as it's too difficult to get right manually and too easy to automate. Some of the build/host/run dependencies could also be derived, I believe. I'm not very familiar with the new `host:` concept yet, so it's hard to say how that plays into it all. Generally, `XXX-dev` needs `libXXX` and all other `*-dev` packages from the build requirements, while `XXX` needs `libXXX` (if present) and the `lib*` packages from build.
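To make that dependency rule of thumb concrete, here is a rough, untested sketch for a hypothetical package `xxx` built against an equally hypothetical `libbar`/`libbar-dev` pair:

```yaml
package:
  name: xxx-split
  version: 1.0.0

outputs:
  - name: libxxx
    requirements:
      run:
        - libbar                                      # the lib* packages from the build requirements
  - name: xxx-dev
    requirements:
      run:
        - {{ pin_subpackage('libxxx', exact=True) }}  # XXX-dev needs libXXX ...
        - libbar-dev                                  # ... and the other *-dev packages from build
  - name: xxx
    requirements:
      run:
        - {{ pin_subpackage('libxxx', exact=True) }}  # XXX needs libXXX if present
```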
Maybe less common, but `-test` could also be added to that list.
Another aspect is that parts like `-data` and `-doc` could potentially be made `noarch`. But if the main package isn't `noarch` itself, this would need some special handling in the build matrix to make sure they are only built once.
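For illustration, a `noarch` documentation output next to an arch-specific main package could be declared roughly like this (hypothetical names, untested; the build-matrix deduplication mentioned above is not addressed here):

```yaml
package:
  name: foo-split
  version: 1.0.0

outputs:
  - name: foo                  # arch-specific main package
    files:
      - bin/
      - lib/
  - name: foo-doc              # architecture-independent documentation
    build:
      noarch: generic
    files:
      - share/doc/
      - share/man/
```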
In general, I am pretty against splitting up packages unless this is what the upstream projects themselves do. It creates a weird situation where users don't get the software that they ask for.
That said there may be a few core packages that would benefit from being split up (such as the runtime libraries vs compilers).
Alternatively, with a high degree of automation, packages could be split up and a metapackage created that has all of the subpackages as dependencies. I don't know how easy or hard this would be to do, but it still seems like the wrong path in general.
I have to say that I'm on the fence on this one. From a packager's point of view, splitting makes things easier and lighter. From a user's perspective, 80% of the time I want the whole piece of software and don't want to worry about whether I need the `-dev`, `-headers`, `-bin`, etc.
The metapackage way adds a ton of work but would satisfy both worlds.
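Sketched out, the metapackage approach could look something like this (hypothetical names, untested): the split pieces carry the files, and an empty `foo` output just depends on all of them, so `conda install foo` still gives users the whole thing.

```yaml
package:
  name: foo-split
  version: 1.0.0

outputs:
  - name: libfoo
    files:
      - lib/libfoo*.so*
  - name: foo-dev
    files:
      - include/
  - name: foo-doc
    files:
      - share/doc/
  - name: foo                  # metapackage: no files of its own, only dependencies
    requirements:
      run:
        - {{ pin_subpackage('libfoo', exact=True) }}
        - {{ pin_subpackage('foo-dev', exact=True) }}
        - {{ pin_subpackage('foo-doc', exact=True) }}
```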
Note, I understand the desire for lean containers, which is why I am not categorically against splitting up packages. However, I think the right balance is to limit the splitting to only those packages where it makes sense, and not split all packages always (as some lesser package managers do).
Splitting off test data is a pretty important use case. I'm not really in favor of always splitting everything, but where it makes a big space difference for splitting things up, it is pretty essential. Runtime libraries follow this same guideline - postgres/libpq; gcc,gxx,gfortran/libgcc,libstdcxx,libgfortran
The flipside of this is that we've been combining several recipes in defaults into single recipes that utilize split packages, where different packages use the same source or versions are otherwise tied. For example, I just merged gstreamer and gst_plugins_base. I think that's not entirely related to this issue, but in case it is, there you go.
Ok, so arguing the case for splitting packages.
Scope limitations:
Let's have a look at e.g. `qt`:
| name | pkg size | disk size | files |
| --- | --- | --- | --- |
| qt5 | 45MB | 180MB | 6372 |
| libqt5 | 35MB | 112MB | 821 |
| libqt5-dev | 5MB | 50MB | 5463 |
| libqt5-doc | .5MB | 2MB | 120 |
There is no `-dbg` package listed because the binaries were already stripped. So here we'd not save space, but we'd add the feature that meaningful stack traces can be made available when needed. In my experience, the debug information can easily be 80% of a library's size.

This is typical; I've seen it with many packages. And while the ~30% reduction in installed size may not seem like much, it does add up. Also, much like the classic novice web-dev mistake ("What? Not everyone has a 0.01 ms round trip because they work on the same box the web server is running on? 200 ms for an API call? Oh...."), we have to be aware that not everyone works on a powerhouse developer machine with a fancy SSD that can crank out 300,000 IO/s. More realistic is a scientist on a cluster head node, with their home directory backed by a poorly managed and completely overloaded NFS server that'll serve them 100 IO/s on a quiet day. Creating 6000 files can take a minute!
Now multiply that by the 50 packages a complex environment needs, and it's not even done after the coffee break (personal experience).
Also, the sort of compromise illustrated by the already-stripped libraries is another point in favor of split packages. As a package maintainer, I have to make a careful decision on which package features I build. With splitting, I can enable many more features without fear of bloating the package for everyone. I could build the Qt database interface layers (currently not built) and separate them into packages, each of which depends on the respective database's client API library, etc.
For Python (or interpreted languages) we can split the dependencies without a dramatic increase in the complexity of the recipe. I think adding `run_constrained` to the core package and a few outputs that require the optional dependencies is good enough, though it definitely duplicates the work done for PyPI.
See the example for Dask I pulled together a while back: https://github.com/hmaarrfk/dask-feedstock/pull/1
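As a rough illustration of that pattern (loosely inspired by the linked Dask PR, but with made-up names and pins rather than the actual recipe contents):

```yaml
package:
  name: somepkg-split
  version: 1.0.0

outputs:
  - name: somepkg-core          # minimal install: hard dependencies only
    build:
      noarch: python
    requirements:
      run:
        - python >=3.6
      run_constrained:
        # optional extras are not pulled in, but if installed they must be compatible
        - bokeh >=2.0
  - name: somepkg               # full install with the optional extras
    build:
      noarch: python
    requirements:
      run:
        - {{ pin_subpackage('somepkg-core', exact=True) }}
        - bokeh >=2.0
        - pandas
```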
Even on my lab machines, with NVMe SSDs and wired internet connections, running `conda update --all` takes a long time to finish that last step.
Giving this issue a bump again. We have recently seen quite a few pull requests (and a blog post) that deal with reducing the size of conda packages, e.g. https://github.com/conda-forge/hdf5-feedstock/pull/112 and https://github.com/conda-forge/mlflow-feedstock/pull/9. Another typical concern is that the `boost-cpp` headers are quite large; there it would really make sense to split them off and only install them during the build, but not in run environments.
As we are currently doing this in a decentralised effort, I fear that we will end up with similar but slightly different (and thus harder to maintain) approaches. Would there be support for writing a CFEP to standardise this and provide guidance for implementation?
SGTM. Would you be interested in championing such a CFEP?
Championing, meaning writing? Yes, it will take a bit, but I can do that.
Yes, that would be incredibly helpful if you were able to take that on. 👍
Maybe we should add this item to our next conda-forge meeting as well?
@jakirkham When are these meetings and are they public?
16:30 GMT every other Thursday. There's been a public calendar link posted in a recent PR, though it may not be fully accurate. They are public.
My general view (not particularly strong) leans more towards the inverse of this:
We should provide a recommended way to strip the unneeded bits when doing things like building AMIs and containers.
@mariusvniekerk I'm doing this at the moment, but that only works well for the final docker images. I run into issues when we later do consistency checks on these environments (like https://github.com/conda/conda-pack/issues/85), and the whole build process for a docker container also takes quite a while, as we download a lot of unnecessary data. Simply trimming this down (and maybe adding parallel downloading to conda) would already save me a lot of time when iterating on the containers.
In practice I think the biggest issue for users is static libraries showing up in deployment environments. Splitting those out seems pretty reasonable and doesn't lead to a large number of splits.
Instead of one mega CFEP that will never be finished, here's a CFEP just for static libraries: https://github.com/conda-forge/cfep/pull/34
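For reference, the kind of split proposed there could look roughly like this in a recipe (hypothetical `libfoo` naming and layout; the exact conventions are what the CFEP is meant to settle):

```yaml
package:
  name: libfoo-split
  version: 1.0.0

outputs:
  - name: libfoo               # shared library only; what deployment environments need
    files:
      - lib/libfoo*.so*        # [linux]
      - lib/libfoo*.dylib      # [osx]
  - name: libfoo-static        # static archives, split out for the few builds that need them
    files:
      - lib/libfoo*.a
    requirements:
      run:
        - {{ pin_subpackage('libfoo', exact=True) }}
```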
In `conda-build` 3 it is possible to have multiple package outputs. This can be useful when a build requires compiling multiple pieces of source together, but the result is better distributed as separate packages after the fact (e.g. LLVM, gcc, APR, etc.).
However another use case is distributing multiple components of a package with different contents (e.g. devel, debug, so, src, docs, etc.). This also came up in issue ( https://github.com/conda/conda-build/issues/2363 ). This would be useful for a lot of packages where users want extra features (e.g. debug), but also want to have a lean version for Docker images and other deployments (e.g. so). It's also possible to generate a legacy metapackage, which pulls all of the contents back together.
It would be good to figure out what sort of split packages we might want to generate in conda-forge. It would also be good to discuss how we want to integrate their use into other downstream conda-forge packages, and worth thinking about whether there will be any effect on our tooling (e.g. `conda-smithy`).

cc @conda-forge/core @epruesse