jakirkham opened this issue 6 years ago
In general, I think this whole thing is mainly applicable to packages producing binary executables and libraries. Dynamically compiled or interpreted software doesn't fit the model very well (until we can ship the `.pyc` without the `.py`).
The separation is usually done primarily by "use case". Do I want to install the package as a User (I want to run the things provided), as a Package (I need to link to the things provided), as a Maintainer (I want to build packages), as a Tester (I want to debug the package), or as a Developer (I want to write dependent code)? The typical split by install path looks like this (see the recipe sketch below):

- `-dev`: `$PREFIX/include/`, `$PREFIX/lib/**.{a,la}`
- `-doc`: `$PREFIX/share/doc/`, `$PREFIX/share/man/`
- `-dbg`: `$PREFIX/lib/debug/`
- main package: everything else under `$PREFIX/`

To reduce storage needs for distribution, this one is also common:

- `-data`: `$PREFIX/share/`

For packages containing major libraries:

- `libXXX`: `$PREFIX/lib/**.{so,dylib}`, `$PREFIX/lib/**.so.NNN`, `$PREFIX/lib/**.NNN.dylib`
- `XXX`: `$PREFIX/bin`
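A minimal, untested sketch of how these groupings could map onto `conda-build` 3 multi-output recipes, using a hypothetical package `foo` (all names and globs are illustrative; source and build sections are omitted):

```yaml
package:
  name: foo-split
  version: 1.0.0

outputs:
  - name: libfoo               # runtime: shared libraries only
    files:
      - lib/libfoo*.so*        # [linux]
      - lib/libfoo*.dylib      # [osx]
  - name: foo-dev              # development: headers plus static/libtool archives
    files:
      - include/
      - lib/*.a
      - lib/*.la
    requirements:
      run:
        - {{ pin_subpackage('libfoo', exact=True) }}
  - name: foo-doc              # documentation and man pages
    files:
      - share/doc/
      - share/man/
  - name: foo-dbg              # separated debug information
    files:
      - lib/debug/
  - name: foo                  # user-facing tools and everything else
    files:
      - bin/
    requirements:
      run:
        - {{ pin_subpackage('libfoo', exact=True) }}
```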
These are pretty much what the mainstream distros commonly use. Given their level of experience, I think it's a good start to just go with them. There are plenty of reasons; I'll make a case if you want. A good place to start might be the `debmake` manual, as that is the build tool for much of the `dpkg`-based distros.
One additional split case is common if part of a package's functionality requires a large set of dependencies to be installed, e.g. in the case of optional GUI features.
With regards to automation from `conda-smithy`: the more, the better. I don't know how easy this is to implement, as much of it is in the hands of `conda-build`, and exceptions would have to be accommodated somehow. Technically, if the package installs its files according to the FHS, the `-dev`, `-doc` and `-dbg` parts are easy enough to auto-extract, leaving the rest for the primary package. The `-dbg` part in particular should be extracted fully automatically, as it's too difficult to get right manually and too easy to automate. Some of the build/host/run dependencies could also be derived, I believe. I'm not very familiar with the new `host:` concept yet, so it's hard to say how that plays into it all. Generally, `XXX-dev` needs `libXXX` and all other `*-dev` packages from the build requirements, while `XXX` needs `libXXX` (if present) and the `lib*` packages from build.
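To make that dependency rule of thumb concrete, here is a rough, untested sketch for a hypothetical package `xxx` built against an equally hypothetical `libbar`/`libbar-dev` pair:

```yaml
package:
  name: xxx-split
  version: 1.0.0

outputs:
  - name: libxxx
    requirements:
      run:
        - libbar                                      # the lib* packages from the build requirements
  - name: xxx-dev
    requirements:
      run:
        - {{ pin_subpackage('libxxx', exact=True) }}  # XXX-dev needs libXXX ...
        - libbar-dev                                  # ... and the other *-dev packages from build
  - name: xxx
    requirements:
      run:
        - {{ pin_subpackage('libxxx', exact=True) }}  # XXX needs libXXX if present
```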
Maybe less common, but `-test` could also be added to that list.
Another aspect is that parts like `-data` and `-doc` could potentially be made `noarch`. But if the main package isn't `noarch` itself, this would need some special handling in the build matrix to make sure they are only built once.
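For illustration, a `noarch` documentation output next to an arch-specific main package could be declared roughly like this (hypothetical names, untested; the build-matrix deduplication mentioned above is not addressed here):

```yaml
package:
  name: foo-split
  version: 1.0.0

outputs:
  - name: foo                  # arch-specific main package
    files:
      - bin/
      - lib/
  - name: foo-doc              # architecture-independent documentation
    build:
      noarch: generic
    files:
      - share/doc/
      - share/man/
```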
In general, I am pretty against splitting up packages unless this is what the upstream projects themselves do. It creates a weird situation where users don't get the software that they ask for.
That said there may be a few core packages that would benefit from being split up (such as the runtime libraries vs compilers).
Alternatively, with a high degree of automation, packages could be split up and a metapackage created that has all of the subpackages as dependencies. I don't know how easy or hard this would be to do, but it still seems like the wrong path in general.
I have to say that I'm on the fence on this one. From a packager's point of view, splitting makes things easier and lighter. From a user's perspective, 80% of the time I want the whole piece of software and don't want to worry about whether I need the `-dev`, `-headers`, `-bin`, etc.
The metapackage way adds a ton of work but would satisfy both worlds.
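Sketched out, the metapackage approach could look something like this (hypothetical names, untested): the split pieces carry the files, and an empty `foo` output just depends on all of them, so `conda install foo` still gives users the whole thing.

```yaml
package:
  name: foo-split
  version: 1.0.0

outputs:
  - name: libfoo
    files:
      - lib/libfoo*.so*
  - name: foo-dev
    files:
      - include/
  - name: foo-doc
    files:
      - share/doc/
  - name: foo                  # metapackage: no files of its own, only dependencies
    requirements:
      run:
        - {{ pin_subpackage('libfoo', exact=True) }}
        - {{ pin_subpackage('foo-dev', exact=True) }}
        - {{ pin_subpackage('foo-doc', exact=True) }}
```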
Note, I understand the desire for lean containers, which is why I am not categorically against splitting up packages. However, I think the right balance is to limit the splitting to only those packages where it makes sense, and not split all packages always (as some lesser package managers do).
Splitting off test data is a pretty important use case. I'm not really in favor of always splitting everything, but where it makes a big space difference for splitting things up, it is pretty essential. Runtime libraries follow this same guideline - postgres/libpq; gcc,gxx,gfortran/libgcc,libstdcxx,libgfortran
The flipside of this is that we've been combining several recipes in defaults into single recipes that utilize split packages, where different packages use the same source or versions are otherwise tied. For example, I just merged gstreamer and gst_plugins_base. I think that's not entirely related to this issue, but in case it is, there you go.
Ok, so arguing the case for splitting packages.
Scope limitations:
Let's have a look at e.g. `qt`:
| name | pkg size | disk size | files |
| --- | --- | --- | --- |
| qt5 | 45MB | 180MB | 6372 |
| libqt5 | 35MB | 112MB | 821 |
| libqt5-dev | 5MB | 50MB | 5463 |
| libqt5-doc | .5MB | 2MB | 120 |
There is no `-dbg` package listed because the binaries were already stripped. So here we'd not save space, but we'd add the feature that meaningful stack traces can be made available when needed. In my experience, the debug information can easily be 80% of a library's size.

This is typical; I've seen it with many packages. And while the ~30% reduction in installed size may not seem like much, it does add up. Also, much like the classic novice web-dev mistake ("What? Not everyone has a 0.01 ms round trip because they work on the same box the web server is running on? 200 ms for an API call? Oh...."), we have to be aware that not everyone works on a powerhouse developer machine with a fancy SSD that can crank out 300,000 IO/s. More realistic is a scientist on a cluster head node, with their home directory backed by a poorly managed and completely overloaded NFS server that'll serve them 100 IO/s on a quiet day. Creating 6000 files can take a minute!
Now multiply that by the 50 packages a complex environment needs, and it's not even done after the coffee break (personal experience).
Also, the sort of compromise illustrated by the already-stripped libraries is another point in favor of split packages. As a package maintainer, I have to make a careful decision on which package features I build. With splitting, I can enable many more features without fear of bloating the package for everyone. I could build the Qt database interface layers (currently not built) and separate them into packages, each of which depends on the respective database's client API library, etc.
For Python (or interpreted languages) we can split the dependencies without a dramatic increase in the complexity of the recipe. I think adding `run_constrained` to the core package and a few outputs that require the optional dependencies is good enough, though it definitely duplicates the work done for PyPI.
See the example for Dask I pulled together a while back: https://github.com/hmaarrfk/dask-feedstock/pull/1
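As a rough illustration of that pattern (loosely inspired by the linked Dask PR, but with made-up names and pins rather than the actual recipe contents):

```yaml
package:
  name: somepkg-split
  version: 1.0.0

outputs:
  - name: somepkg-core          # minimal install: hard dependencies only
    build:
      noarch: python
    requirements:
      run:
        - python >=3.6
      run_constrained:
        # optional extras are not pulled in, but if installed they must be compatible
        - bokeh >=2.0
  - name: somepkg               # full install with the optional extras
    build:
      noarch: python
    requirements:
      run:
        - {{ pin_subpackage('somepkg-core', exact=True) }}
        - bokeh >=2.0
        - pandas
```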
Even on my lab machines, with NVMe SSDs and wired internet connections, running `conda update --all` takes a long time to finish that last step.
Giving this issue a bump again. We have recently seen quite a few pull requests (and a blog post) that deal with reducing the size of conda packages, e.g. https://github.com/conda-forge/hdf5-feedstock/pull/112 and https://github.com/conda-forge/mlflow-feedstock/pull/9. Another typical concern is that the `boost-cpp` headers are quite large; there it would really make sense to split them off and only install them during the build, but not in run environments.
As we are currently doing this in a decentralised effort, I fear that we will end up with similar but slightly different (and thus harder to maintain) approaches. Would there be support for writing a CFEP to standardise this and provide guidance for implementation?
SGTM. Would you be interested in championing such a CFEP?
Championing, meaning writing? Yes, it will take a bit, but I can do that.
Yes, that would be incredibly helpful if you were able to take that on. 👍
Maybe we should add this item to our next conda-forge meeting as well?
@jakirkham When are these meetings and are they public?
16:30 GMT every other Thursday. There's been a public calendar link posted in a recent PR, though it may not be fully accurate. They are public.
My general view (not particularly strong) leans more towards the inverse of this:
We should provide a recommended way to strip the unneeded bits when doing things like building AMIs and containers.
@mariusvniekerk I'm doing this at the moment, but that only works well for the final docker images. I run into issues when we later do consistency checks on these environments (like https://github.com/conda/conda-pack/issues/85), and the whole build process for a docker container also takes quite a while, as we download a lot of unnecessary data. Simply trimming this down (and maybe adding parallel downloading to conda) would already save me a lot of time when iterating on the containers.
In practice I think the biggest issue for users is static libraries showing up in deployment environments. Splitting those out seems pretty reasonable and doesn't lead to a large number of splits.
Instead of one mega CFEP that will never be finished, here's a CFEP just for static libraries: https://github.com/conda-forge/cfep/pull/34
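For reference, the kind of split proposed there could look roughly like this in a recipe (hypothetical `libfoo` naming and layout; the exact conventions are what the CFEP is meant to settle):

```yaml
package:
  name: libfoo-split
  version: 1.0.0

outputs:
  - name: libfoo               # shared library only; what deployment environments need
    files:
      - lib/libfoo*.so*        # [linux]
      - lib/libfoo*.dylib      # [osx]
  - name: libfoo-static        # static archives, split out for the few builds that need them
    files:
      - lib/libfoo*.a
    requirements:
      run:
        - {{ pin_subpackage('libfoo', exact=True) }}
```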
In `conda-build` 3 it is possible to have multiple package outputs. This can be useful when a build requires compiling multiple pieces of source together, but the result is better distributed as separate packages after the fact (e.g. LLVM, gcc, APR, etc.).
However another use case is distributing multiple components of a package with different contents (e.g. devel, debug, so, src, docs, etc.). This also came up in issue ( https://github.com/conda/conda-build/issues/2363 ). This would be useful for a lot of packages where users want extra features (e.g. debug), but also want to have a lean version for Docker images and other deployments (e.g. so). It's also possible to generate a legacy metapackage, which pulls all of the contents back together.
It would be good to figure out what sort of split packages we might want to generate in conda-forge. It would also be good to discuss how we want to integrate their use into other downstream conda-forge packages, and worth thinking about whether there will be any effect on our tooling (e.g. `conda-smithy`).

cc @conda-forge/core @epruesse