NixOS / nixpkgs

Nix Packages collection & NixOS
MIT License

dockerTools.buildLayeredImage creates skewed sized layers #48462

Open coretemp opened 5 years ago

coretemp commented 5 years ago

Issue description

The dockerTools.buildLayeredImage functionality (https://grahamc.com/blog/nix-and-layered-docker-images) was recently merged into master.

It creates Docker images whose layer sizes are very unevenly distributed. See below for more information.

Steps to reproduce

let
  old_pkgs = import <nixpkgs> { };
  pkgs = <snip (assumes that it points at master)>;
  inherit (pkgs) stdenv;

in pkgs.dockerTools.buildLayeredImage {
  contents = [
    pkgs.bashInteractive
    old_pkgs.curl old_pkgs.jq old_pkgs.emacs old_pkgs.git
    old_pkgs.cacert old_pkgs.openssh old_pkgs.gnugrep
    old_pkgs.coreutils
    old_pkgs.which
    old_pkgs.nix
    old_pkgs.gnused old_pkgs.file old_pkgs.gawk
  ];

  name = "foo/bar";
  tag = "latest";
  config.Cmd = [ "/bin/bash" ];
  maxLayers = 25;
}

The layers look like this:

b6cdec945d1b: Loading layer [==================================================>]    477MB/477MB
86e60f3ed08f: Loading layer [==================================================>]  174.1kB/174.1kB
79d24adb2c31: Loading layer [==================================================>]  348.2kB/348.2kB
9354bd19facd: Loading layer [==================================================>]  12.19MB/12.19MB
611dae553d72: Loading layer [==================================================>]  1.178MB/1.178MB
cb916eb69cd1: Loading layer [==================================================>]  71.68kB/71.68kB
ee38b4411beb: Loading layer [==================================================>]    1.7MB/1.7MB
05bad8e60f92: Loading layer [==================================================>]    809kB/809kB
d30df3eb574c: Loading layer [==================================================>]    768kB/768kB
bb3ba7943090: Loading layer [==================================================>]  40.96kB/40.96kB
e7e6d2f94b2e: Loading layer [==================================================>]  5.612MB/5.612MB
15048416f85e: Loading layer [==================================================>]  1.526MB/1.526MB
b00ca1c01fd7: Loading layer [==================================================>]    512kB/512kB
0518754facbc: Loading layer [==================================================>]  92.16kB/92.16kB
13859dfd2c8b: Loading layer [==================================================>]  266.2kB/266.2kB
e696a9248983: Loading layer [==================================================>]    256kB/256kB
560c99588b76: Loading layer [==================================================>]  81.92kB/81.92kB
dcf8aeee86d0: Loading layer [==================================================>]   51.2kB/51.2kB
9b58d0959ca7: Loading layer [==================================================>]  2.898MB/2.898MB
9caed6f91ab9: Loading layer [==================================================>]  112.6kB/112.6kB
e7379e4a6dcb: Loading layer [==================================================>]  1.352MB/1.352MB
4a5ae92f18dc: Loading layer [==================================================>]  40.96kB/40.96kB
d91816088e26: Loading layer [==================================================>]  40.96kB/40.96kB
7ddbb507422c: Loading layer [==================================================>]  21.79MB/21.79MB

As you can see, the sizes are unreasonably distributed (one 477 MB layer alongside a bunch of tiny ones).

My question is: is this the intended behavior? If not, can we have tests that make sure the distribution of the layer sizes follows a particular (statistical) distribution?

Relevant people

@graham-at-target


graham-at-target commented 5 years ago

This is the expected behavior. The layers, big or small, are prioritized by the likelihood that they will be cache hits. I recommend changing your maxLayers to something closer to 100, and I will be updating the documentation about this.
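
Applied to the expression from the issue description, that change is just the following (contents elided):

pkgs.dockerTools.buildLayeredImage {
  contents = [ pkgs.bashInteractive /* ... as above ... */ ];
  name = "foo/bar";
  tag = "latest";
  config.Cmd = [ "/bin/bash" ];
  maxLayers = 100;  # more, smaller layers -> better chance of cache hits
}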

coretemp commented 5 years ago

Thank you.

What is the equivalent of RUN useradd -r -u 1001 -g appuser appuser when using such images? I would like the root user to exist in /etc/passwd such that I can set config.User="root";

graham-at-target commented 5 years ago

You can add a passwd file, like this:

        userSetup = pkgs.runCommand "user-setup" { } ''
          mkdir -p $out/etc/

          echo "root:x:0:0:root user:/root:${pkgs.bash}/bin/bash" >> $out/etc/passwd
        '';

and then add userSetup to the contents list. Edit the shell and home paths as needed :)
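
For completeness, a minimal sketch of how this plugs into the image from the issue description (shell path and user details are illustrative):

let
  pkgs = import <nixpkgs> { };
  userSetup = pkgs.runCommand "user-setup" { } ''
    mkdir -p $out/etc/
    echo "root:x:0:0:root user:/root:${pkgs.bashInteractive}/bin/bash" >> $out/etc/passwd
  '';
in pkgs.dockerTools.buildLayeredImage {
  name = "foo/bar";
  tag = "latest";
  contents = [ pkgs.bashInteractive userSetup ];
  config = {
    Cmd = [ "/bin/bash" ];
    User = "root";
  };
}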

coretemp commented 5 years ago

That works great. Only issue I seem to have now is that /usr/bin/env is not part of the image. /sbin/env is.

coretemp commented 5 years ago

You probably want to add

echo "root:x:0:" >> $out/etc/group

too

graham-at-target commented 5 years ago

You might be able to make this work via:

env-shim = pkgs.runCommand "env-shim" {} ''
  mkdir -p $out/usr/bin
  ln -s ${pkgs.coreutils}/bin/env $out/usr/bin/env
'';

and putting env-shim in contents

(fixed according to the comment below)

coretemp commented 5 years ago

Thanks, you missed a {}, but otherwise this helps a lot.

coretemp commented 5 years ago

Not sure if I am pushing it if I also ask this question here:

When trying to build nixops inside a docker container (nix build -f release.nix build.x86_64-linux):

warning: the group 'nixbld' specified in 'build-users-group' does not exist

graham-at-target commented 5 years ago

"pushing it"

A little bit :)

You might want to check out https://hub.docker.com/r/lnl7/nix/ / https://github.com/lnl7/nix-docker which is for running nix inside the container.

The little bit I do know, though, is that you probably need to run nix as a non-root user so it doesn't try to use build users or features that require the ability to fiddle with the kernel.
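
One hedged way to follow that advice, reusing the passwd/group trick from above, is to bake in an unprivileged user and run the image as it (uid/gid and paths are illustrative, and you may still need to adjust ownership of writable directories yourself):

nonRootSetup = pkgs.runCommand "non-root-setup" { } ''
  mkdir -p $out/etc
  echo "builder:x:1000:1000:builder:/home/builder:${pkgs.bashInteractive}/bin/bash" >> $out/etc/passwd
  echo "builder:x:1000:" >> $out/etc/group
'';

Then add nonRootSetup to contents and set config.User = "builder"; in the buildLayeredImage call.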

coretemp commented 5 years ago

I could use that image as a base, I suppose. I was using that image until I switched to your stuff. My bad.

coretemp commented 5 years ago

@graham-at-target Is it possible to build on top of https://github.com/LnL7/nix-docker while still using your approach too?

chreekat commented 5 years ago

Yo, I recommend closing this issue and opening new ones for new issues. :)

tomberek commented 4 years ago

I've got a WIP that allows one to hint at how many paths to put in which layers: https://github.com/tomberek/nixpkgs/blob/tom/docker_hints/pkgs/build-support/docker/default.nix#L299

Usage would be something like:

let
  repeat = n: a: accum: if n == 0 then accum else repeat (n - 1) a (accum ++ [ a ]);
  r = n: a: repeat n a [ ];
in
buildLayeredImage {
  hints = (r 5 10) ++ (r 100 1);
  ...
}

... meaning: create 5 layers of 10 paths each, then up to 100 further layers with 1 path each. The ordering is the same as in the current system; this just groups the paths differently.

stale[bot] commented 4 years ago

Hello, I'm a bot and I thank you in the name of the community for opening this issue.

To help our human contributors focus on the most-relevant reports, I check up on old issues to see if they're still relevant. This issue has had no activity for 180 days, and so I marked it as stale, but you can rest assured it will never be closed by a non-human.

The community would appreciate your effort in checking if the issue is still valid. If it isn't, please close it.

If the issue persists, and you'd like to remove the stale label, you simply need to leave a comment. Your comment can be as simple as "still important to me". If you'd like it to get more attention, you can ask for help by searching for maintainers and people that previously touched related code and @ mention them in a comment. You can use Git blame or GitHub's web interface on the relevant files to find them.

Lastly, you can always ask for help at our Discourse Forum or at #nixos' IRC channel.

fare commented 4 years ago

I propose that the buildLayeredImage tool have one or two extra parameters that control the grouping of several packages per layer. Thus, in a 178-package image, instead of having 98 layers each with one package and then one huge layer with 79 packages, we'd have 80 layers with 2 packages each, then 18 layers with one package each. Ideally the "last / top" packages explicitly requested by the programmer, which are presumably the most subject to change, would be among the single-package-per-layer (or fewer-packages-per-layer) ones.
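
For illustration, a hedged sketch of this kind of even grouping in Nix; nothing here is an existing nixpkgs API, and `paths` is assumed to be ordered from most-shared to most-specific:

let
  lib = (import <nixpkgs> { }).lib;

  # Distribute n paths over at most maxLayers layers: the first (n mod maxLayers)
  # layers get one extra path, e.g. 178 paths over 98 layers -> 80 layers of 2
  # and 18 layers of 1, with the most-specific paths in single-path layers.
  groupEvenly = maxLayers: paths:
    let
      n = builtins.length paths;
      base = n / maxLayers;
      extra = n - base * maxLayers;
      sizes = lib.genList (i: base + (if i < extra then 1 else 0)) maxLayers;
      go = ss: xs:
        if ss == [ ] || xs == [ ] then [ ]
        else [ (lib.take (builtins.head ss) xs) ]
          ++ go (builtins.tail ss) (lib.drop (builtins.head ss) xs);
    in go sizes paths;
in groupEvenly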

fare commented 4 years ago

See the discussion here between Fare and gchristensen: https://logs.nix.samueldr.com/nixos/2020-07-24#3773968

Another feature that could help alleviate the pain would be if buildLayeredImage accepted a buildImage as initial input and/or let me specify some of the exact layers I want, with e.g. "just this package" or "all these packages", then filled in the blanks.

tomberek commented 4 years ago

@fare : proof of concept https://github.com/tomberek/nixpkgs/blob/tom/docker_hints/pkgs/build-support/docker/default.nix#L299 that I've successfully used in the past to give me control over the grouping of the layers. I called it "hints" and would normally do sets of 20 packages. I found it helpful to also give very large packages their own layer to help even out the sizes (improving parallel download and processing speed).

This has probably bit-rotted now that dockerTools has been updated, but the concept may survive.

For discussion:

If both options are given, would both apply? Then you could end up with something like:

[ 8 2     10     10     3 1 6     1 1 1 1 1 1 1 1 1 1 1 1 1 1 ]

if the first group grew above 1 GB and the last set of 10 had a single package that was almost 1 GB.


roberth commented 3 years ago

For a new function, another possible approach is to let the user determine the grouping using an algorithm that takes the whole image closure (wholeClosure) and a user-chosen list of store paths to split on (splitPaths), with the goal of sharing layers between successive versions of the same image.

Emphasis is on successive, because this does not try to deduplicate much between dissimilar images that happen to share some store paths.

It could then fold over splitPaths: fold f [wholeClosure] splitPaths, iterating on the layers starting with wholeClosure. f will split each layer into potentially three layers. First, we check whether splitPath occurs in the layer. If not, we return the layer as is. Otherwise, we compute the intersection between the splitPath closure and the closure of the layer after removing splitPath itself. This represents the commonalities. With this intersection, we can return the layers: the part of the layer outside the splitPath closure, the intersection itself, and splitPath together with its exclusive closure.
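
A rough sketch of that fold in Nix, assuming a helper closureOf that maps a list of store paths to their combined closure (it could be built on pkgs.closureInfo plus import-from-derivation); all names here are illustrative:

{ lib, closureOf }:
let
  # Split one layer on splitPath into up to three layers: the part of the layer
  # outside splitPath's closure, the part shared with the rest of the layer,
  # and splitPath's exclusive part. Empty pieces are dropped.
  splitLayer = splitPath: layer:
    if !(builtins.elem splitPath layer) then [ layer ]
    else
      let
        splitClosure = closureOf [ splitPath ];
        restClosure = closureOf (builtins.filter (p: p != splitPath) layer);
        inSplit = p: builtins.elem p splitClosure;
        inRest = p: builtins.elem p restClosure;
        remainder = builtins.filter (p: !(inSplit p)) layer;
        common = builtins.filter (p: inSplit p && inRest p) layer;
        exclusive = builtins.filter (p: inSplit p && !(inRest p)) layer;
      in builtins.filter (l: l != [ ]) [ remainder common exclusive ];

  # fold f [wholeClosure] splitPaths: start with one big layer and refine it.
  layering = wholeClosure: splitPaths:
    builtins.foldl' (layers: sp: lib.concatMap (splitLayer sp) layers)
      [ wholeClosure ] splitPaths;
in layering

Each splitLayer call only repartitions the elements of the layer it is given, so no store path ends up in more than one layer, matching the bound below.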

This looks like it has O(3^#splitPaths) worst case complexity, but fear not: the complexity is also bounded by O(#items(wholeClosure) * #splitPaths) because no store path is duplicated. A clever implementation could bring the average down further, but I don't think we have a performance issue here.

Comparing to buildLayeredImage

Because this solution operates entirely on set operations over closures specified by humans, instead of low-level store-path-based heuristics, it will only split layers across fairly high-level boundaries. Another way to see this: if you refactor a derivation by splitting out a dependency, you won't cause a new layer, unless it happens to introduce a new commonality with another splitPath or the whole closure, which is rare.

This approach will not guarantee reuse between distinct images (i.e. not versions of the same image), although it may still cause some happy collisions.

Comparing to buildImage

buildImage itself creates a single layer, so we should be comparing to a chain of buildImage calls, which behaves like a Dockerfile. Besides the downside of having to copy files over and over when low layers change, it can only find commonalities with entire previous layers. It's not always feasible to find these commonalities, unless you want to explicitly specify layers for low-level store paths, like your libc, or some set of libraries; a list that should change as your image evolves but will most likely bitrot.

You lose a little bit of control using this solution, but it seems unlikely that it will cause the number of layers to increase by too much. If it does, you can reduce the number of splitPaths.

Further optimization: split "configuration"

This is somewhat independent of the algorithm I'm proposing and can probably be applied to the other approaches as well. One problem that the algorithm may suffer from is having too few splitPaths, causing large "remainder" layers which change too often. Although the former can only be solved by deferring to another clever algorithm, like @tomberek's or the existing one, we can improve the latter by detecting which files are "configuration". Small files near the top of the closure are typically configuration files, like /etc contents, startup scripts and such, that don't benefit from multiple layers, yet change more often than the big "remainder" layer. By taking those out of that layer, we reduce the number of versions of that big layer.

This is similar to the "customization layer" but also applies to store paths and it's automatic. Maybe it's not similar, except for the name.

Erroneous splitPaths

Despite their small number, splitPaths still have to be maintained. Depending on how the expression is designed, one of the splitPaths could, by accident, start referring to a path that's not in the actual image closure. This should be reported as an error, both to avoid accidental inclusion of irrelevant paths, and to provide an opportunity to fix the problem.

Call to action

This idea has been sitting in the back of my mind for a while, but I don't suffer from this issue. So as much as I'd like to see an implementation, and I hate to say it, I can't justify implementing it as things stand. So if anyone is looking for an interesting project, or wants to solve the problem really badly, please give it a shot, ask me if anything is unclear, and I'd be happy to review.

adrian-gierakowski commented 3 years ago

@roberth I'd like to work on your proposal, starting later this week or at the beginning of next. I think it might be useful to have a quick chat to make sure I understand your idea and that it solves the problems I've been wanting to solve. Let me know if you'd be able to hop on a Skype call any time this week. You can DM me on https://discourse.nixos.org/ if necessary. Thanks!

bhpdt commented 3 years ago

Apologies @adrian-gierakowski if you've been hacking away on this locally, but we actually implemented a solution to this issue a week ago (without seeing this issue, somehow).

We had the same problem as the OP: an enormous final layer with essentially all the dependencies, where a tiny change to the application forces that big layer to be re-shipped. We see three potential heuristics for when the number of packages exceeds the number of layers:

  1. Prefer the lowest dependencies (current behavior). This maximizes sharing of packages like glibc across different applications.
  2. Prefer the highest dependencies. This minimizes the diffs to upload/download between iterative releases of the same application.
  3. Use some domain specific knowledge to group more intelligently. (*)

We have implemented (1) and (2) as selectable out-of-the-box algorithms in buildLayeredImage, and provided a hook for users to implement (3) relatively easily. Let us know what you think!

In our case, switching from (1) to (2) reduced cycle iteration times in kubernetes by more than a full order of magnitude.


(*) Another thing we tried, mentioned above, was to split evenly. For example, with 500 pkgs and 100 layers, put 5 pkgs in each layer; but upon further consideration this does not have the cross-app sharing properties of (1), and is just less efficient than (2) for one app, so we scrapped it.
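
Conceptually, heuristic (2) can be sketched like this (not the actual implementation; `paths` is assumed to be ordered from most-shared at the front to most application-specific at the back):

{ lib }:
paths: maxLayers:
let
  n = builtins.length paths;
  cut = lib.max 0 (n - (maxLayers - 1));
  shared = lib.take cut paths;   # the low, widely shared paths: one big layer
  own = lib.drop cut paths;      # the high, application-specific paths: own layers
in (lib.optional (shared != [ ]) shared) ++ map (p: [ p ]) own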

adrian-gierakowski commented 3 years ago

@bhpdt no worries, I only spent a couple of hours last week on this. Anyway, the work I'm about to do is mostly orthogonal to what you've done (although it's trying to address a similar problem).

stale[bot] commented 2 years ago

I marked this as stale due to inactivity. → More info

jrsala-auguration commented 2 years ago

In my team we've had a similar issue, and I had an idea for a solution to a different but closely related problem: the case where you know the set of images that need to be built and therefore have a full view of all the packages that are going to end up in Docker images.

With that assumption, it's possible to (attempt to) search for a globally optimal layering that solves many issues at once, rather than merely hoping the layering scheme that was used for one image is going to be of help in another. Maybe this can be yet another layering strategy. I'm not sure it would fit into @adrian-gierakowski's PR however, because it likely could not be expressed within the DSL. It's also possible this is overly complex and that @roberth's and @bhpdt's very pragmatic ideas yield better results.

Since my team is slowly moving away from having many Docker images I have not dedicated time to writing code for this and I can only contribute an idea at this point:

The fundamental idea is to create layers such that each layer comprises packages required by the same images and only by those images. So there's this idea of minimalism, and of a "canonical" layering. Then, if the resulting layering exceeds the maximum number of layers allowed for a given image, it's possible to perform merge operations (described further below) on those layers.

Formally, suppose you want to build a known, finite set I of images. Let P be the union of the closures of the contents of the images in I (i.e. the set of all packages to be distributed among layers), and Gp the dependency graph of the packages in P. Let R be the relation over P such that pRq if and only if the set of images that require package p is equal to the set of images that require package q. R is an equivalence relation. Then, the candidate layers are the nodes of the (directed) quotient graph Gp/R. The layers to be used to build a final image are those whose contents are required by that image.

Before going any further, here's a practical example: suppose you want to make images 1, 2 and 3 with the following contents:

Image Contents
1 [a, b]
2 [c]
3 [f]

where the packages in P have the following dependency graph (with arrows pointing down so dependents are on top and dependencies are towards the bottom):

a   b   c
 \ / \ / \
  d   e   f
 / \ / \ /
g   h   i

There are several ways to compute the quotient graph. An intuitive but sub-optimal one is to first compute the mapping of packages to sets of images that require them:

Package Images that need it
a {1}
b {1}
c {2}
d {1}
e {1, 2}
f {2, 3}
g {1}
h {1, 2}
i {1, 2, 3}

and then to invert that mapping and regroup the packages corresponding to a given set of images:

Set of images Packages that are required only by those images
{1} [a, b, d, g]
{2} [c]
{1, 2} [e, h]
{2, 3} [f]
{1, 2, 3} [i]

The sets of packages in that second column are the (contents of the) candidate layers. The quotient graph that they form can be built from this table because it's a subgraph of the graph induced by the subset relation over the sets of images in the first column (more on that later).
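
As a minimal sketch, the "invert and regroup" step could be written with nixpkgs' lib.groupBy, assuming a hypothetical requiredBy function that returns the sorted list of image names needing a given package:

{ lib, requiredBy, allPackages }:
# Candidate layers keyed by the set of images that need them, e.g.
# { "1" = [ a b d g ]; "1,2" = [ e h ]; "1,2,3" = [ i ]; ... }
lib.groupBy (pkg: lib.concatStringsSep "," (requiredBy pkg)) allPackages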

Here's the quotient graph, with each row of the above table reproduced as a node (arrows still pointing down, this graph is also directed):

{1} [a, b, d, g]       {2} [c]
         \              /  \
          \            /    \
           \          /      \
          {1, 2} [e, h]     {2, 3} [f]
                 \          /
                  \        /
                   \      /
               {1, 2, 3} [i]

Say you want to build image 2. You would take the layers corresponding to the "image sets" containing 2: {2} [c], {1, 2} [e, h], {2, 3} [f] and {1, 2, 3} [i].

Note: the algorithm we presented above to compute the quotient graph is sub-optimal. The optimal algorithm (I mean for this particular problem, not to compute quotient graphs in general) is to traverse the full dependency graph Gp in topological order starting at the roots defined by the image contents and, as you go, to mark the packages with the images that need them.

We can see that as we go down this graph of layers, the layers become required by more images. A node's image set is always a strict superset of any of its parents' image set.

Indeed, let (S, Q) be a node of the quotient graph of layers where S is the layer's image set and Q is the set of packages in the layer, and let (S', Q') be a parent node of (S, Q). By definition of the original graph Gp as a dependency graph, at least one package in Q is a dependency of some package in Q'. Since the packages in Q' are required by the images in S', it follows that those packages in Q are transitively required by the images in S'. Since Q is an equivalence class for relation R (reminder: pRq if and only if packages p and q have the exact same "requiring images"), it follows that all packages in Q are required by the images in S'. Since S is the exact set of images that require the packages in Q, and S ≠ S' (otherwise they wouldn't form two distinct nodes), we conclude that S' ⊊ S.

Once we have the layers that form an image, the order in which they're stacked to produce the image is arbitrary but it seems best if it follows a topological ordering of the graph of layers, i.e. the deepest layer in the Docker image is needed by the most images so it should also be the lowest layer in the graph of layers, and vice versa for the top layer which is the most image-specific.

It is possible that the number of layers generated by this method exceeds some allowed maximum. Maybe the graph of layers produces a set of layers small enough for one image but too large for another. In that case, we need to merge some of the layers required by the image that needs fewer layers. Those merges make our solution less optimal in order to satisfy the constraint. There are two possible approaches: either we push packages down from image- and application-specific layers into shared layers, where some of the images using those layers don't need them, or we bubble packages up from lower, shared layers towards the more image- and application-specific layers, in which case they get duplicated across layers. I think it's best not to add to an image packages that were not requested by the user. So that leaves the second option:

Consider a node (S, Q) in the graph of layers and call (s1, q1), ..., (sn, qn) its parent nodes. By definition of the graph of layers, all the packages in Q are needed by the images in the si associated with its parent nodes (the nodes above). So we can safely remove that layer from the graph and merge its package set Q "upwards" into the package sets qi of the parent nodes. In fact, it is sufficient to merge Q into any subset of its parents { (s'1, q'1), ..., (s'm, q'm) } such that S = s'1 U ... U s'm, because what matters is that no image loses the packages it needs. (I'm not sure what happens if the s'i overlap though: two layers required by the image would contain the same nix store paths.)

Note: it's possible to prove that if an upwards merge is at all possible, then you need look no further than into parent nodes to find a viable set of merge "targets", i.e. that you don't have to merge into farther ancestor nodes to guarantee that all images will get the packages they need, but that proof is easy and left to the reader.

In the example, we can merge {1, 2} [e, h] into {1} [a, b, d, g] and {2} [c], which gives us an updated graph:

{1} [a, b, d, g, e, h]   {2} [c, e, h]
             \              |   \
              \             |    \
               \            |     \
                \           |   {2, 3} [f]
                 \          |     /
                  \         |    /
                   \        |   /
                   {1, 2, 3} [i]

Images 1 and 2 still have all the packages they need and nothing has changed for image 3.

More interestingly, for {1, 2, 3} [i], there are several possible merges because there are several ways to express {1, 2, 3} as a union of the image sets of parent (and even grandparent) nodes:

{1, 2, 3} = {1} U {2, 3} = {1, 2} U {2, 3} = {1} U {1, 2} U {2, 3} = ...

Here's the result of merging {1, 2, 3} [i] into {1} [a, b, d, g] and {2, 3} [f]:

{1} [a, b, d, g, i]    {2} [c]
         \              /  \
          \            /    \
           \          /      \
          {1, 2} [e, h]     {2, 3} [f, i]

Because we merged into a grandparent, now package e which depends on i appears above it, but the layers will still work to produce the desired images.

At this point one can wonder when such opportunities for upwards merges are present in a graph of layers, and the answer is "in most cases that are of interest to us". Let's imagine that for a given layer node (S, Q), there are no (s1, q1), ..., (sn, qn) among its parents such that S = s1 U ... U sn. In other words, some images in S are nowhere to be found in the si.

In our example, trivially, a layer node such as {2} [c, e, h] satisfies this condition since it is specific to one image. It cannot be merged upwards since there's no upper layer to merge it into. But this condition of "unmergeability" is also satisfied by {2, 3} [f] because 3 does not occur in its parent nodes' image sets. This occurs because there is no package required by image 3 that is not also required by another image.

I think it is rare enough for an image to have no package specific to it that we can assume upwards merges will almost always be possible. But this can happen if you make a final, tagged image containing, say, Chrome and Firefox, and one containing only Firefox.

With that being said, how do you choose among the many possible upwards merges? A (naive?) measure of the quality of an upwards merge is how many extra bytes it introduces into the layers as a whole. If packages p1, ..., pn in layer Q get duplicated across d parent layers and Q is then removed, the net weight gain of the operation is (d - 1) * weight(Q) = (d - 1) * (weight(p1) + ... + weight(pn)).
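
For example, if Q weighs 40 MB and gets duplicated into d = 3 parent layers before being removed, the net weight gain is (3 - 1) * 40 MB = 80 MB.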

Minimizing the d-factor for a given layer node (S, Q) is easy: just search among (S, Q)'s parent nodes (s1, q1), ..., (sm, qm). Since their image sets are the largest strict subsets of S to be found in the graph, the smallest union of si yielding S will most easily come from them.

So to search for the best candidate upwards merges, walk the graph of layers topologically (actually, only the subgraph that concerns images that have too many layers) and at each node, if it can be used for an upwards merge, calculate its d factor and its potential net weight gain.

Let's assume that a set Q of packages has been merged into q1, ..., qm. Then, each qi's weight has increased by weight(Q) so the net weight gain of upwards-merging qi itself has increased by (di - 1) * weight(Q) where di is the d-factor of (si, qi).

As you traverse the graph and compute the candidate upwards merges, store them in a min-priority queue. Then, get popping and merging until you've reached the desired number of layers. As you remove elements from the queue and perform the corresponding merges, also update the priorities of the candidate merges of the fattened parent qi layers (some kind of "decrease key" operation on the priority queue, or even "k-decrease key" maybe?).

This weight gain minimization approach may not be the best though. Maybe it's best to merge more often closer to the bottom of the graph, because the packages there are shared the most and change the least, whereas up top any large layer created by merges would have to be rebuilt whenever a package changes, which happens more often there, close to applications. I suspect the lowest-level packages tend to be small though, so the goals of merging only small layers and merging only deep layers could converge.

Another interesting question to ask is how the layering produced by this technique evolves as images are added or removed, and as image contents and packages evolve. We would like to be able to simply re-run the algorithm and be able to reuse most of the previously-built layers, especially if they're deep or big. And then there's the question of how layer merges interact with those evolutions, but hopefully from one evolution to the next the layers selected for merging and therefore the resulting merged layers would remain the same. I haven't gone very far thinking about those problems.

A new image x could be added. Any packages that it introduces that are not already needed by other images end up in a layer specific to x. Those packages could depend on packages already somewhere in the graph, so edges would be introduced and layers might get split as some of their packages would now belong in a different R-equivalence class. The new image x will reuse layers that already contain only packages it needs, but if a package that x requires is only ever present in layers together with packages x doesn't require, then at least one of those layers needs to be split.

An image x could be removed. The opposite of the above would happen. The layer specific to x, if it exists, would disappear. If there exists a pair of layers with image sets S and S U {x} (where S doesn't contain x) then removing x would allow us to merge those layers.

Then there are the cases of adding packages to an image that are already in P (therefore somewhere in the graph), adding entirely new packages (which may introduce new edges in Gp), updating the dependencies of a package, removing a package, etc...

For completeness' sake, updating a package in any way, even when its dependencies don't change, causes the layers it belongs to and their ancestors in the graph to need a rebuild.

That's all from me, please tell me what you think, or if I've made errors or omitted something. Thanks for reading!