aspect-build / rules_py

More compatible Bazel rules for running Python tools and building Python projects
Apache License 2.0

[FR]: Container image layers per-package #244

Open alexeagle opened 9 months ago

alexeagle commented 9 months ago

What is the current behavior?

Currently our best-effort to make a layered image for Python is three layers: interpreter, site-packages, app: https://github.com/aspect-build/bazel-examples/blob/main/oci_python_image/py_layer.bzl

However, when a package appears under multiple applications, the site-packages layer for each app contains a duplicate copy of that package, making the build outputs larger. When pytorch is one such package (the GPU variant is many gigabytes), builds can spend minutes just fetching build inputs from a remote cache.

Describe the feature

We could probably use an aspect to visit each third-party package and produce a tar file containing its site-packages entry as a distinct layer. Then, when we build a container image for an application, these tar files would be re-used rather than re-built.
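A rough sketch of what such an aspect could look like (illustrative only, not an implementation: the provider name, the external-repo check, and the //tools:package_tar helper are all hypothetical):

PackageLayerInfo = provider(
    fields = {"tars": "depset of per-package layer tars"},
)

def _package_layer_aspect_impl(target, ctx):
    # Collect layer tars already produced for transitive deps.
    transitive = [
        dep[PackageLayerInfo].tars
        for dep in getattr(ctx.rule.attr, "deps", [])
        if PackageLayerInfo in dep
    ]
    direct = []

    # Hypothetical heuristic: only targets from external repos (i.e. third-party
    # packages) get their own layer. A real implementation would also need to
    # place files at their runfiles/site-packages paths inside the tar.
    if target.label.workspace_name:
        out = ctx.actions.declare_file(target.label.name + ".layer.tar")
        args = ctx.actions.args()
        args.add("--output", out)
        args.add_all(target[DefaultInfo].files)
        ctx.actions.run(
            executable = ctx.executable._tar_tool,  # hypothetical tarring tool
            arguments = [args],
            inputs = target[DefaultInfo].files,
            outputs = [out],
        )
        direct.append(out)

    return [PackageLayerInfo(tars = depset(direct, transitive = transitive))]

package_layer_aspect = aspect(
    implementation = _package_layer_aspect_impl,
    attr_aspects = ["deps"],
    attrs = {
        "_tar_tool": attr.label(
            default = "//tools:package_tar",  # hypothetical
            executable = True,
            cfg = "exec",
        ),
    },
)

A py_image-like rule could then request this aspect on its binary's deps and pass the collected tars straight to the image rule as layers; because the tar actions hang off the packages themselves, identical layers are shared (and cached) across every image that uses them.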

ewianda commented 9 months ago

Hi @alexeagle, I had been thinking about this problem, but never thought I could use an aspect for it. After seeing your suggestion, I went ahead and implemented this feature. The bummer was that I hit the maximum number of layers, which I believe is 127 for Docker:

3557592c9b6f: Loading layer [==================================================>]     432B/432B
c7d6f3eb36c2: Loading layer [==================================================>]     794B/794B
956257f1d3a3: Loading layer [==================================================>]  3.183kB/3.183kB
667da44de747: Loading layer [==================================================>]  1.343kB/1.343kB
88a0745cd9e2: Loading layer [==================================================>]   11.6kB/11.6kB
f3af998a1f5e: Loading layer [==================================================>]  3.543kB/3.543kB
9abe9f6a4fae: Loading layer [==================================================>]     906B/906B
3b5c52259645: Loading layer [==================================================>]  1.711MB/1.711MB
710de8cf73d3: Loading layer [===>                                               ]  32.77kB/497.4kB
max depth exceeded

Not sure if you have thoughts on how this should play out.

I can share my implementation, maybe in

https://github.com/aspect-build/bazel-examples/tree/main/oci_python_image

siddharthab commented 8 months ago

A layer per package would be too many layers for some Docker filesystem implementations (I forget which ones). We typically keep separate image layers for the large packages and bundle all the smaller packages together. Something like this:

load("@aspect_bazel_lib//lib:tar.bzl", "tar")
load("@rules_oci//oci:defs.bzl", "oci_image")

HEAVY_PIP_PACKAGES = ["\\(torch\\|triton\\)/", "nvidia_", "xgboost/", "\\(jaxlib\\|jax\\)/", "scipy/", "numpy/"]
PY_INTERPRETER_REGEX = "\\.runfiles/rules_python~[^~/]\\+~python~python_[^/]\\+_x86_64-unknown-linux-gnu"
PIP_PACKAGES_REGEX = "\\.runfiles/rules_python~[^~/]\\+~pip~"
EXTERNAL_PACKAGES_REGEX = "\\.runfiles/[^/]\\+~"

LAYER_FILTER_CMDS = {
    "_app": "grep -v -e '{}' -e '{}' -e '{}' $< >$@".format(PY_INTERPRETER_REGEX, PIP_PACKAGES_REGEX, EXTERNAL_PACKAGES_REGEX),
    "_external": "(grep -e '{}' $< | grep -v -e '{}' -e '{}' >$@) || true".format(EXTERNAL_PACKAGES_REGEX, PIP_PACKAGES_REGEX, PY_INTERPRETER_REGEX),
    "_interpreter": "grep -e '{}' $< >$@".format(PY_INTERPRETER_REGEX),
    "_pip": "(grep -e '{}' $< | grep -v -e '~pip~pip_[^_]\\+_\\({}\\)' >$@) || true".format(PIP_PACKAGES_REGEX, "\\|".join(HEAVY_PIP_PACKAGES)),
} | {
    "_pip_heavy_{}".format(i): "(grep '\\.runfiles/rules_python~[^~/]\\+~pip~pip_[0-9]\\+_{}' $< >$@) || true".format(suffix)
    for i, suffix in enumerate(HEAVY_PIP_PACKAGES)
}

# Keep heavy, slow-changing layers first.
ORDERED_LAYERS = ["_interpreter", "_external"] + ["_pip_heavy_{}".format(i) for i in range(len(HEAVY_PIP_PACKAGES))] + ["_pip", "_app"]

def py_image(...):
    ...
    for (suffix, cmd) in LAYER_FILTER_CMDS.items():
        native.genrule(
            name = manifest + suffix,
            srcs = [manifest],
            outs = [manifest + suffix + ".spec"],
            cmd = cmd,
            **kwargs
        )
        tar(
            name = layers + suffix,
            srcs = [binary],
            mtree = manifest + suffix,
            **kwargs
        )

    oci_image(
        name = name,
        base = base,
        entrypoint = ["/entrypoint.sh", entrypoint],
        workdir = "/" + entrypoint + ".runfiles/_main",
        tars = [entrypoint_layer] + [
            layers + suffix
            for suffix in ORDERED_LAYERS
        ],
        **kwargs
    )

alexeagle commented 8 months ago

@ewianda thanks for that investigation. Since we can only have "a few" layers, I don't have a principled suggestion for this. Special-casing a handful of packages like pytorch seems like the only valid compromise, but it's not a very clean API to ask the user to tell us which packages are like that at the py_image call site. Maybe the best we can do for this issue is document the workaround. I hope that as the ML folks reach a steadier state they can follow normal software engineering practices and not ship multi-gigabyte packages?

siddharthab commented 8 months ago

Yes, documentation and guidance are our best bets here. The code that I pasted above can be refactored to accept a list of heavy packages from the user to tease out into separate layers. Anything more (like setting a wheel size limit and automatically computing this list) will need support from other components in the ecosystem.
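To make that concrete, here is a rough sketch (not the actual refactor) of pulling the heavy-package list out into a parameter; the helper name is made up, and the patterns are the same escaped grep patterns used in the snippet above:

def heavy_layer_filters(heavy_pip_packages):
    """Returns one manifest-filter command per heavy package, keyed by layer suffix."""
    return {
        "_pip_heavy_{}".format(i): "(grep '\\.runfiles/rules_python~[^~/]\\+~pip~pip_[0-9]\\+_{}' $< >$@) || true".format(pattern)
        for i, pattern in enumerate(heavy_pip_packages)
    }

A py_image macro could then accept heavy_pip_packages as an argument, merge this dict into the fixed filters, and derive ORDERED_LAYERS from the same list.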

betaboon commented 6 months ago

The nix ecosystem had a similar problem of limiting the layer count when building images with a layer per package. IIRC the approach described in this blog post by Graham is what ended up being implemented. Maybe that's of interest to some here.
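For readers who don't follow the link: the rough idea, as I understand it, is to cap the total layer count by giving only the top-ranked packages their own layer and bundling everything else. A toy sketch of that bucketing step (ranking by size here; the post describes a more refined ranking, and none of this is taken verbatim from it):

def _entry_size(entry):
    return entry[1]

def assign_layers(package_sizes, max_dedicated_layers):
    """package_sizes: dict of package name -> size in bytes (hypothetical input)."""
    ranked = sorted(package_sizes.items(), key = _entry_size, reverse = True)
    dedicated = [name for name, _size in ranked[:max_dedicated_layers]]
    bundled = [name for name, _size in ranked[max_dedicated_layers:]]
    return dedicated, bundled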

alexeagle commented 4 months ago

Nice, thanks @betaboon, that's indeed very interesting. I dunno when we'll get to it... and it's not really a rules_py issue, it's more of a rules_oci issue: rules_oci should do something smart when there are "too many layers". Really it's another Docker problem; docker build was so poorly thought out :(