rhs opened this issue 6 years ago
I do not quite understand this. I understand that the SHA is used to tag the container image that is pushed to the registry, but where else is it used that implies that it needs to include the contents of the k8s yaml files that result from forge expanding the templates?
Sorry for the delay, this took a bit longer to swap into my brain than I expected. Also, apologies for the long post, but I haven't yet figured out a better way to explain this.
I will start with a bit more context. There are really a number of requirements interacting here:
1. Single source of truth for a service: all the manifest templates and source code for a service should be able to live in a single git repo, i.e. a single git repo serves as the source of truth for your service.

2. Deterministic build process: any input source tree must result in exactly the same build output every single time.

3. Stateless build process / no loops in the build process: one way to achieve (1) and (2) is to have a build process that:

   1) builds containers from one version of the source code
   2) assigns a version/tag to those containers
   3) uses the assigned tags to compute a new set of manifests
   4) checks those manifests back into the same git repo the source code came from, with some extra metadata to avoid an infinite build loop

   This is awkward for a number of reasons: it pollutes git history, and it generally makes your source code and build process dependent on the particular machinery of whatever CI system you are using.

4. Traceability: we want to be able to easily navigate from a kubernetes resource back to its source of truth, i.e. the commit that originated that resource.

5. Fast and local build/deploy for developers: in particular, we want to be able to quickly build and deploy local changes without having to first commit them into git.
You can think of any kubernetes build process as a set of functions that take in a source tree T and produce a version, a set of containers, and a set of manifests that can deploy those containers. Let's call these f_v(T), f_c(T), and f_m(T) respectively.
The loop build process described in (3) above roughly defines those functions like so:
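Roughly like the following. This is just a toy Python sketch to make the shape concrete; none of the helper names correspond to any real CI system (or to forge):

```python
import uuid

def build_containers(source_tree: str) -> str:
    # Stand-in for `docker build`: the resulting image ID is only as
    # deterministic as your toolchain, so model it as effectively random.
    return uuid.uuid4().hex[:12]

def f_v(source_tree: str) -> str:
    # Steps 1 and 2: build the containers, then assign a version/tag
    # derived from the build output.
    return build_containers(source_tree)

def f_m(source_tree: str, templates: list[str]) -> list[str]:
    # Step 3: use the assigned tag to render the manifest templates.
    version = f_v(source_tree)
    manifests = [t.replace("{{version}}", version) for t in templates]
    # Step 4 (elided): check the rendered manifests back into the repo;
    # that stored result is the only thing that makes the output
    # reproducible after the fact.
    return manifests
```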
Now since the container build is only as deterministic as your build toolchain (i.e. not very), the only way to make this process deterministic is to store the intermediate results, hence checking the manifests back into git (or storing them in a separate git repo).
The approach forge takes is to define these functions like so:
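Roughly (again a toy Python sketch, not forge's actual implementation; the tree-as-a-dict and the {{version}} placeholder are just illustrations):

```python
import hashlib

def f_v(source_tree: dict[str, bytes]) -> str:
    # The version is a pure function of the (non-ignored) file names and
    # contents, so it never depends on a container build.
    h = hashlib.sha1()
    for name in sorted(source_tree):
        h.update(name.encode())
        h.update(source_tree[name])
    return h.hexdigest()[:12] + ".sha"

def f_m(source_tree: dict[str, bytes], templates: list[str]) -> list[str]:
    # The manifests depend only on the templates and the deterministic
    # version, so they can be rendered without building anything.
    version = f_v(source_tree)
    return [t.replace("{{version}}", version) for t in templates]
```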
Now because the version and the manifests can be computed without doing the container build at all, they can be made naturally deterministic, rather than having to store intermediate results in order to artificially make their computation idempotent.
Forge can thereby ensure the total build process is deterministic using only a simple query against the configured docker registry to see if a particular container/tag already exists. This is how forge achieves (1), (2), and (3).
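Continuing the toy sketch above, the container build function then only needs that existence check (the set here stands in for the real query against the docker registry):

```python
def f_c(source_tree: dict[str, bytes], registry: set[str]) -> str:
    tag = f_v(source_tree)
    if tag not in registry:
        # Only build and push when this exact version has never been built.
        registry.add(tag)  # stand-in for `docker build` + `docker push`
    return tag
```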
To achieve (4) and (5), forge defines its version function as <gitrevision>.git for a clean git tree, and as <source-tree-sha>.sha for a dirty git tree, where <source-tree-sha> is a hash of the names and contents of all the (non-ignored) files. This allows the entire process to be applied to both clean and dirty git trees, thereby aiding in (5), and when the tree is clean, the version is traceable, hence (4).
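In sketch form (the arguments here are hypothetical helpers, not forge's API):

```python
def version(tree_is_clean: bool, git_revision: str, source_tree_sha: str) -> str:
    # Clean tree: the version is traceable straight back to a commit (4).
    if tree_is_clean:
        return git_revision + ".git"
    # Dirty tree: hash of the names/contents of all non-ignored files, so
    # you can build and deploy without committing first (5).
    return source_tree_sha + ".sha"
```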
A side effect of this scheme is that because the version is computed from the entire source tree, it changes whenever the manifest templates change, since they are also in source control. This is what triggers a new container build: we need a new container for that new version as well.
There are a couple of different options I can think of for addressing this, but they all have various drawbacks. I think one of the better ones so far is changing f_v(T) for dirty trees to exclude the k8s directory (possibly via a config option). This introduces a bit of inconsistency between f_v(dirty-tree) and f_v(clean-tree) but would allow you to accelerate the dev cycle.
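For illustration, that option might look like the f_v sketch above with a filter added; the k8s/ prefix and the idea of a config knob are hypothetical:

```python
import hashlib

def dirty_tree_version(source_tree: dict[str, bytes], exclude_prefix: str = "k8s/") -> str:
    # Hash everything except the manifest templates, so editing them no
    # longer changes the dirty-tree version (and so no rebuild is triggered).
    h = hashlib.sha1()
    for name in sorted(source_tree):
        if name.startswith(exclude_prefix):
            continue
        h.update(name.encode())
        h.update(source_tree[name])
    return h.hexdigest()[:12] + ".sha"
```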
As an aside, the reason this is a low impact issue for us is that we optimize our container builds to be quite fast, and we have some forge features to help with this. Are slow container builds exacerbating this issue for you?
@rhs Hi! Excuse me if I'm not following you correctly but:
This introduces a bit of inconsistency between f_v(dirty-tree) and f_v(clean-tree)
Could you just make it equivalent between dirty-tree and clean-tree when the only differences in those trees are ones ignored via forge's config file? In short, use <gitrevision>.git for a dirty tree if the changes are solely in ignored paths (= the kubernetes manifests dir in this specific case).
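Something like this rough sketch (all names here are hypothetical, not forge configuration):

```python
def version_for(dirty_paths: list[str], ignored_prefixes: list[str],
                git_revision: str, source_tree_sha: str) -> str:
    # Treat the tree as clean when every dirty path falls under an ignored
    # prefix (e.g. the kubernetes manifests dir), so the version stays
    # traceable to a commit.
    only_ignored = all(
        any(path.startswith(prefix) for prefix in ignored_prefixes)
        for path in dirty_paths
    )
    if not dirty_paths or only_ignored:
        return git_revision + ".git"
    return source_tree_sha + ".sha"
```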
@mumoshu That sounds like a reasonable idea to me. At least worth a try.
First of all, thank you @rhs for the detailed explanation. There are indeed a lot of things to consider here and I appreciate the complexity of it more now.
The only snag I can see with the suggestion above is if the container build itself depends on the manifest templates... which seems like a very uncommon use-case (running forge itself inside a container, for example). Perhaps this is a non-issue.
FWIW, regarding point 5: I do not see the value in deploying a container image without committing it. Creating named branches is easy, and point 4 (traceability) is very important, even in a development environment. Aren't points 4 and 5 incompatible with each other? If the k8s resource metadata points to the SHA of a dirty tree, it is not really traceable at all... right? Out of curiosity, what is the value of deploying containers built from uncommitted code?
I will continue pondering this and also try to speed up our container builds. I'm using multi-stage builds for certain components (with golang, for example: the production container is a tiny statically compiled binary on an Alpine base, but the intermediate build container has the go toolchain in it). Every build seems to rebuild the intermediate container from scratch. I will re-read the forge documentation on incremental builds and see if I can speed things up and make this less of an issue; I didn't see a way to apply it to that golang use case the first time I read it.
Also, for posterity: there is a very reasonable workaround available if the developer wants to assert that the container image(s) do not need to be rebuilt:

forge build manifests

will just do the templating of the manifests, and then

kubectl apply -f .forge/something.yml

for each manifest. This is the workaround I have been using for components with slow container builds. It would be nice if forge could figure this out on its own, though, since there's the possibility of human error here which could produce confusing results. Also, I'm pretty sure doing so breaks traceability, since the commit hash in the containers' tags will point to a commit that was not actually used for the deploy.
@blak3mill3r I wrote https://www.datawire.io/fast-builds-java-spring-boot-applications-docker/ for Java & Spring, but it has a more detailed explanation of the incremental builds feature in Forge. I'm not very familiar with golang but the general approach should work here as well.
@blak3mill3r As you say, I suspect having the container actually depend on the manifests would be pretty unusual, but if that ever comes up we could probably make it configurable, so I'm not super worried about that possibility.
Regarding 4 and 5 being incompatible, I think that's because I didn't actually include profiles in the requirements which is yet another thing to consider of course. ;-)
If you haven't read about them, you can read a bit here: https://forge.sh/docs/reference/profiles
The way we use forge, we have stable, canary, and then dev profiles. We have our container builds set up to take no more than a few seconds. This is easy for python apps, and for our java-based apps we use the rebuilder stuff @richarddli mentions to make this work.
So for our stable and canary profiles, we emphasize 4 and never use 5, but for our dev profile we depend heavily on 5 for productivity since it lets us test and debug our code in a real environment without waiting for a full CI pipeline to execute.
You're absolutely right that you never care about both for a single profile, and in fact one of the things in my queue is to add the ability to define a policy as part of the profile that would prevent you from doing dirty pushes for e.g. stable and canary. That way we can hopefully have the best of both worlds when it comes to safety and productivity. ;-)
That makes a lot of sense, thank you @rhs and @richarddli ... when I get around to it I will read what you did for Spring and see if I can make something similar work for both Golang and Clojure container builds. I'll report back here when I do that.
When you edit something in the k8s directory (or possibly another directory that isn't used by your container), forge still rebuilds the container. This can be a bit surprising because generally the k8s manifests are not used by the container builds even though hypothetically they could be.
Given that you generally edit source code much more often than manifests, this may not be a super high impact issue, but it's worth capturing in an issue in case it surprises a lot of people.
There isn't an obvious fix in forge for this due to the following issues.
First, it is difficult for forge to know which parts of the source tree the Dockerfiles actually end up touching.
Second, even if forge could figure out that the k8s manifests were not used by the container, for traceability we want a single git sha that identifies both the k8s manifests and the service source that is intended to work with those manifests. To achieve this we would need to relabel an already-built container so we have a matching container/manifest pair. Given an appropriately written Dockerfile (using a .dockerignore file, or simply not copying the whole tree into the container), the container rebuild should be equivalent to a relabel: all the layers would be cached, and the only real work done by the build would be the tagging.
Right now I suspect the best fix for this issue would be to supply some best-practice documentation and project templates with appropriate Dockerfiles. If we see a pattern there we can exploit, we might be able to do a forge-level fix, e.g. have forge check for and/or create/modify a .dockerignore file.