ishitatsuyuki opened this issue 6 years ago
Thanks a lot for this suggestion! As mentioned on IRC, we include sources and pre-compiled artifacts in Docker images because it has the following advantages: new containers come with everything already cloned and pre-built (if you run ninja -C out/Default in a fresh Chrome container, it will even say "no work to do").
Having things cloned beforehand is good, but writing to them isn't. There's notable overhead when writing to a CoW overlay, compared to Docker volumes.
Interesting, thanks for this insight. Could you please elaborate on how significant this overhead is? Is there a particularly write-intensive workflow that you find too slow on Janitor today because of this?
clone the source and artifacts to a different directory, then copy them on startup with an entrypoint script.
This would instantly remove the benefits of Copy-on-Write, by having every container store 100% of its source files and pre-compiled binaries separately from every other container, right? (So if we have a 10GB checkout with 10GB pre-compiled binaries in a Docker image, 100 new containers would instantly fill 2TB of disk space)
do not do source-related work in Dockerfiles, instead have a separate step for it (docker build doesn't handle volumes yet, and we can't expect it in the near future either)
I have the intuition that we'll have to somehow remove the "workspace" from our current Docker containers. Because old containers keep large old images from being garbage collected, we want to delete containers as soon as possible, and if we're able to extract 100% of the "valuable state" from a given container (i.e. any user-made changes like configurations, commits, branches, uncommitted work/source file changes) then we can delete a container and restore it at will, freeing up a lot of resources on our infrastructure.
However, I have no idea what the best design for such a "removed workspace" or "extracted valuable state" would look like. E.g. a mounted volume? A database of user-made changes? A private VCS branch along with uncommitted changes that can be restored at will?
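To make the VCS-based option above a bit more concrete, here is a rough sketch using plain Git commands (the paths and backup location are made up, and Mercurial or monorepo projects would need their own equivalent):
# Capture the user's commits, branches and uncommitted edits before deleting the container (hypothetical paths):
git -C /home/user/project bundle create /backup/work.bundle --branches
git -C /home/user/project diff HEAD > /backup/uncommitted.patch   # untracked files would still need separate handling
# Later, restore them into a fresh container created from the same image:
git -C /home/user/project fetch /backup/work.bundle 'refs/heads/*:refs/remotes/backup/*'
git -C /home/user/project apply /backup/uncommitted.patch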
To me, the most important aspects of such a thing would be:
I dug a bit into userspace-level dedup with git clone --reference.
While this works very well for submodules (just reference a recursive non-bare clone and you can make a clone instantaneously), most projects currently in Janitor either don't use Git (Firefox) or use a home-grown monorepo tool, making integration hard.
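For illustration, the pattern looks roughly like this (the repository URL and paths are placeholders, not actual Janitor projects):
# Keep one shared clone around as the read-only reference (placeholder URL and paths):
git clone --recursive https://example.org/project.git /srv/reference/project
# New clones borrow objects from the reference instead of duplicating them:
git clone --reference /srv/reference/project https://example.org/project.git ~/workspace/project
du -sh /srv/reference/project/.git ~/workspace/project/.git   # the workspace .git stays tiny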
Another obstacle is compilation artifacts. ccache supports only one cache directory, which means we can't make some read-only fallback.
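For context, ccache picks its single cache directory via CCACHE_DIR (the path below is just an example), so there is no built-in way to chain a second, read-only cache behind it:
# ccache reads and writes exactly one cache directory, selected via CCACHE_DIR
export CCACHE_DIR=/volumes/ccache
ccache --show-stats   # statistics for that single directory only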
Here is a benchmark that @ishitatsuyuki made to prove his point:
docker run -it --rm -v /tmp:/mnt janx/thunderbird /bin/bash   # bind-mount the host's /tmp at /mnt to stand in for a Docker volume
cp -ar $PWD /mnt/thunderbird   # copy the pre-built tree onto that mount
# First, on the CoW overlay (the container's own filesystem):
find . > /dev/null
find . > /dev/null   # run find twice to warm the filesystem metadata caches
time ./mozilla/mach clobber   # delete the build artifacts
# Then, the same thing on the volume copy:
cd /mnt/thunderbird/
find . > /dev/null
find . > /dev/null
time ./mozilla/mach clobber
Results:
13:14:20 ishitatsuyuki> 4m59s on CoW
13:15:20 ishitatsuyuki> 24s on volume
13:15:33 ishitatsuyuki> So yes, roughly 10x overhead
I have personally settled on pre-warming volumes beforehand, so we don't need to wait for them to be copied when a container is created. We can have a volume pool of configurable size to achieve this.
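As a rough sketch of what that pre-warming could look like (the pool size, volume names and source path inside janx/thunderbird are assumptions):
# Pre-create and fill a few volumes ahead of time:
for i in 1 2 3; do
  docker volume create thunderbird-pool-$i
  docker run --rm -v thunderbird-pool-$i:/mnt janx/thunderbird cp -ar /home/user/thunderbird/. /mnt/
done
# A new workspace container then claims a pre-warmed volume instead of copying at creation time:
docker run -it -v thunderbird-pool-1:/home/user/thunderbird janx/thunderbird /bin/bash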
Thanks a lot for updating your plan, and for seeking valuable performance improvements! Here are a few personal thoughts on this update.
Having things cloned beforehand is good, but writing to them isn't. There's notable overhead when writing to a CoW overlay, compared to Docker volumes.
While I agree that there is notable overhead with our current CoW overlay, this hasn't been a frequent user complaint so far, and I don't see this as a problem in our current operations:
We will decouple the Git objects into a separate read-only volume to avoid duplication.
Please note that not all Janitor projects use Git. Many Mozilla projects use Mercurial, and some projects like Chromium use their custom source syncing tools (e.g. fetch).
Also, I only see limited value in decoupling Git objects alone. For example, the ./mozilla/mach clobber overhead you mention is due to build artifacts being coupled to Docker's CoW, not Git objects (the Thunderbird image doesn't even use Git), and the biggest disk space overhead in our current images is not Git objects (42.3MB layer in the Chromium image) but pre-compiled artifacts (26.6GB layer in the Chromium image), multiplied by how many old trees are kept alive by old user containers.
Does your solution de-couple build artifacts? And if so, does it reduce the disk space overhead caused by old pre-compiled source trees from old containers?
We will be using a native Copy-on-Write filesystem to achieve low write overhead. OverlayFS is extremely bad at deleting files
Please note that we use a variety of OSes in our community-backed Docker servers (e.g. Debian, Ubuntu, Amazon Linux, ...) which might not all have good native Copy-on-Write filesystems available, nor is it always possible/easy to create dedicated CoW partitions in these servers.
Additionally, here you analyze OverlayFS, but we also use a variety of Docker storage drivers in our servers (e.g. OverlayFS, AUFS, DeviceMapper, ...) which may not all share the same performance aspects.
We will decouple the runtime environment itself from the build artifacts. This way, we can apply system and toolchain updates without destroying the current working tree.
De-coupling the system from users' working trees is a nice idea, and that's what Cloud9 IDE was doing for their (very small) user workspaces.
However, I'm afraid that it increases operational complexity significantly, while we're just a small team of part-time volunteers. Additionally, while it would solve the problem of upgrading the users' system and toolchains without disrupting their current working trees, it doesn't solve the problem of old working trees taking up all our disk space (we'd be shifting the disk space problem from old Docker images to old volumes, without solving it).
Union filesystem based | CoW filesystem based
I find these approach names ambiguous and confusing. Could you please append a clear "(current approach)" or "(suggested new approach)" hint to the column names?
Refactor the dockerfiles repo so that build scripts are decoupled from Dockerfile
As mentioned on IRC, please keep this involved refactoring in a separate branch for now. I'd like to avoid increasing our Dockerfiles' complexity, especially for experimental changes.
Implement build-after-pull and stop relying on CI for build
To me, this is a step back. We used to build on-premises, taking the large performance hit, and validating images ourselves (e.g. build failures were detected after a pull, not before).
Moving to CI builds greatly simplified our life (we now continuously rebuild images in the background, without taking a performance hit, and we can validate new commits and pull requests before pulling them). Please keep project builds outside of our Docker servers if possible.
At this point, we can finally put all of the things into production
Before this step, we need to take a hard look at what performance wins this approach is yielding (weighted by the value it brings to users, e.g. better noVNC latency is a much bigger win than 10x faster clobber times), and what costs we are paying for them in terms of complexity (more software to maintain, more steps in our deployments and maintenance efforts, generally more moving parts that can lead to more & trickier bugs than simple monolithic Docker containers). If the benefits are not overwhelmingly superior to the costs, then I'd vote against this approach.
Thanks again for championing this very interesting experiment! I'm really looking forward to knowing more about gains vs costs here.
Please note that not all Janitor projects use Git. Many Mozilla projects use Mercurial, and some projects like Chromium use their custom source syncing tools (e.g. fetch).
I admit that Mercurial doesn't support this pattern. In Chromium this is likely viable though, by using some tricks for depot_tools.
and the biggest disk space overhead in our current images is not Git objects (42.3MB layer in the Chromium image) but pre-compiled artifacts (26.6GB layer in the Chromium image) multiplied by how many old trees are kept alive by old user containers.
The Git folder itself also weighs about 10GB; deduplicating it is a 20% improvement, which is not bad.
Does your solution de-couple build artifacts? And if so, does it reduce the disk space overhead caused by old pre-compiled source trees from old containers?
Build artifacts are not stored in the read-only volume. However, CoW allows dirty rebuilds, which is where we should see improvements.
To me, this is a step back. We used to build on-premises, taking the large performance hit, and validating images ourselves (e.g. build failures were detected after a pull, not before).
Maybe we can run a CI with some merge-gating bot like bors-ng. This comes with a maintenance cost, but it will vastly improve the CI feedback time.
Having things cloned beforehand is good, but writing to them isn't. There's notable overhead when writing to a CoW overlay, compared to Docker volumes.
With the following method, we focus on both space savings and I/O speedup:
We will decouple the Git objects into a separate read-only volume to avoid duplication. The reason for doing this is that if a repository is modified repeatedly, Git cannot pack the objects in a deterministic way that the filesystem can deduplicate. Instead, we will be using git clone --reference so that the .git directory only contains the objects that the user created. (I have verified that this will also deduplicate objects when we fetch upstream.) See the sketch after this list.
We will be using a native Copy-on-Write filesystem to achieve low write overhead. OverlayFS is extremely bad at deleting files, as it amplifies a write that would normally only touch the inode/dentry into writes to metadata blocks, which translates to roughly 10x slowness when doing make clobber.
We will decouple the runtime environment itself from the build artifacts. This way, we can apply system and toolchain updates without destroying the current working tree.
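As a rough sketch of the first point above (the volume, image and path names here are invented, and we assume the image ships git):
# Populate a shared volume once with a full reference clone (hypothetical names):
docker volume create git-objects
docker run --rm -v git-objects:/ro janx/chromium git clone https://chromium.googlesource.com/chromium/src /ro/src
# Each workspace container mounts that volume read-only and clones against it:
docker run -it -v git-objects:/ro:ro -v workspace1:/data janx/chromium /bin/bash
git clone --reference /ro/src https://chromium.googlesource.com/chromium/src /data/src
# /data/src/.git now only holds objects the user creates; everything else is read through /ro/src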
Focusing on the CoW part, we will be using a few approaches to allow people (including us) without a native CoW filesystem to run Janitor:
janitor-production
Build from a fresh layer
Tag janitor-production
Perform dirty build if enabled
Mount both ro and data volume
These features will be specific to CoW backends: dirty rebuild, and upgrades without deleting the working tree. We will add a flag in the Janitor application so that these features are not shown in the UI when not available.
Next, the migration plan:
Refactor the dockerfiles repo so that build scripts are decoupled from the Dockerfiles