Closed PaulMorrisPP closed 4 months ago
Slow image loading with `load: true`
Status of uploading the build result to the Docker store when using the container driver will be available in the next Buildx 0.12 release: https://github.com/docker/buildx/pull/1994

> This performance hole is something we would like to avoid
Showing your workflow would help us understand, but I assume you're using the setup-buildx-action and therefore a container builder. In that case it can take quite some time to load the image into the Docker store.
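For context, the setup that hits this path looks roughly like the following (a sketch; the tag and action versions are illustrative):

```yaml
- uses: docker/setup-buildx-action@v3   # creates a docker-container builder
- uses: docker/build-push-action@v5
  with:
    load: true          # exports the result and imports it back into the Docker store
    tags: myapp:test    # illustrative tag
    cache-from: type=gha
    cache-to: type=gha,mode=max
```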
Correct, but why? The internal Docker builder is fast to load images, presumably because it uses some internal routine to transfer the image. Why is the container builder so much slower?
:+1: My use case: in a GitHub Action, the build has to download a 6Gb file from Huggingface. Downloading that takes considerably less time than downloading the cache. Then, specifying `load: true` is also very slow. Building the image from scratch takes about 3m, while loading from cache takes about 8m, and doing the build and saving to cache (initial commit with `build-push-action`) takes 13m. When I use `outputs: type=image,push=false`, the build takes 15s(!) but the image is not in `docker image ls`. Counterintuitive. See also my Stackoverflow post.
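For reference, the fast-but-inaccessible variant described above would look something like this (a sketch; the tag is illustrative):

```yaml
- uses: docker/build-push-action@v5
  with:
    outputs: type=image,push=false   # finishes quickly, but nothing appears in `docker image ls`
    tags: myapp:test
    cache-from: type=gha
    cache-to: type=gha,mode=max
```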
FWIW I created a sample repo with GitHub actions that demonstrates it. The idea is to create a large layer and check performance when building the image in one of three ways: the Docker build action with `gha` cache, the Docker build action without cache, and a `docker buildx build` command in bash. I tried a few different scenarios, and after each of them I triggered an additional run so that I could see how the cache is used. Here's the list of the scenarios I tested:
| scenario | stage | action with cache | action without cache | no action, inline build |
|---|---|---|---|---|
| Generate 500Mb file (not from network) | first commit | 51s | 46s | 19s |
| | re-trigger | 34s | 42s | 19s |
| Generate 2Gb file (not from network) | first commit | 174s | 133s | 63s |
| | re-trigger | 113s | 133s | 63s |
| Add stress | first commit | 216s | 176s | 98s |
| | re-trigger | 109s | 180s | 105s |
| Download 1.5Gb file | first commit | 147s | 112s | 33s |
| | re-trigger | 106s | 100s | 38s |
| Download 1.5Gb + split into many files | first commit | 117s | 112s | 43s |
| | re-trigger | 93s | 111s | 43s |
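For clarity, the "no action, inline build" column corresponds to a plain CLI build in a `run:` step, roughly like this (a sketch; the tag is illustrative):

```yaml
- name: Inline build
  run: docker buildx build -t myapp:test .
```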
From what I can tell, the actions are simply limited by I/O. The only scenario where the cache improves performance is when stressing the CPU during the build. The rest show a 20% gain at most. This still seems counterintuitive, as the most common scenario for a Dockerfile is installing the dependencies required to run.
Also, still surprising to me: running the Docker build as part of a bash script is the quickest by a factor of 2-3x, which means the majority of the I/O bottleneck is in how the build action step is set up. It adds a lot of I/O, which is the limiting factor.
Another observation: this is probably because of the use of the `docker-container` driver. It effectively triples the I/O required to build the image when using `--load`. A solution would be either to reduce the I/O required for `--load`, or to allow `gha` cache use with the `docker` driver.
And just a bit more: running to-and-from GHCR is much faster than `--load`, which again is counterintuitive. See this workflow run.
@crazy-max Any ideas? Is there any way to utilize the GHA cache for Docker builds in GitHub Actions without paying a performance penalty? I can think of the following paths as possible solutions, but I am pretty oblivious to the inner workings: (1) somehow do direct I/O between the container and the host to avoid disk I/O, perhaps skipping the tarring and/or transfer, similar to what happens when using GHCR; (2) implement caching for the `docker` driver, so images are built directly on the host Docker; (3) add a special cache that writes directly to the host Docker store, avoiding the need to tar and transfer the layers.
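Given the observation that the registry round-trip beats `--load`, one possible workaround (a sketch, assuming pushing to and pulling from GHCR stays faster in your setup; the tag and test command are illustrative) is to push the image and run it from the registry in a later step instead of using `load: true`:

```yaml
- uses: docker/setup-buildx-action@v3
- uses: docker/login-action@v3
  with:
    registry: ghcr.io
    username: ${{ github.actor }}
    password: ${{ secrets.GITHUB_TOKEN }}
- uses: docker/build-push-action@v5
  with:
    push: true
    tags: ghcr.io/${{ github.repository }}:ci-${{ github.sha }}
    cache-from: type=gha
    cache-to: type=gha,mode=max
- name: Test the image pulled from GHCR
  run: docker run --rm ghcr.io/${{ github.repository }}:ci-${{ github.sha }} ./run-tests.sh
```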
> Correct, but why? The internal Docker builder is fast to load images, presumably because it uses some internal routine to transfer the image. Why is the container builder so much slower?
Because a build using the `docker-container` driver does not have direct access to the Docker store and therefore needs to load the build result back into it, whereas the `docker` driver has direct access.
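In workflow terms, the difference looks roughly like this (a sketch; tags are illustrative):

```yaml
# docker driver (no setup-buildx-action step): BuildKit runs inside the
# daemon, so the result lands in the Docker store directly -- but the
# gha cache backend is not available.
- uses: docker/build-push-action@v5
  with:
    tags: myapp:test

# docker-container driver: BuildKit runs in a separate container, so
# `load: true` must export a tarball and stream it back into the daemon.
- uses: docker/setup-buildx-action@v3
- uses: docker/build-push-action@v5
  with:
    load: true
    tags: myapp:test
```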
Hi, I don't know why this was closed as `completed`. Is there a solution nowadays for this problem?
The fact that an image is not easily accessible from later steps using the `container` driver can even be considered a regression. The container driver is the recommended driver, as it supports caching and multi-platform builds. If we switch to the classical driver for faster and accessible builds, we lose the caching and multi-platform builds. I'm really stuck with this, as apparently there's no reasonable solution that achieves both while using the `container` driver.
Would exporting to a `tar` file and then using `docker import` be faster than `load: true`, or are they the same thing? Thanks!
I'm having a similar issue. My image contains torch and other CUDA libs that are built from scratch; it takes ~30 min to build without caching. With caching, build time drops to ~2 min, but `load: true` takes an additional ~5 minutes, which is annoying. I'm also trying a workflow that does build -> test via `docker run` -> push to ECR if tests pass. It takes too long with `load: true`.
Description
I'm not sure if this is a bug or a feature request, but we're seeing large image load times that we wouldn't expect to see when building normally. That is, we're using this action to build one of our Docker Compose images manually, specifically because we want to leverage GHA caching. The whole setup works, but after buildx finishes its work and hands off the image to Docker, it builds a tarball and imports it into the image store. This import operation actually takes more time (77s) than building the image itself (60s).
This performance hole is something we would like to avoid, and we want to understand whether it's something we're doing wrong or just a defect in the way the image is loaded (using `load: true`), because it certainly seems to be the case that building large images does not normally take this long to load into the image store. It seems to me we really just want to be able to specify an uncompressed output using something like `--output type=image,name=...,compression=uncompressed`, but it looks like this action does not support the compression option.

Partial build output