docker / build-push-action

GitHub Action to build and push Docker images with Buildx
https://github.com/marketplace/actions/build-and-push-docker-images
Apache License 2.0

Slow image loading with `load: true` #998

Closed PaulMorrisPP closed 4 months ago

PaulMorrisPP commented 8 months ago

Description

I'm not sure if this is a bug or a feature request, but we're seeing image load times much larger than we would expect from a normal build. We're using this action to build one of our Docker Compose images manually, specifically because we want to leverage GHA caching. The whole setup works, but after buildx finishes its work and hands the image off to Docker, it builds a tarball and imports it into the image store. This import actually takes more time (77s) than building the image itself (60s).

This is a performance hit we would like to avoid, and we want to understand whether it's something we're doing wrong or a defect in the way the image is loaded (using `load: true`), because building large images does not normally take this long to load into the image store.

It seems to me we really just want to specify an uncompressed output, using something like `--output type=image,name=...,compression=uncompressed`, but it looks like this action does not support the compression option.
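Something like the sketch below is what I have in mind. It assumes the action's `outputs` input is forwarded to buildx's `--output` and that the exporter honors the `compression` option here; the image name is a placeholder.

```yaml
      # Sketch, not a tested workaround: assumes `outputs` is passed through
      # to `buildx --output` and that the image exporter accepts the
      # compression option. "myorg/myimage:ci" is a placeholder.
      - uses: docker/build-push-action@v5
        with:
          context: .
          outputs: type=image,name=myorg/myimage:ci,compression=uncompressed
```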

Partial build output

#18 exporting to docker image format
#18 exporting layers
#18 exporting layers 4.6s done
#18 exporting manifest sha256:9a5bb6d6c26ad15fae8c7f8d46b483f78d0caab77b4d167441bdbb9a1eb47fef done
#18 exporting config sha256:b4dc547a4f859a30c5397e88fc3ec0a6014ff432d988885120de7871b4853e1d done
#18 sending tarball
#18 ...

#19 importing to docker
#19 DONE 59.8s

#18 exporting to docker image format
#18 sending tarball 72.4s done
#18 DONE 77.0s
crazy-max commented 8 months ago

Slow image loading with load: true

Progress status for uploading the build result to the Docker store when using the container driver will be available in the next Buildx 0.12 release: https://github.com/docker/buildx/pull/1994

This performance hole is something we would like to avoid

Showing your workflow would help me understand, but I assume you're using setup-buildx-action and therefore a container builder. In that case it can take quite some time to load the image into the Docker store.
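For context, the driver comes from setup-buildx-action; here is a rough sketch of the two configurations (action versions are illustrative):

```yaml
      # docker-container is the default driver for setup-buildx-action and is
      # the case where buildx has to send a tarball back to the Docker store
      # when load: true is set.
      - uses: docker/setup-buildx-action@v3
        with:
          driver: docker-container

      # The docker driver builds directly in the Docker store, so there is no
      # separate load step, but it does not support the gha cache backend.
      # - uses: docker/setup-buildx-action@v3
      #   with:
      #     driver: docker
```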

PaulMorrisPP commented 8 months ago

Correct, but why? The internal Docker builder loads images quickly, presumably because it uses some internal routine to transfer the image. Why is the container builder so much slower?

kwikwag commented 7 months ago

:+1: My use case: in a GitHub Action, the build has to download a 6 GB file from Hugging Face. Downloading that file takes considerably less time than downloading the cache. Specifying `load: true` is also very slow. Building the image from scratch takes about 3m, loading from cache takes about 8m, and building plus saving to cache (the initial commit with build-push-action) takes 13m. When I use `outputs: type=image,push=false` the build takes 15s(!) but the image is not in `docker image ls`. Counterintuitive. See also my Stack Overflow post.

kwikwag commented 7 months ago

FWIW I created a sample repo with GitHub Actions workflows that demonstrate it. The idea is to create a large layer and measure build performance in three ways: the Docker build action with the gha cache, the Docker build action without cache, and a plain `docker buildx build` command in bash. I tried a few different scenarios, and after each one I triggered an additional run so I could see how the cache is used. Here's the list of scenarios I tested:

| scenario | stage | action with cache | action without cache | no action, inline build |
| --- | --- | --- | --- | --- |
| Generate 500Mb file (not from network) | first commit | 51s | 46s | 19s |
| | re-trigger | 34s | 42s | 19s |
| Generate 2Gb file (not from network) | first commit | 174s | 133s | 63s |
| | re-trigger | 113s | 133s | 63s |
| Add stress | first commit | 216s | 176s | 98s |
| | re-trigger | 109s | 180s | 105s |
| Download 1.5Gb file | first commit | 147s | 112s | 33s |
| | re-trigger | 106s | 100s | 38s |
| Download 1.5Gb + split into many files | first commit | 117s | 112s | 43s |
| | re-trigger | 93s | 111s | 43s |

From what I can tell, the actions are simply limited by I/O. The only scenario where the cache improves performance is when stressing the CPU during the build. The rest show a 20% gain at most. This still seems counterintuitive, as the most common scenario for a Dockerfile is installing the dependencies required to run.

Also, still surprising to me: running the build as a plain `docker buildx build` in a bash step is still the quickest, by a factor of 2-3x, which means the majority of the I/O bottleneck is in how the build action step is set up. It adds a lot of I/O, and that is the limiting factor.
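For completeness, the "no action, inline build" scenario is essentially just a plain `run` step, roughly like this (the image name is a placeholder, not the exact command from my repo):

```yaml
      # Rough equivalent of the "no action, inline build" scenario: default
      # docker driver, no gha cache, and the image ends up directly in the
      # local store with no separate load step.
      - name: Inline build
        run: docker buildx build -t test-image:local .
```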

kwikwag commented 7 months ago

Another observation: this is probably because of the docker-container driver. It effectively triples the I/O required to build the image when using `--load`. A solution would be either to reduce the I/O required for `--load`, or to allow the gha cache to be used with the docker driver.

kwikwag commented 7 months ago

And just a bit more: a round trip to and from GHCR is much faster than `--load`, which again is counterintuitive. See this workflow run.
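Roughly, that round trip looks like the sketch below: push to GHCR with the gha cache and pull the image back instead of using `load: true`. The image name is a placeholder, and a prior `docker/login-action` step against ghcr.io is assumed.

```yaml
      # Sketch of the GHCR round trip that was faster than --load in my runs.
      # Assumes a prior docker/login-action step for ghcr.io; the image name
      # is a placeholder.
      - uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ghcr.io/my-org/my-image:${{ github.sha }}
          cache-from: type=gha
          cache-to: type=gha,mode=max

      - name: Pull the image back for later steps
        run: docker pull ghcr.io/my-org/my-image:${{ github.sha }}
```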

kwikwag commented 7 months ago

@crazy-max Any ideas? Is there any way to use the GHA cache for Docker builds in GitHub Actions without paying a performance penalty? I can think of the following paths as possible solutions, but I am pretty oblivious to the inner workings: (1) somehow do direct I/O between the container and the host to avoid disk I/O, perhaps skipping the tarring and/or transfer, similar to what happens when using GHCR; (2) implement caching for the docker driver, so images are built directly on the host Docker; (3) add a special cache backend that writes directly to the host Docker, avoiding the need to tar and transfer the layers.

crazy-max commented 5 months ago

Correct, but why? The internal Docker builder is fast to load images, presumably because it uses some internal routine to transfer the image. Why is the container builder so much slower?

Because a build using the docker-container driver does not have direct access to the Docker store and therefore needs to load the build result back into it, whereas the docker driver has direct access.

ozancaglayan commented 1 month ago

Hi, I don't know why this was closed as completed. Is there a solution to this problem nowadays?

The fact that an image is not easily accessible from later steps when using the container driver can even be considered a regression. The container driver is the recommended driver, as it supports caching and multi-platform builds. If we switch to the classic driver for faster, accessible builds, we lose caching and multi-platform builds.

I'm really stuck with this, as there is apparently no reasonable way to achieve all of the above while using the container driver.

Thanks!

nivetaiyer commented 1 month ago

I'm having a similar issue. My image contains torch and other CUDA libraries that are built from scratch; it takes ~30 min to build without caching. With caching, the build time drops to ~2 min, but `load: true` takes an additional ~5 minutes, which is annoying. I'm also trying a workflow that does build -> test via `docker run` -> push to ECR if tests pass. It takes too long with `load: true`.
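For context, the flow I'm describing looks roughly like this sketch (registry URL, image names, and test command are placeholders; ECR authentication, e.g. via `aws-actions/amazon-ecr-login`, is assumed to happen earlier):

```yaml
      # Sketch of the build -> test -> push flow; load: true is the step that
      # adds the extra ~5 minutes. Registry, tags, and test command are
      # placeholders, and ECR login is assumed to have happened before this.
      - uses: docker/build-push-action@v5
        with:
          context: .
          load: true
          tags: my-image:ci
          cache-from: type=gha
          cache-to: type=gha,mode=max

      - name: Test the built image
        run: docker run --rm my-image:ci ./run-tests.sh

      - name: Push to ECR if tests pass
        run: |
          docker tag my-image:ci 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:ci
          docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:ci
```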