GoogleContainerTools / jib

🏗 Build container images for your Java applications.
Apache License 2.0

Jib should avoid parallel image downloads for same image #2007

Open briandealwis opened 5 years ago

briandealwis commented 5 years ago

Although jib is thread-safe, it should be smarter.

Jib doesn't currently lock the base image cache when downloading a base image; instead, it downloads into a temporary directory and then attempts to move the downloaded image into place. There is no locking to block other threads, so a Maven project with N modules that build images from the same base image (like gcr.io/distroless/java) may result in N simultaneous pulls of that image. Maybe we should provide a component to centralize downloading images?
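For illustration, here is a rough sketch of what cross-process coordination of the pull could look like, using an exclusive file lock on a per-image marker file inside the cache directory. This is not Jib's actual API; the class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: serialize base-image downloads across processes by
// holding an exclusive OS-level lock on a per-image lock file in the cache.
class BaseImagePullLock {

  static void withImageLock(Path cacheDir, String imageId, Runnable pull) throws IOException {
    Path lockFile = cacheDir.resolve(imageId.replaceAll("[^A-Za-z0-9._-]", "_") + ".lock");
    try (FileChannel channel =
            FileChannel.open(lockFile, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        FileLock lock = channel.lock()) { // blocks until no other process holds the lock
      // Re-check the cache here: another process may have finished the
      // download while this one was waiting for the lock.
      pull.run();
    }
  }
}
```

Note that java.nio file locks are held per process, so within a single JVM a thread-level lock would still be needed on top (a second lock attempt on the same file from the same JVM throws OverlappingFileLockException).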

Originally posted by @briandealwis in https://github.com/GoogleContainerTools/jib/issues/1904#issuecomment-521352655

raizoor commented 5 years ago

@briandealwis ,

In version >= 1.5.0, Jib verifies whether the base image already exists on the target registry and skips the download, doesn't it? If that's correct, a centralized folder for base images is only necessary when you don't use a registry.

TadCordle commented 5 years ago

That is true, but when you don't target a registry, Jib still caches base images locally to skip unnecessary pulls. So I think this issue is really only a problem for clean multi-module builds.

chanseokoh commented 5 years ago

And this only matters when you enable Maven parallel builds (-T <thread count>) or run multiple Jib processes simultaneously on the same host (and only when the base image layers have never been cached). It does not affect correctness even in those cases.

mpeddada1 commented 3 years ago

@chanseokoh assuming this can be closed after #2780?

chanseokoh commented 3 years ago

@mpeddada1 this issue is a different one. #2780 was only in the context of multi-arch support.

nhoughto commented 2 years ago

I've been trying to incorporate Jib into our build process, and one problem I'm encountering is that Jib doesn't seem to try to detect and de-dupe in-flight build work. So if N requests to build the same new image come in at the same time, it will do the work N times (because the base/application cache misses).

Obviously the cache directories exist to avoid redoing work that has already been done, but in the scenario where the work hasn't been done yet and many requests to do the same thing are being processed concurrently, there is potential for detecting and optimising this, I think?

My scenario is a monorepo with N images that share the base and first few layers, with only the last layer differing per image. A request to build all of them at once means Jib builds the base/shared layers N times rather than once, which makes things take a lot longer than doing the same thing under Docker, which de-dupes the work.

My plan was to coordinate the calls to Jib such that one build request goes in first, so the subsequent requests get cache hits, but that really just solves my problem, and I thought there might be a desire to solve this in Jib itself, as it's likely something others are hitting? This isn't just related to image downloads, but to the full set of things Jib might do.

There is a secondary question about how this might work with the Maven/Gradle plugins, as they make things a bit harder (they could be running in different JVMs and thus might not be able to just coordinate intra-JVM), but ignore that for now.
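As a rough illustration of the in-flight de-duping idea within a single JVM (e.g. when driving builds through jib-core from one process), something like the following could memoize identical work. The key and result types are placeholders, not a Jib interface.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

// Hypothetical sketch: share the result of identical in-flight work (e.g. a
// base image pull) instead of running it once per concurrent request.
class InFlightDeduper<K, V> {

  private final ConcurrentMap<K, CompletableFuture<V>> inFlight = new ConcurrentHashMap<>();

  // The first caller for a given key starts the work; later callers for the
  // same key get the same future. Entries are kept for the life of the process,
  // which is fine for a single multi-module build.
  CompletableFuture<V> run(K key, Supplier<V> work) {
    return inFlight.computeIfAbsent(key, k -> CompletableFuture.supplyAsync(work));
  }
}
```

A caller would then do something like deduper.run(baseImageReference, () -> pullBaseImage(baseImageReference)), where pullBaseImage is a stand-in for whatever actually performs the pull.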

chanseokoh commented 2 years ago

@nhoughto thanks for the feedback. As mentioned in this issue, what you said makes sense, and the current behavior is not ideal. It's just that addressing it would be a pretty complex task for multiple reasons, and it may require a substantial overhaul of the async infrastructure (unless there are some quick hacks that enable a few relatively easy performance enhancements).

More importantly, I remember giving this some thought a long time ago, and de-duping in-flight build work amounts to a fundamental design change that brings both pros and cons. Notably, in the current implementation all the lines of parallel work are very fine-grained, which enables great parallelism. For example, Jib can start uploading parts of the base image or application binaries while it is still downloading other parts of the base image or still building other application layers. So, in many cases this can be much faster than centralized locking that delays/blocks all threads until Jib verifies that all the layers these threads download or build are eventually going to be identical. The issue really needs deep insight, although there might be some easy areas for improvement. Unfortunately, we have other priorities, and improving this aspect is not on our roadmap.

nhoughto commented 2 years ago

Thanks for the insight, I'll look at putting something around Jib then.

nhoughto commented 2 years ago

Digging into my scenarios for this ticket more, the slowness I was seeing was really due to the DockerDaemonImage behaviour more than anything. The base and app caches are pretty effective at everything other than the first run, but the Docker daemon integration is very wasteful if you are coming from a docker build .. world and switching to Jib.

Because DockerDaemonImage essentially ends up just calling docker load -i image.tar, there is zero de-dupe / cache checking like there would be if you did docker build .., and the full image is passed to the Docker daemon each time. In my scenario, where N containers sharing most of their layers are built concurrently, this results in zero layer reuse and N full tarballs pushed to the daemon, making even small changes incur a large penalty. Not much Jib can do about that, but maybe add a note on DockerDaemonImage that it is functional but has some caveats.

To work around it, I've effectively written a JibContainerBuilder -> Dockerfile serialiser, so I'm back to indirectly using docker build . where required and Jib everywhere else (from one JibContainerBuilder definition).

elefeint commented 2 years ago

Thank you for the investigation and detailed notes!

chanseokoh commented 2 years ago

docker run -d -p 5000:5000 --restart always --name registry registry:2

Because DockerDaemonImage ends up just calling docker load -i image.tar essentially there is zero dedupe / cache checking like there would be if you did docker build .. and the full image is passed to the docker daemon each time

Yeah, this is very unfortunate. Unlike container registries, the Docker daemon (Docker Engine API) is very limited in that it doesn't provide a way to check, pull, or push individual layers. It's something we've long lamented. OTOH, pushing to a registry with Jib is super fast due to Jib's strong reproducibility, so spinning up a local registry could be an option, which is as easy as docker run -p 5000:5000 registry:2.
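For reference, a rough jib-core sketch of containerizing to such a local registry instead of the Docker daemon. The image names and the layer path are placeholders; setAllowInsecureRegistries is needed because a plain registry:2 speaks HTTP without TLS.

```java
import com.google.cloud.tools.jib.api.Containerizer;
import com.google.cloud.tools.jib.api.Jib;
import com.google.cloud.tools.jib.api.RegistryImage;
import java.nio.file.Paths;
import java.util.Arrays;

public class LocalRegistryPush {
  public static void main(String[] args) throws Exception {
    // Push to the registry started with docker run -d -p 5000:5000 registry:2.
    // Unlike docker load, the registry skips layers it already has.
    Jib.from("gcr.io/distroless/java")
        .addLayer(Arrays.asList(Paths.get("target/app.jar")), "/app") // placeholder layer
        .containerize(
            Containerizer.to(RegistryImage.named("localhost:5000/my-app:latest"))
                .setAllowInsecureRegistries(true)); // the local registry has no TLS
  }
}
```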

nhoughto commented 2 years ago

More findings on the actual original issue here, parallel downloads of the same base image: this is actually causing me failures rather than just wasting cycles/bandwidth. Specifically, against Docker Hub with a parallelism of 8 (and more than 8 projects sharing the same base image), some of the base image downloads would fail with an UNAUTHORIZED error whilst others would succeed, leading to an overall failure of the build. Once the base image is there, obviously this isn't a problem, but when it's not it causes the build to fail consistently.

I'm not sure whether this is a Docker Hub API thing, rate limiting or similar, or a Jib race condition around authentication (is anything shared across threads/instances in Jib?). The logs don't show any errors relating to rate limiting, so it feels a bit like a Jib race condition.

Either way, it seems like there are meaningful problems with letting Jib download the base image in parallel, and doing some de-duping would solve both problems. As a workaround, at the moment I'm doing some locking before calls to Jib to ensure each base image is only pulled once, but it would be nice if this was solved upstream 👍
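A minimal sketch of that caller-side workaround, assuming the builds run in one JVM; the locking wraps whatever code invokes Jib, and all names here are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: take a per-base-image lock around each Jib invocation so
// only one build pulls a given base image at a time; later builds hit the cache.
class PerBaseImageLocks {

  private final ConcurrentMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

  void withLock(String baseImage, Runnable jibBuild) {
    ReentrantLock lock = locks.computeIfAbsent(baseImage, k -> new ReentrantLock());
    lock.lock();
    try {
      jibBuild.run();
    } finally {
      lock.unlock();
    }
  }
}
```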

meltsufin commented 2 years ago

@nhoughto Would you be interested in making a contribution for this?

nhoughto commented 2 years ago

yep can do, any tips on preferred approach?

chanseokoh commented 2 years ago

You can see detailed registry interactions by enabling HTTP debug logging. (I think you can still keep -Djib.serialize=true, since I believe what you are doing is running multiple Jib builds concurrently.) That way, you should be able to pinpoint where and when exactly the server returns UNAUTHORIZED and what auth information was given to the server.
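In case it's useful, one way to surface that HTTP traffic when driving jib-core in-process is to raise the java.util.logging level of the underlying HTTP client. This assumes Jib uses google-http-java-client, whose com.google.api.client.http.HttpTransport logger emits request/response details at CONFIG level; for the Maven/Gradle plugins you would instead pass an equivalent logging.properties file to the JVM.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

class HttpDebugLogging {
  // Keep a strong reference so the logger configuration is not garbage-collected.
  private static final Logger HTTP_LOGGER =
      Logger.getLogger("com.google.api.client.http.HttpTransport");

  static void enable() {
    ConsoleHandler handler = new ConsoleHandler();
    handler.setLevel(Level.CONFIG); // request/response lines are logged at CONFIG
    HTTP_LOGGER.setLevel(Level.CONFIG);
    HTTP_LOGGER.addHandler(handler);
  }
}
```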

BTW, as I alluded to in https://github.com/GoogleContainerTools/jib/issues/2007#issuecomment-993809023, this isn't something we can jump into working on.

chanseokoh commented 2 years ago

@nhoughto apologies, it's been a while, but your observation was keen and correct. I totally forgot the following known issue:

https://github.com/GoogleContainerTools/jib/blob/7effb0dff286c224a69f3c6fc8a94ea2809fbe2f/jib-core/src/main/java/com/google/cloud/tools/jib/builder/steps/PullBaseImageStep.java#L137-L140

According to another user's analysis, it seems that you may run into the UNAUTHORIZED issue when a base image is not yet cached and there are parallel downloads going on.