Open wking opened 5 years ago
@mtrmac PTAL. I figure this would have to be in containers/image, since doing it for just Podman would still allow conflicts with CRI-O, Skopeo, and Buildah.
Well, this is where #NOBIGFATDAEMONS makes things more difficult :)
Maybe that involves a separate, separately-locked, map from digests to lock numbers, or something like that...
I was imagining flocks on ${PATH_TO_EVENTUAL_BLOB_FILE}.lock or some such. While the lock space is large, there will probably never be more than a few dozen acquired at any one time.
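For illustration, here is a minimal Go sketch of that per-blob flock idea; the path layout, helper name, and the main() usage are assumptions for the example, not the code that was later written for containers/storage.

```go
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// withBlobLock serializes work on a single blob via an exclusive flock on
// "<blob path>.lock". Whoever wins the lock runs fn (e.g. the download);
// everyone else blocks until it is released and then finds the blob present.
func withBlobLock(blobPath string, fn func() error) error {
	f, err := os.OpenFile(blobPath+".lock", os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return err
	}
	defer f.Close()

	if err := unix.Flock(int(f.Fd()), unix.LOCK_EX); err != nil { // blocks until free
		return err
	}
	defer unix.Flock(int(f.Fd()), unix.LOCK_UN)

	if _, err := os.Stat(blobPath); err == nil {
		return nil // another process finished this blob while we waited
	}
	return fn()
}

func main() {
	err := withBlobLock("/var/tmp/blobs/sha256-deadbeef", func() error {
		fmt.Println("pulling blob...")
		return nil
	})
	if err != nil {
		fmt.Println("error:", err)
	}
}
```

A nice property of flock here: the lock is released automatically when the file descriptor is closed, including on process death, so a crashed puller cannot leave a blob permanently locked.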
@vrothberg Do we have this now?
There was a very similar discussion recently over at https://github.com/containers/libpod/issues/2551.
@rhatdan, if you want this to work, we can work on that but we need to allocate a bigger chunk of time.
I am always interested in anything that improves performance.
I can start looking into this in the next sprint :+1:
@vrothberg Could you update this issue with your progress?
Sure. The blob-locking mechanism in containers/storage is done. Each blob, when being copied into containers-storage, will receive a dedicated lock file in the storage driver's tmp directory. That's the central mechanism for serialization and synchronization.
The progress-bar library received some backported features we needed to update the bars on the fly, and we're already making use of them.
Currently, we are working on rewriting the containers-storage backend code a bit in order to write the layer to storage in ImageDestination.PutBlob(...). Otherwise, the code is subject to ABBA deadlocks. The CIs for this rewrite are now green, so we can reuse this work for the copy detection.
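For readers unfamiliar with the term: an ABBA deadlock is two lock holders each waiting on the lock the other one holds. The usual fix is to acquire locks in one global order. The sketch below is a generic, in-process illustration of that fix, not the actual c/image or c/storage code.

```go
package main

import (
	"sort"
	"sync"
)

// In-process bookkeeping for the illustration: one mutex per blob digest.
var blobLocks sync.Map // digest -> *sync.Mutex

func lockFor(digest string) *sync.Mutex {
	m, _ := blobLocks.LoadOrStore(digest, &sync.Mutex{})
	return m.(*sync.Mutex)
}

// lockAll takes the locks for a set of digests in sorted order. Because every
// caller follows the same order, two pulls that need overlapping blobs can
// never end up in the A-then-B vs. B-then-A cycle that causes an ABBA deadlock.
func lockAll(digests []string) (unlock func()) {
	sorted := append([]string(nil), digests...)
	sort.Strings(sorted)
	for _, d := range sorted {
		lockFor(d).Lock()
	}
	return func() {
		for i := len(sorted) - 1; i >= 0; i-- {
			lockFor(sorted[i]).Unlock()
		}
	}
}

func main() {
	unlock := lockAll([]string{"sha256:bbb", "sha256:aaa"})
	defer unlock()
	// ... copy blobs while holding the locks ...
}
```

The real code avoids the problem differently, by restructuring where layers are written (in PutBlob), as described above.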
Note that I lost at least one working week cleaning up breaking builds and tests (unrelated to those PRs) when trying to test the PRs in buildah, libpod and cri-o (and had to do this multiple times).
What is the latest on this, @vrothberg?
Plenty of tears and sweat over here: https://github.com/containers/image/pull/611
It's working and CI is green but it turned into something really big, so I expect reviewing to still take a while.
Going to close; reopen if needed.
Reopening as it's still valid and the PR in c/image has stalled.
@baude @rhatdan if that's still a desired feature, we should revive https://github.com/containers/image/pull/611 and prioritize it or get it on our radar again.
Yes I think this is something we should fix.
This issue had no activity for 30 days. In the absence of activity or the "do-not-close" label, the issue will be automatically closed within 7 days.
Still something desirable but no progress. I'll add the label.
Ping to bring this back to people's consciousness.
Needs a priority :^)
cri-o/cri-o#3409 fixed this issue one level up, for folks who are coming at this down that pipe. As mentioned there, there will be continued work on getting a fix down lower in the stack for other containers/image consumers, but fixing at those levels is more complicated.
I think this should start moving up the priority list, now that APIv2 has settled down.
@mtrmac PTAL
> I think this should start moving up the priority list, now that APIv2 has settled down.
Something for planning. The work required to get that done is comparatively high. We worked on it for a while last year, but priorities changed.
A friendly reminder that this issue had no activity for 30 days.
A friendly reminder that this issue had no activity for 30 days.
This is still necessary but still getting kicked down the road.
@umohnani8 @baude We need to bump this up the priority list.
> @umohnani8 @baude We need to bump this up the priority list.
:+1: I'd love to get this done. We talked about this issue during the last team meeting. There are two parts to get it done:

1) Write layers to containers-storage while the pull is running (the ImageDestination.PutBlob(...) rework mentioned above) instead of committing everything only at the end of the copy.
2) Detect that another process is already copying a given blob and wait for it instead of pulling the same bits again.
While 1) is a technical requirement for 2), it has some other nice side effects: pulling becomes more robust. If the process gets killed mid-pull, or hits transient network errors, we do not start from scratch again, since some layers may already be committed. Another nice improvement is that we squeezed the last remaining bits of performance out of c/image.
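As a sketch of that robustness point (with hypothetical helpers — layerExists and pullLayer are stand-ins, not real c/image or c/storage APIs): because layers are committed one by one, a restarted pull only fetches what is still missing.

```go
package main

import (
	"context"
	"fmt"
)

// Hypothetical stand-ins for storage/registry operations.
func layerExists(digest string) bool { return false }

func pullLayer(ctx context.Context, digest string) error {
	fmt.Println("pulling", digest)
	return nil
}

// pullMissingLayers only fetches layers that are not already committed, so a
// pull that was killed halfway (or raced with another pull) resumes instead
// of starting from scratch.
func pullMissingLayers(ctx context.Context, digests []string) error {
	for _, d := range digests {
		if layerExists(d) {
			continue // committed by an earlier or parallel pull
		}
		if err := pullLayer(ctx, d); err != nil {
			return err // the next attempt picks up from here
		}
	}
	return nil
}

func main() {
	_ = pullMissingLayers(context.Background(), []string{"sha256:aaa", "sha256:bbb"})
}
```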
A friendly reminder that this issue had no activity for 30 days.
Another month; let's talk about this at planning this week.
A friendly reminder that this issue had no activity for 30 days.
We want to tackle this item this year. We broke it into separate pieces and I am positive we'll get it done.
A friendly reminder that this issue had no activity for 30 days.
A friendly reminder that this issue had no activity for 30 days.
@vrothberg did we get some of this with your containers/image fixes?
> @vrothberg did we get some of this with your containers/image fixes?
No — we only got the first part addressed. The blob-copy detection is not yet done.
A friendly reminder that this issue had no activity for 30 days.
Still open.
A friendly reminder that this issue had no activity for 30 days.
A friendly reminder that this issue had no activity for 30 days.
A friendly reminder that this issue had no activity for 30 days.
Still desired but we need to plan for this work since it'll consume some time.
A friendly reminder that this issue had no activity for 30 days.
Hey, thanks for your work on this issue. Can we get an update on the current progress?
We're running into this issue on a self-hosted GitLab Runner instance with a pre-baked development image. That image is pretty big (~7 GB), and once a new pipeline starts, all 20 jobs try to pull the same image at the same time, completely congesting the link.
Since there's no real way of communicating between different jobs in GitLab Runner, a manual lock is sadly not a solution for us.
/kind feature
Description
If there is an existing libpod pull in flight for a remote image, new pulls of that image should block until the in-flight pull completes (it may error out) to avoid shipping the same bits over the network twice.
Steps to reproduce the issue:
In one terminal:
In another terminal, launched once blob sha256:9bfce... is maybe 10 MB into its pull:
Describe the results you received:
As you can see from the console output, both commands seem to have pulled both layers in parallel.
Describe the results you expected:
I'd rather have seen the second command print a message about blocking on an existing pull, idle while that pull went through, and then run the command using the blobs pushed into local storage by that first pull.
Additional information you deem important (e.g. issue happens only occasionally):
My end goal is to front-load image pulls for a script that uses several images. Something like:
That way, image2 and image3 can be trickling in over a slow network while I'm spending CPU time running the image1 container, etc.

I don't really care about locking parallel manifest pulls, etc., because those are small; this just has to be for layers (possibly only for layers over a given size threshold). Of course, I'm fine with manifest/config locking if it's easier to just drop the same locking logic onto all CAS blobs.

It doesn't have to be coordinated across multiple layers either. If process 1 ends up pulling layer 1, and then process 2 comes along, sees the lock on layer 1, and decides to pull layer 2 while it's waiting for the lock on layer 1 to lift, that's fine. Process 1 might find layer 2 locked when it gets around to it, and they may end up leapfrogging through the layer stack (see the sketch below). That means individual layers might come down a bit more slowly, which would have a negative impact on time-to-launch if you were limited by unpack time. But I imagine most cases will be limited by network bandwidth, so unpacking-time delays wouldn't be a big deal.
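A rough sketch of that leapfrogging behavior, assuming the per-blob flock files discussed earlier in the thread; the package, function names, and polling interval are invented for illustration.

```go
package blobpull

import (
	"os"
	"time"

	"golang.org/x/sys/unix"
)

// tryLock attempts a non-blocking exclusive flock on "<path>.lock".
func tryLock(path string) (*os.File, bool) {
	f, err := os.OpenFile(path+".lock", os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return nil, false
	}
	if err := unix.Flock(int(f.Fd()), unix.LOCK_EX|unix.LOCK_NB); err != nil {
		f.Close()
		return nil, false
	}
	return f, true
}

// pullAll loops over the layers, pulling whichever ones it can lock and
// skipping (for now) the ones another process is already working on.
func pullAll(blobPaths []string, pull func(path string) error) error {
	pending := append([]string(nil), blobPaths...)
	for len(pending) > 0 {
		var next []string
		for _, p := range pending {
			if _, err := os.Stat(p); err == nil {
				continue // someone else finished this layer
			}
			f, ok := tryLock(p)
			if !ok {
				next = append(next, p) // locked elsewhere; leapfrog and retry later
				continue
			}
			err := pull(p)
			unix.Flock(int(f.Fd()), unix.LOCK_UN)
			f.Close()
			if err != nil {
				return err
			}
		}
		pending = next
		if len(pending) > 0 {
			time.Sleep(500 * time.Millisecond) // give the other puller time to progress
		}
	}
	return nil
}
```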
Locking would allow for a denial of service attack by a user on the same machine with access to the lock you'd use, because they could acquire the lock for a particular layer and then idle without actually working to pull that layer down. I'm not concerned about that in my own usage, but we might want the locks to be soft, and have the caller be able to shove in and start pulling in parallel anyway if they get tired of waiting (you could scale the wait time by blob size, after guessing at some reasonable bandwidth value?).
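And a sketch of that soft-lock idea under the same assumptions (the bandwidth constant and helper name are made up): wait for the lock only up to a deadline scaled by the blob size, then give up and let the caller pull in parallel.

```go
package blobpull

import (
	"os"
	"time"

	"golang.org/x/sys/unix"
)

// A guess at typical bandwidth, used only to scale how long we are willing to
// wait for another puller before giving up on the lock.
const assumedBytesPerSecond = 10 << 20 // 10 MiB/s

// waitOrSteal polls the lock without blocking until a size-scaled deadline.
// It returns true if we eventually got the lock, false if the caller should
// stop waiting and pull the blob in parallel anyway.
func waitOrSteal(lockFile *os.File, blobSize int64) bool {
	wait := time.Duration(blobSize/assumedBytesPerSecond+5) * time.Second
	deadline := time.Now().Add(wait)
	for time.Now().Before(deadline) {
		if unix.Flock(int(lockFile.Fd()), unix.LOCK_EX|unix.LOCK_NB) == nil {
			return true // lock acquired; proceed under it as usual
		}
		time.Sleep(time.Second)
	}
	return false // holder looks stuck or very slow
}
```

Scaling the deadline by blob size means a stuck holder of a small blob is detected quickly, while a legitimate slow pull of a large blob isn't stepped on prematurely.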
And I realize that this is probably mostly an issue for one of libpod's dependencies, but I haven't spent the time to track that down, and there didn't seem to be an existing placeholder issue in this repo. Please link me to the upstream issue if this already has a placeholder there.
Output of `podman version`:

Output of `podman info`: