containers / podman

Podman: A tool for managing OCI containers and pods.
https://podman.io
Apache License 2.0

feature: Per-image pull locking for bandwidth efficiency #1911

Open wking opened 5 years ago

wking commented 5 years ago


Description

If there is an existing libpod pull in flight for a remote image, new pulls of that image should block until the in-flight pull completes (it may error out) to avoid shipping the same bits over the network twice.

Steps to reproduce the issue:

  1. In one terminal:

    $ podman pull docker.io/library/centos:6
    Trying to pull docker.io/library/centos:6...Getting image source signatures
    Copying blob sha256:9bfcefca2b8da38bbfb8b6178a75f05245688b83fda45578bcdf51f56e4a5a9e
     66.60 MB / 66.60 MB [=====================================================] 13s
    Copying config sha256:0cbf37812bff083eb2325468c10aaf82011527c049d66106c3c74298ed239aaf
     2.60 KB / 2.60 KB [========================================================] 0s
    Writing manifest to image destination
    Storing signatures
    0cbf37812bff083eb2325468c10aaf82011527c049d66106c3c74298ed239aaf
  2. In another terminal, launched once blob sha256:9bfce... is maybe 10MB into its pull:

    $ podman run --rm docker.io/library/centos:6 echo hi
    Trying to pull docker.io/library/centos:6...Getting image source signatures
    Copying blob sha256:9bfcefca2b8da38bbfb8b6178a75f05245688b83fda45578bcdf51f56e4a5a9e
     66.60 MB / 66.60 MB [======================================================] 8s
    Copying config sha256:0cbf37812bff083eb2325468c10aaf82011527c049d66106c3c74298ed239aaf
     2.60 KB / 2.60 KB [========================================================] 0s
    Writing manifest to image destination
    Storing signatures
    hi

Describe the results you received:

As you can see from the console output, both commands pulled the same blobs in parallel, shipping the same bits over the network twice.

Describe the results you expected:

I'd rather have seen the second command print a message about blocking on an existing pull, wait idle while that pull completed, and then run the container using the blobs written into local storage by the first pull.

Additional information you deem important (e.g. issue happens only occasionally):

My end goal is to front-load image pulls for a script that uses several images. Something like:

    podman pull image2 image3 &
    podman run image1 some stuff
    podman run image2 other stuff
    wait
    podman run image3 more stuff

That way, image2 and image3 can be trickling in over a slow network while I'm spending CPU time running the image1 container, etc.

I don't really care about locking parallel manifest pulls, etc., because those are small; this just has to be for layers (possibly only for layers over a given size threshold). Of course, I'm fine with manifest/config locking if it's easier to just drop the same locking logic onto all CAS blobs. It doesn't have to be coordinated across multiple layers either. If process 1 ends up pulling layer 1, and then process 2 comes along, sees the lock on layer 1, and decides to pull layer 2 while it's waiting for the lock to lift on layer 1, that's fine. Process 1 might find layer 2 locked when it gets around to it, and they may end up leapfrogging through the layer stack. That means individual layers might come down a bit more slowly, which would have a negative impact on time-to-launch if you were limited by unpack-time. But I imagine most cases will be limited by network bandwidth, so unpacking-time delays wouldn't be a big deal.

Locking would allow for a denial of service attack by a user on the same machine with access to the lock you'd use, because they could acquire the lock for a particular layer and then idle without actually working to pull that layer down. I'm not concerned about that in my own usage, but we might want the locks to be soft, and have the caller be able to shove in and start pulling in parallel anyway if they get tired of waiting (you could scale the wait time by blob size, after guessing at some reasonable bandwidth value?).
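For illustration, here is a minimal Go sketch of such a "soft" per-blob lock: it polls a non-blocking flock and gives up after a deadline scaled by the blob size, at which point the caller pulls in parallel anyway. All names here (tryLockBlob, assumedBandwidth, the lock-file layout) are hypothetical, not part of any containers/image or containers/storage API.

    package main

    import (
        "fmt"
        "os"
        "path/filepath"
        "time"

        "golang.org/x/sys/unix"
    )

    // tryLockBlob tries to take an exclusive flock on <lockDir>/<digest>.lock.
    // It polls with LOCK_NB instead of blocking forever, and gives up after a
    // deadline scaled by the blob size, so a stalled holder cannot starve us.
    func tryLockBlob(lockDir, digest string, blobSize int64) (*os.File, bool, error) {
        const assumedBandwidth = 1 << 20 // guess ~1 MiB/s; purely illustrative
        deadline := time.Duration(blobSize/assumedBandwidth+10) * time.Second

        f, err := os.OpenFile(filepath.Join(lockDir, digest+".lock"), os.O_CREATE|os.O_RDWR, 0o600)
        if err != nil {
            return nil, false, err
        }
        for start := time.Now(); ; {
            err := unix.Flock(int(f.Fd()), unix.LOCK_EX|unix.LOCK_NB)
            if err == nil {
                return f, true, nil // we hold the lock; go pull the blob
            }
            if err != unix.EWOULDBLOCK {
                f.Close()
                return nil, false, err
            }
            if time.Since(start) > deadline {
                f.Close()
                return nil, false, nil // soft lock: stop waiting, pull in parallel
            }
            time.Sleep(500 * time.Millisecond)
        }
    }

    func main() {
        f, locked, err := tryLockBlob(os.TempDir(), "sha256-9bfcefca2b8d", 66<<20)
        if err != nil {
            panic(err)
        }
        if locked {
            defer f.Close() // closing the file releases the flock
            fmt.Println("pulling blob while holding the lock")
        } else {
            fmt.Println("gave up waiting; pulling in parallel anyway")
        }
    }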

And I realize that this is probably mostly an issue for one of libpod's dependencies, but I haven't spent the time to track this one down, and there didn't seem to be an existing placeholder issue in this repo. Please link me to the upstream issue if this already has a placeholder there.

Output of podman version:

$ podman version
Version:       0.11.2-dev
Go Version:    go1.10.3
Git Commit:    "6df7409cb5a41c710164c42ed35e33b28f3f7214"
Built:         Fri Nov 30 21:32:33 2018
OS/Arch:       linux/amd64

Output of podman info:

$ podman info
host:
  BuildahVersion: 1.5-dev
  Conmon:
    package: podman-0.10.1.3-1.gitdb08685.el7.x86_64
    path: /usr/libexec/podman/conmon
    version: 'conmon version 1.12.0-dev, commit: 0e7abf050824203d775a20c26ed14f716f7d1aa7-dirty'
  Distribution:
    distribution: '"rhel"'
    version: "7.5"
  MemFree: 9789857792
  MemTotal: 16289890304
  OCIRuntime:
    package: runc-1.0.0-52.dev.git70ca035.el7_5.x86_64
    path: /usr/bin/runc
    version: 'runc version spec: 1.0.0'
  SwapFree: 6291070976
  SwapTotal: 8254386176
  arch: amd64
  cpus: 8
  hostname: trking.remote.csb
  kernel: 3.10.0-891.el7.x86_64
  os: linux
  rootless: false
  uptime: 986h 7m 57.59s (Approximately 41.08 days)
insecure registries:
  registries: []
registries:
  registries:
  - registry.access.redhat.com
store:
  ContainerStore:
    number: 10
  GraphDriverName: overlay
  GraphOptions:
  - overlay.override_kernel_check=true
  GraphRoot: /var/lib/containers/storage
  GraphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
  ImageStore:
    number: 12
  RunRoot: /var/run/containers/storage
rhatdan commented 5 years ago

@mtrmac PTAL. I figure this would have to be in containers/image, since doing it only for Podman would still allow conflicts with CRI-O, Skopeo, and Buildah.

mtrmac commented 5 years ago

Well, this is where #NOBIGFATDAEMONS makes things more difficult :)

wking commented 5 years ago

> Maybe that involves a separate, separately-locked, map from digests to lock numbers, or something like that...

I was imagining flocks on ${PATH_TO_EVENTUAL_BLOB_FILE}.lock or some such. While the lock space is large, there will probably never be more than a few dozen acquired at any one time.
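A minimal sketch of that flock idea, assuming a ".lock" file next to where the blob will eventually land (the path layout and helper name are hypothetical, not how containers/storage actually names blobs):

    package main

    import (
        "fmt"
        "os"
        "path/filepath"

        "golang.org/x/sys/unix"
    )

    // lockBlobFile blocks until any in-flight pull of the same blob releases
    // the lock on ${PATH_TO_EVENTUAL_BLOB_FILE}.lock.
    func lockBlobFile(eventualBlobPath string) (*os.File, error) {
        f, err := os.OpenFile(eventualBlobPath+".lock", os.O_CREATE|os.O_RDWR, 0o600)
        if err != nil {
            return nil, err
        }
        if err := unix.Flock(int(f.Fd()), unix.LOCK_EX); err != nil {
            f.Close()
            return nil, err
        }
        return f, nil // caller pulls (or reuses) the blob, then closes f to unlock
    }

    func main() {
        blob := filepath.Join(os.TempDir(), "sha256-9bfcefca2b8d")
        lock, err := lockBlobFile(blob)
        if err != nil {
            panic(err)
        }
        defer lock.Close() // releases the flock
        fmt.Println("holding", blob+".lock; safe to download or reuse the blob")
    }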

rhatdan commented 5 years ago

@vrothberg Do we have this now?

vrothberg commented 5 years ago

There was a very similar discussion recently over at https://github.com/containers/libpod/issues/2551.

@rhatdan, if you want this to work, we can work on that but we need to allocate a bigger chunk of time.

rhatdan commented 5 years ago

I am always interested in anything that improves performance.

vrothberg commented 5 years ago

I can start looking into this in the next sprint :+1:

rhatdan commented 5 years ago

@vrothberg Could you update this issue with your progress?

vrothberg commented 5 years ago

Sure. The blob-locking mechanism in containers/storage is done. Each blob, when being copied into containers-storage, will receive a dedicated lock file in the storage driver's tmp directory. That's the central mechanism for serialization and synchronization.

The progress-bar library received some backported features we needed to update the bars on the fly, and we're already making use of them.

Currently, we are working on rewriting the backend code for containers-storage a bit in order to write the layer to storage in ImageDestination.PutBlob(...). Otherwise, the code is subject to deadlocks (ABBA). The CIs for this rewrite are now green, so we can reuse this work for the copy detection.

Note that I lost at least one working week cleaning up breaking builds and tests (unrelated to those PRs) when trying to test the PRs in buildah, libpod and cri-o (and had to do this multiple times).
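For readers unfamiliar with the ABBA deadlock mentioned above, here is a generic Go illustration; the lock names are stand-ins, not the actual containers-storage locks. The hazard is two code paths taking the same two locks in opposite orders; keeping one consistent order (or holding fewer locks at once, as in the rework described above) avoids it.

    package main

    import "sync"

    var (
        storeLock sync.Mutex // stand-in for the image/layer store lock
        blobLock  sync.Mutex // stand-in for a per-blob lock
    )

    // Both paths take storeLock (A) before blobLock (B). If one of them took
    // B before A instead, the two goroutines in main could each hold one lock
    // while waiting for the other — the classic ABBA deadlock.
    func pullPath() {
        storeLock.Lock()
        defer storeLock.Unlock()
        blobLock.Lock()
        defer blobLock.Unlock()
        // ... write the blob ...
    }

    func commitPath() {
        storeLock.Lock()
        defer storeLock.Unlock()
        blobLock.Lock()
        defer blobLock.Unlock()
        // ... commit the layer ...
    }

    func main() {
        var wg sync.WaitGroup
        wg.Add(2)
        go func() { defer wg.Done(); pullPath() }()
        go func() { defer wg.Done(); commitPath() }()
        wg.Wait()
    }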

baude commented 5 years ago

What is the latest on this, @vrothberg?

vrothberg commented 5 years ago

Plenty of tears and sweat over here: https://github.com/containers/image/pull/611

It's working and CI is green but it turned into something really big, so I expect reviewing to still take a while.

baude commented 5 years ago

Going to close; reopen if needed.

vrothberg commented 5 years ago

Reopening, as it's still valid and the PR in c/image has stalled.

vrothberg commented 5 years ago

@baude @rhatdan if that's still a desired feature, we should revive https://github.com/containers/image/pull/611 and prioritize it or get it on our radar again.

rhatdan commented 5 years ago

Yes I think this is something we should fix.

github-actions[bot] commented 5 years ago

This issue had no activity for 30 days. In the absence of activity or the "do-not-close" label, the issue will be automatically closed within 7 days.

vrothberg commented 5 years ago

Still something desirable but no progress. I'll add the label.

rhatdan commented 4 years ago

Ping to bring this back to people's consciousness.

vrothberg commented 4 years ago

Needs a priority :^)

wking commented 4 years ago

cri-o/cri-o#3409 fixed this issue one level up, for folks who are coming at this down that pipe. As mentioned there, there will be continued work on getting a fix down lower in the stack for other containers/image consumers, but fixing at those levels is more complicated.

rhatdan commented 4 years ago

I think this should start moving up the priority list, now that APIv2 has settled down.

rhatdan commented 4 years ago

@mtrmac PTAL

vrothberg commented 4 years ago

> I think this should start moving up the priority list, now that APIv2 has settled down.

Something for planning. The work required to get this done is comparatively high. We worked on it for a while last year, but priorities changed.

github-actions[bot] commented 4 years ago

A friendly reminder that this issue had no activity for 30 days.

github-actions[bot] commented 3 years ago

A friendly reminder that this issue had no activity for 30 days.

rhatdan commented 3 years ago

This is still necessary but still getting kicked down the road.

rhatdan commented 3 years ago

@umohnani8 @baude We need to bump this up the priority list.

vrothberg commented 3 years ago

> @umohnani8 @baude We need to bump this up the priority list.

:+1: I'd love to get this done. We talked about this issue during the last team meeting. There are two parts to get it done:

  1. We need to commit layers to the storage as soon as we can. Currently, we first download all layers in parallel but commit only once all are downloaded. Committing happens in sequence.
  2. Once 1) is done, we have all things in place to detect which layers are currently being pulled (and committed) without running into ABBA locks.

While 1) is a technical requirement for 2), it has other nice side effects, namely that pulling becomes more robust: if the process gets killed mid-pull or hits transient network errors, we do not start from scratch, since some layers may already be committed. Another nice improvement is that we squeezed the last remaining bits of performance out of c/image.
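As a rough, hypothetical sketch of point 1 (downloadLayer and commitLayer are placeholders, not c/image API): layers download in parallel, but each one is committed as soon as it is available, in order, so an interrupted pull leaves the already-committed layers reusable.

    package main

    import (
        "fmt"
        "sync"
    )

    func downloadLayer(digest string) string { return "/tmp/" + digest } // placeholder
    func commitLayer(path string)            { fmt.Println("committed", path) } // placeholder

    func main() {
        layers := []string{"sha256-aaa", "sha256-bbb", "sha256-ccc"}
        done := make([]chan string, len(layers))

        // Downloads run in parallel.
        var wg sync.WaitGroup
        for i, d := range layers {
            done[i] = make(chan string, 1)
            wg.Add(1)
            go func(i int, d string) {
                defer wg.Done()
                done[i] <- downloadLayer(d)
            }(i, d)
        }

        // Commits stay in order (parent before child) but each one starts as
        // soon as that layer has finished downloading, rather than after the
        // whole image is down.
        for i := range layers {
            commitLayer(<-done[i])
        }
        wg.Wait()
    }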

github-actions[bot] commented 3 years ago

A friendly reminder that this issue had no activity for 30 days.

rhatdan commented 3 years ago

Another month. Let's talk about this at planning this week.

github-actions[bot] commented 3 years ago

A friendly reminder that this issue had no activity for 30 days.

vrothberg commented 3 years ago

We want to tackle this item this year. We broke it into separate pieces and I am positive we'll get it done.

github-actions[bot] commented 3 years ago

A friendly reminder that this issue had no activity for 30 days.

github-actions[bot] commented 3 years ago

A friendly reminder that this issue had no activity for 30 days.

rhatdan commented 3 years ago

@vrothberg did we get some of this with your containers/image fixes?

vrothberg commented 3 years ago

> @vrothberg did we get some of this with your containers/image fixes?

Not fully: the first part is addressed, but the blob-copy detection is not yet done.

github-actions[bot] commented 3 years ago

A friendly reminder that this issue had no activity for 30 days.

rhatdan commented 3 years ago

Still open.

github-actions[bot] commented 3 years ago

A friendly reminder that this issue had no activity for 30 days.

github-actions[bot] commented 3 years ago

A friendly reminder that this issue had no activity for 30 days.

github-actions[bot] commented 3 years ago

A friendly reminder that this issue had no activity for 30 days.

vrothberg commented 3 years ago

Still desired but we need to plan for this work since it'll consume some time.

github-actions[bot] commented 2 years ago

A friendly reminder that this issue had no activity for 30 days.

Nukesor commented 6 months ago

Hey, thanks for your work on this issue. Can we get an update on the current progress?

We're running into this issue on a self-hosted GitLab Runner instance with a pre-baked development image. That image is pretty big (~7 GB), and once a new pipeline starts, all 20 jobs try to pull the same image at the same time, completely congesting the link.

Since there's no real way of communicating between different jobs in GitLab Runner, a manual lock is sadly not an option for us.