containers / skopeo

Work with remote images registries - retrieving information, images, signing content
Apache License 2.0

Performance improvements for large sync jobs #1498

Open adriangb opened 2 years ago

adriangb commented 2 years ago

I recently had the need to sync multiple images across registries, and this seems like the best tool for the job. Thank you to all of the developers!

For my use case, I have several images, each with dozens to hundreds of tags. Currently, syncing is basically impossible because:

  1. The initial sync takes a long time because there are a lot of images/tags to sync. This could be mitigated by parallelizing as suggested in https://github.com/containers/skopeo/pull/1445#issuecomment-919953326

  2. Subsequent syncs take a very long time even if nothing is copied in the end. I believe this is because after getting the tags from the source, each one is checked against the destination individually. I wonder if it would be possible to get the tags from the destination and pre-compute the missing ones so that a no-op sync would be almost instant.

Unfortunately I am not a Gopher and know almost nothing about registry APIs, so I don't think I can implement this myself, but I wanted to leave the suggestion.

Thanks!

mtrmac commented 2 years ago

Thanks for your report.

Just to be sure, please make sure to use Skopeo ≥ 1.2.3, to benefit from #1189.

> I wonder if it would be possible to get the tags from the destination and pre-compute the missing ones so that a no-op sync would be almost instant.

That wouldn’t work because listing tags just lists the tag names, but finding out what image the tag points to requires another per-image round-trip.
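
To illustrate the extra round-trip: a tag list alone cannot reveal that a tag has moved, so a comparison that tolerates mutable tags has to look up a digest per tag on both sides. A minimal sketch, assuming the installed skopeo supports `skopeo inspect --format '{{.Digest}}'` and that copies preserve the manifest digest; the helper names and digests are hypothetical:

```python
import subprocess


def tag_digest(image: str, tag: str) -> str:
    """Return the manifest digest a tag points to (one registry round-trip per tag)."""
    result = subprocess.run(
        ["skopeo", "inspect", "--format", "{{.Digest}}", f"docker://{image}:{tag}"],
        stdout=subprocess.PIPE, text=True, check=True,
    )
    return result.stdout.strip()


def tags_needing_copy(src_digests: dict[str, str], dest_digests: dict[str, str]) -> list[str]:
    """Tags that are missing from the destination or point to a different digest there."""
    return sorted(
        tag for tag, digest in src_digests.items()
        if dest_digests.get(tag) != digest
    )


# Hypothetical digests: "1.0" is unchanged, "latest" has moved, "1.1" is missing.
src = {"1.0": "sha256:aaa", "1.1": "sha256:bbb", "latest": "sha256:bbb"}
dest = {"1.0": "sha256:aaa", "latest": "sha256:aaa"}
print(tags_needing_copy(src, dest))
```

Filling `src` and `dest` with `tag_digest` costs one request per tag per registry, which is the per-image cost described above.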


The underlying image client code is primarily single-image focused, and it could almost certainly be more efficient when dealing with repos that have many tags (approximately going from 3 HTTP requests to 1).

Beyond that, actually measuring and profiling the performance might well uncover more optimization possibilities.

adriangb commented 2 years ago

> Just to be sure, please make sure to use Skopeo ≥ 1.2.3, to benefit from #1189.

I was using v1.4.1

> That wouldn’t work because listing tags just lists the tag names, but finding out what image the tag points to requires another per-image round-trip.

Just to be clear, you're referring to the situation where the tag exists but the image is completely different? I guess I hadn't thought of that, we use tags as if they were immutable and never duplicate them.

I guess maybe I can manually list tags for both the source and destination, compute the difference and then launch multiple processes to do the sync on each image.
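
A rough sketch of that plan, assuming tags are treated as immutable so that a tag name present on both sides means "already synced". The repository names, tag lists, and helper names are hypothetical; commands are built as argument lists so no shell quoting is involved:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_copy_commands(src_repo: str, dest_repo: str, src_tags, dest_tags) -> list[list[str]]:
    """One `skopeo copy` argv per tag present in the source but not the destination."""
    missing = sorted(set(src_tags) - set(dest_tags))
    return [
        ["skopeo", "copy", f"docker://{src_repo}:{tag}", f"docker://{dest_repo}:{tag}"]
        for tag in missing
    ]


def run_all(commands: list[list[str]], max_workers: int = 4) -> list[int]:
    """Run the commands with bounded parallelism and return their exit codes."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda cmd: subprocess.run(cmd).returncode, commands))


# Hypothetical inputs; in practice the tag lists would come from `skopeo list-tags`.
commands = build_copy_commands(
    "src.example.com/app", "dest.example.com/app",
    src_tags=["1.0", "1.1", "1.2"], dest_tags=["1.0"],
)
for cmd in commands:
    print(" ".join(cmd))
```

Passing `commands` to `run_all` would start the copies; when nothing is missing the list is empty and no process is launched, which is what makes a no-op sync cheap.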

mtrmac commented 2 years ago

> Just to be clear, you're referring to the situation where the tag exists but the image is completely different?

Yes.

> I guess I hadn't thought of that, we use tags as if they were immutable and never duplicate them.

The :latest convention, at least, is very widespread.

> I guess maybe I can manually list tags for both the source and destination, compute the difference and then launch multiple processes to do the sync on each image.

That seems reasonable, and I’d be interested to hear any data about a possible speedup.

It might make sense to have an opt-in (per-repo?) option to assume tags never change, and to do the list difference optimization you suggest. I’d be happy to review a PR, but that’s not an optimization I expect to be personally working on any time soon.

adriangb commented 2 years ago

Sounds good.

Let me at least take the easy route of writing a quick wrapper around list-tags and sync, see how it goes, and go from there. Thanks for the input!

ChristianCiach commented 2 years ago

@adriangb I've built such a wrapper. Maybe it helps you to write your own. Actually, it's not really a wrapper, but a script that generates a sync.json that can be passed to skopeo sync --src yaml:

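#!/usr/bin/env python3
# The shebang lets the CI job below run this file directly as ./create_sync_json.py.
import json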
import os
import subprocess

to_sync = {
    "docker.io": [
        "library/mariadb",
        "confluentinc/cp-kafka",
    ],
    "quay.io": [
        "containers/skopeo",
    ],
}

target_repo_creds = os.environ["DEST_CREDENTIALS"]
target_repo = "your.destination.registry.tld/repo"

def get_tags(image: str, creds: str = None) -> set[str]:
    command = ["skopeo", "list-tags"]
    if creds:
        command.extend(["--creds", creds])
    command.append(f"docker://{image}")
    skopeo_result = subprocess.run(
        command, stdout=subprocess.PIPE, text=True, check=True
    )
    json_result = json.loads(skopeo_result.stdout)
    return set(json_result["Tags"])

result = {}

for registry, imgs in to_sync.items():
    images = {}
    for img in imgs:
        src_tags = get_tags(f"{registry}/{img}")
        dest_tags = get_tags(f"{target_repo}/{registry}/{img}", target_repo_creds)

        # always sync tags that start with "latest"
        dest_tags = {x for x in dest_tags if not x.startswith("latest")}
        tags_to_sync = src_tags.difference(dest_tags)

        if tags_to_sync:
            images[img] = list(tags_to_sync)
    result[registry] = {"images": images}

print(json.dumps(result))

I generate JSON (which is a subset of YAML) rather than YAML because the script can then be executed directly inside a quay.io/containers/skopeo container, which conveniently ships python3 and the json module.
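
For reference, the printed document follows the registry -> "images" -> repository -> tag-list layout that `skopeo sync --src yaml` reads. A small sketch of that shape; the tag values here are made up for illustration:

```python
import json

# Hypothetical result for illustration: registry -> "images" -> repository -> tags.
sync_spec = {
    "docker.io": {"images": {"library/mariadb": ["10.6", "10.7"]}},
    "quay.io": {"images": {"containers/skopeo": ["v1.5.0"]}},
}

# Plain JSON output is also parseable as YAML, so no YAML library is needed.
rendered = json.dumps(sync_spec, indent=2)
print(rendered)
```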

This is my GitLab CI pipeline:

stages:
  - mirror

mirror:
  stage: mirror
  image:
    name: quay.io/skopeo/stable:latest
    entrypoint: [""]
  variables:
    DEST_CREDENTIALS: "$DEST_CREDENTIALS"  # configured as Gitlab CI/CD variable
  script:
    - './create_sync_json.py > sync.json'
    - >
        skopeo sync
        --retry-times 5
        --scoped
        --dest-creds "$DEST_CREDENTIALS"
        --src yaml
        --dest docker
        sync.json
        your.destination.registry.tld/repo

As expected, this speeds up the sync process tremendously. This also prevents our destination registry from registering a pull for every synced image (see https://github.com/containers/skopeo/issues/1516).

adriangb commented 2 years ago

Nice!

I ended up building something as well, just forgot to update this thread. I needed to deal with filtering, auth and concurrency limits, so my version is considerably more complex. I like how simple yours is 😄

import argparse
import asyncio
import json
import logging
import os
import re
import sys
import typing

logging.basicConfig(level="INFO")

logger = logging.getLogger(__name__)

class Auth(typing.NamedTuple):
    username: str
    password: str

async def list_tags(repository: str, image: str, *, auth: typing.Optional[Auth] = None) -> list[str]:
    uri = f"docker://{repository}/{image}"
    auth_opts = f"--creds {auth.username}:{auth.password}" if auth is not None else ""
    proc = await asyncio.create_subprocess_shell(
        f"skopeo list-tags {uri} {auth_opts}",
        stdout=asyncio.subprocess.PIPE,
    )
    stdout, _ = await proc.communicate()
    if proc.returncode != 0:
        print(f"Error retrieving tags for {uri}")
        sys.exit(1)
    return json.loads(stdout)["Tags"]

def diff_tags(source_tags: typing.Iterable[str], destination_tags: typing.Iterable[str]) -> typing.Iterable[str]:
    tags = set(source_tags)
    tags.difference_update(destination_tags)
    logger.info(f"Found {len(tags)} tags in source that are not in destination")
    logger.debug(f"Planning to sync {tags}")
    return tags

def filter_tags(tags: typing.Iterable[str], pattern: str) -> typing.Iterable[str]:
    logger.debug(f"Filtering tags by pattern {pattern}")
    return (tag for tag in tags if re.match(pattern, tag))

async def sync(
    sem: asyncio.Semaphore,
    image: str,
    tag: str,
    src_repository: str,
    dest_repository: str,
    src_auth: typing.Optional[Auth],
    dest_auth: typing.Optional[Auth],
):
    src_uri = f"docker://{src_repository}/{image}:{tag}"
    dest_uri = f"docker://{dest_repository}/{image}:{tag}"
    src_auth_opts = f"--src-creds {src_auth.username}:{src_auth.password}" if src_auth is not None else ""
    dest_auth_opts = f"--dest-creds {dest_auth.username} --dest-registry-token {dest_auth.password}" if dest_auth is not None else ""
    async with sem:
        logger.info(f"Copying {src_uri} -> {dest_uri}")
        proc = await asyncio.create_subprocess_shell(
            f"skopeo copy {src_uri} {dest_uri} {src_auth_opts} {dest_auth_opts}",
        )
        await proc.wait()
        if proc.returncode != 0:
            logger.error(f"Error syncing {src_uri}")
        else:
            # src_uri and dest_uri already end in ":{tag}", so it is not appended again.
            logger.info(f"Synced {src_uri} -> {dest_uri}")

async def process_image(
    sem: asyncio.Semaphore,
    tasks: typing.List[asyncio.Task[None]],
    image: str,
    pattern: str,
    src_repository: str,
    dest_repository: str,
    src_auth: typing.Optional[Auth],
    dest_auth: typing.Optional[Auth],
):
    logger.info(f"Gathering tags for {src_repository}/{image} -> {dest_repository}/{image}")
    async with sem:
        src_tags, dest_tags = await asyncio.gather(
            asyncio.create_task(list_tags(src_repository, image, auth=src_auth)),
            asyncio.create_task(list_tags(dest_repository, image, auth=dest_auth)),
        )
        tags = diff_tags(filter_tags(src_tags, pattern=pattern), dest_tags)
    tags = list(tags)
    for tag in tags:
        tasks.append(
            asyncio.create_task(
                sync(
                    sem=sem,
                    src_repository=src_repository,
                    image=image,
                    tag=tag,
                    dest_repository=dest_repository,
                    src_auth=src_auth,
                    dest_auth=dest_auth,
                )
            )
        )

async def main(src_auth: typing.Optional[Auth], dest_auth: typing.Optional[Auth], concurrency: int) -> None:
    # Load config.json from the directory containing this script.
    with open(os.path.join(os.path.dirname(__file__), "config.json")) as f:
        cfg = json.load(f)
    dest_repository = cfg["destination"]
    src_repository = cfg["source"]
    sync_tasks: typing.List[asyncio.Task[None]] = []
    gather_tasks: typing.List[asyncio.Task[None]] = []
    images: typing.Dict[str, str] = cfg["images"]
    sem = asyncio.Semaphore(concurrency)
    for image, pattern in images.items():
        gather_tasks.append(
            asyncio.create_task(
                process_image(
                    sem=sem,
                    tasks=sync_tasks,
                    image=image,
                    pattern=pattern,
                    src_repository=src_repository,
                    dest_repository=dest_repository,
                    src_auth=src_auth,
                    dest_auth=dest_auth,
                )
            )
        )
    await asyncio.gather(*gather_tasks)
    await asyncio.gather(*sync_tasks)

if __name__ == "__main__":
    parser = argparse.ArgumentParser("skopeo-sync")
    parser.add_argument("--src-username", help="Source registry username", type=str,  default=None)
    parser.add_argument("--src-password", help="Source registry password", type=str,  default=None)
    parser.add_argument("--dest-username", help="Destination registry username", type=str,  default=None)
    parser.add_argument("--dest-password", help="Destination registry password", type=str,  default=None)
    parser.add_argument("--concurrency", help="Max simultaneous processes to run", type=int,  default=32)
    args = parser.parse_args()

    src_auth = dest_auth = None
    if args.src_username and args.src_password:
        src_auth = Auth(args.src_username, args.src_password)
    if args.dest_username and args.dest_password:
        dest_auth = Auth(args.dest_username, args.dest_password)

    asyncio.run(main(src_auth, dest_auth, args.concurrency))

config.json:

{
    "destination": "...",
    "source": "...",
    "images": {
        "some/image": "^regex$"
    }
}

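
A note on the patterns in that config: the script's `filter_tags` uses `re.match`, which only anchors at the start of the tag, so the trailing `$` is what prevents suffixes such as `-rc1` from slipping through. A small standalone sketch with made-up tag names (this list-returning `filter_tags` is a simplified stand-in for the generator version above):

```python
import re


def filter_tags(tags, pattern: str) -> list[str]:
    """Keep tags for which re.match finds a match starting at the first character."""
    return [tag for tag in tags if re.match(pattern, tag)]


tags = ["1.0", "1.0-rc1", "v1.0", "latest"]  # hypothetical tag names

# Anchored at both ends: only plain MAJOR.MINOR tags survive.
print(filter_tags(tags, r"^\d+\.\d+$"))
# Without "$", re.match still anchors at the start but accepts any suffix.
print(filter_tags(tags, r"\d+\.\d+"))
```
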
github-actions[bot] commented 2 years ago

A friendly reminder that this issue had no activity for 30 days.

olfway commented 2 years ago

I'm also interested in this issue

helobinvn commented 1 year ago

I'm interested in this issue (from https://github.com/containers/skopeo/issues/1801)
