ReproNim / containers

Containers "distribution" for reproducible neuroimaging
Apache License 2.0

script to mirror all versions of containers from docker hub #48

Open yarikoptic opened 4 years ago

yarikoptic commented 4 years ago

e.g. an immediate use case is a "backup" of all NeuroDebian containers from https://hub.docker.com/_/neurodebian . I only wonder if it should all be dumped here or should be a separate dataset (probably separate). @kyleam - do you have some easy wrapper around docker that, given a Docker Hub "repository" (such as https://hub.docker.com/_/neurodebian), would mint all the needed calls to datalad containers-add?

kyleam commented 4 years ago

> @kyleam - do you have some easy wrapper around docker that, given a Docker Hub "repository" (such as https://hub.docker.com/_/neurodebian), would mint all the needed calls to datalad containers-add?

Hmm, does this come down to wanting a containers-add call for each tag in the repository? If so, perhaps something like this:

```python
import requests

endpoint = "https://hub.docker.com/v2"
repo = "neurodebian"
next_page = f"{endpoint}/repositories/library/{repo}/tags"

while next_page:
    print(next_page)
    response = requests.get(next_page)
    response.raise_for_status()
    data = response.json()
    next_page = data.get("next")
    for result in data.get("results", []):
        print(result["name"])
        # containers-add call if container isn't already present
```

kyleam commented 4 years ago

I've pushed a rough prototype, which may or may not be a viable start, to kyleam/datalad-container-dhub-tags (obviously not its final home if it ends up being useful). python containers_add_dhub_tags.py --help will give a few more details, but basically

```shell
echo neurodebian | python containers_add_dhub_tags.py
```

should lead to a containers-add --url dhub:// ... call for each tag under the neurodebian repository.

```shell
echo repronim/ | python containers_add_dhub_tags.py
```

should get a containers-add --url dhub:// ... call for each tag in each repository under repronim.
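The per-tag mapping the prototype performs could be sketched roughly as below. This is only a minimal sketch, not the prototype itself: the `repo-tag` container-name shape and the `dhub://` URL shape are assumptions modeled on the examples in this thread, and the datalad invocation is only constructed, never executed.

```python
def containers_add_call(repo, tag):
    """Construct (but do not run) a datalad containers-add invocation
    for one Docker Hub tag.  Name and URL shapes are assumptions
    modeled on the examples in this thread, not the prototype's exact output."""
    # bare repos like "neurodebian" live under the implicit "library" namespace
    ns_repo = repo if "/" in repo else f"library/{repo}"
    name = f"{repo.replace('/', '-')}-{tag}"
    return ["datalad", "containers-add", name, "--url", f"dhub://{ns_repo}:{tag}"]

# feeding this the tags from the pagination loop shown earlier would
# yield one invocation per tag, e.g.:
print(containers_add_call("neurodebian", "xenial-non-free"))
```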

yarikoptic commented 4 years ago

awesome -- I will give it a shot now. I am just curious -- why stdin? ;-)

yarikoptic commented 4 years ago

It works! Some observations I would like your input on -- I might not be seeing some possible issues:

NB it could even be -i dhub://library/neurodebian/xenial-non-free/f5098f2, thus really reflecting the original URL ;)

kyleam commented 4 years ago
> since there is only one version for a tag at a given point in time when we run it, and otherwise information about the tag would be lost, why not create those container names following tags, e.g. instead of neurodebian-f5098f2 for dhub://library/neurodebian:xenial-non-free it would be just neurodebian--xenial-non-free [...]

For the initial pass, the digest gave me an easy thing to use without worrying about cleaning the names. There is (or was, after the latest push) a todo comment about this. I've switched it to cleaning the names now.
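The cleaning could look something like the sketch below. This is only an illustration: the actual clean_container_name in the prototype may differ, and the allowed-character set (alphanumerics and dashes) is an assumption about what containers-add accepts.

```python
import re

def clean_container_name(name):
    """Turn a Docker Hub repo/tag string into a container name.

    Assumes (not verified here) that names are restricted to
    alphanumerics and dashes; any other run of characters becomes "-".
    """
    return re.sub(r"[^0-9a-zA-Z-]+", "-", name).strip("-")
```

For example, clean_container_name("library/neurodebian:xenial-non-free") would give "library-neurodebian-xenial-non-free".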

> even though not immediately usable, it would imho still be useful to place those images not under a flat .datalad/images but to establish a uniform hierarchy of directories reflecting those docker hub repositories etc., so something like

I've added namespace/repo subdirectories under the directories for images and manifests.


Please feel free to tweak and to move the script to wherever you'd like it.

yarikoptic commented 4 years ago

@kyleam do you know if there is some date associated with a manifest (I do not see anything in the manifest)?

kyleam commented 4 years ago

> do you know if there is some date associated with a manifest (I do not see anything in the manifest)?

I'm not sure I understand your question, but either way I'm confident that I don't know.

yarikoptic commented 4 years ago

FWIW, I see that the response headers for the tag contain only the current Date, not some Last-Modified:

```shell
(Pdb) p resp_man.headers
{'Content-Length': '2189', 'Content-Type': 'application/vnd.docker.distribution.manifest.v2+json', 'Docker-Content-Digest': 'sha256:9e131ac6f30d682d71cbdbcd98e0c40b0b730e179172535dce4c5a82a2283c26', 'Docker-Distribution-Api-Version': 'registry/2.0', 'Etag': '"sha256:9e131ac6f30d682d71cbdbcd98e0c40b0b730e179172535dce4c5a82a2283c26"', 'Date': 'Fri, 30 Oct 2020 16:47:35 GMT', 'Strict-Transport-Security': 'max-age=31536000'}
```

Why am I asking: I thought to add the datetime to the image directory/name so it would provide an ordering among images.

yarikoptic commented 4 years ago

FWIW, found the dates in the listing of tags:

```shell
(Pdb) pprint(data)
{'count': 72,
 'next': 'https://hub.docker.com/v2/repositories/library/neurodebian/tags?page=2',
 'previous': None,
 'results': [{'creator': 2215,
              'full_size': 46057490,
              'id': 13429244,
              'image_id': None,
              'images': [{'architecture': 'amd64',
                          'digest': 'sha256:9e131ac6f30d682d71cbdbcd98e0c40b0b730e179172535dce4c5a82a2283c26',
                          'features': '',
                          'last_pulled': '2020-10-30T14:15:58.27871Z',
                          'last_pushed': '2020-10-23T18:40:17.764065Z',
                          'os': 'linux',
                          'os_features': '',
                          'os_version': None,
                          'size': 46057490,
                          'status': 'active',
                          'variant': None}],
              'last_updated': '2020-10-23T18:40:49.940676Z',
              'last_updater': 1156886,
              'last_updater_username': 'doijanky',
              'name': 'xenial-non-free',
              'repository': 42825,
              'tag_last_pulled': '2020-10-30T14:15:58.27871Z',
              'tag_last_pushed': '2020-10-23T18:40:49.940676Z',
              'tag_status': 'active',
              'v2': True},
```

so I will take last_pushed for the image, which should not change (unless the same image is re-pushed periodically for some reason?). It also suggests that, generally speaking, we might want to encode the architecture as well, and maybe even dump that image information record alongside.
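Pulling last_pushed and the architecture out of one entry of that tags listing could look like the sketch below (field names taken from the payload shown above; the helper name is mine):

```python
def image_records(tag_record):
    """Yield (architecture, digest, last_pushed) for each image of one
    tag entry from the Docker Hub /tags listing shown above."""
    for img in tag_record.get("images", []):
        yield img["architecture"], img["digest"], img["last_pushed"]
```

The last_pushed value could then be turned into a sortable prefix for the image directory/name.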

yarikoptic commented 4 years ago

eh, I am confused between those "images" (multiple per tag) and the "manifest"'s digest, which is just one and does correspond to the image ID (but not really to any digest record in the list of images), e.g. if I look at "busybox" (with a bunch of architectures):

dump of exploration:

```shell
(Pdb) pprint([i for i in images if i['architecture'] == 'amd64'])
[{'architecture': 'amd64',
  'digest': 'sha256:c9249fdf56138f0d929e2080ae98ee9cb2946f71498fc1484288e6a935b5e5bc',
  'features': '',
  'last_pulled': '2020-10-30T15:49:02.178905Z',
  'last_pushed': '2020-10-14T10:25:29.856917Z',
  'os': 'linux',
  'os_features': '',
  'os_version': None,
  'size': 764619,
  'status': 'active',
  'variant': None}]
(Pdb) pprint(resp_man.json())
{'config': {'digest': 'sha256:f0b02e9d092d905d0d87a8455a1ae3e9bb47b4aa3dc125125ca5cd10d6441c9f',
            'mediaType': 'application/vnd.docker.container.image.v1+json',
            'size': 1493},
 'layers': [{'digest': 'sha256:9758c28807f21c13d05c704821fdd56c0b9574912f9b916c65e1df3e6b8bc572',
             'mediaType': 'application/vnd.docker.image.rootfs.diff.tar.gzip',
             'size': 764619}],
 'mediaType': 'application/vnd.docker.distribution.manifest.v2+json',
 'schemaVersion': 2}
[3]  + 3845902 suspended  python ./containers_add_dhub_tags.py <(echo busybox)
$> docker pull busybox
Using default tag: latest
latest: Pulling from library/busybox
9758c28807f2: Pull complete
Digest: sha256:a9286defaba7b3a519d585ba0e37d0b2cbee74ebfe590960b0b1d6a5e97d1e1d
Status: Downloaded newer image for busybox:latest
$> docker images --digests --all | grep busybox
busybox             latest              sha256:a9286defaba7b3a519d585ba0e37d0b2cbee74ebfe590960b0b1d6a5e97d1e1d   f0b02e9d092d        2 weeks ago         1.23MB
```

with code diff (probably not paste/patchable as is):

```diff
diff --git a/containers_add_dhub_tags.py b/containers_add_dhub_tags.py
index 0755fbe..7f70bfa 100644
--- a/containers_add_dhub_tags.py
+++ b/containers_add_dhub_tags.py
@@ -16,6 +16,7 @@
 import fileinput
 import json
 import logging
 from pathlib import Path
+from pprint import pprint
 import re

 import requests
@@ -42,7 +43,10 @@ def clean_container_name(name):
 def add_container(repo, tag, digest):
     from datalad.api import containers_add

-    target = Path("images", repo, digest)
+    # add suffix .dockersave so later we might save some other types
+    # of serialization or singularity converted images
+
+    target = Path(repo, "%s-%s.dockersave" % (tag, digest[:8]))
     if target.exists():
         lgr.info("Skipping %s:%s. Already exists: %s",
                  repo, tag, target)
@@ -73,21 +77,24 @@ def write_manifest(repo, digest, manifest):
     target.write_text(json.dumps(manifest))


-def get_manifests(repo, tags):
+def get_manifest_images(repo, tag_images):
     resp_auth = requests.get(REGISTRY_AUTH_URL.format(repo=repo))
     resp_auth.raise_for_status()
     headers = {
         "Authorization": "Bearer " + resp_auth.json()["token"],
         "Accept": "application/vnd.docker.distribution.manifest.v2+json"}
-    for tag in tags:
+    for tag, images in tag_images:
         lgr.debug("Getting manifest for %s:%s", repo, tag)
         # TODO: Can we check with HEAD first to see if the digest
         # matches what we have locally?
         resp_man = requests.get(f"{REGISTRY_ENDPOINT}/{repo}/manifests/{tag}",
                                 headers=headers)
         resp_man.raise_for_status()
-        yield tag, resp_man.json()
+        if len(images) != 1:
+            import pdb; pdb.set_trace()
+            raise NotImplementedError(
+                "ATM supporting only 1 image per tag. Got %s" % str(images))
+        yield tag, resp_man.json(), images[0]


@@ -101,10 +108,10 @@ def walk_pages(url):
         yield from data.get("results", [])


-def get_repo_tags(repo):
+def get_repo_tag_images(repo):
     url = f"{DHUB_ENDPOINT}/repositories/{repo}/tags"
     for result in walk_pages(url):
-        yield result["name"]
+        yield result["name"], result["images"]


 def get_namespace_repos(name):
@@ -150,7 +157,8 @@ def process_files(files):
     for repo in repos:
         try:
-            for tag, manifest in get_manifests(repo, get_repo_tags(repo)):
+            for tag, manifest, image in get_manifest_images(repo, get_repo_tag_images(repo)):
+                import pdb; pdb.set_trace()
                 digest = manifest["config"]["digest"]
                 assert digest.startswith("sha256:")
                 digest = digest[7:]
```

So is "image id" (which is in the manifest) the same across architectures?

yarikoptic commented 4 years ago

yeap -- the manifest is for all archs. Then it is possible to pull a specific image by giving the digest from images:

```shell
$> docker pull busybox@sha256:b8946184ce3ad6b4a09ebad2d85e81cfcaadc6897bfae2e9c6e2a4fe6afa6ee0
sha256:b8946184ce3ad6b4a09ebad2d85e81cfcaadc6897bfae2e9c6e2a4fe6afa6ee0: Pulling from library/busybox
5dce72bf4214: Pull complete
Digest: sha256:b8946184ce3ad6b4a09ebad2d85e81cfcaadc6897bfae2e9c6e2a4fe6afa6ee0
Status: Downloaded newer image for busybox@sha256:b8946184ce3ad6b4a09ebad2d85e81cfcaadc6897bfae2e9c6e2a4fe6afa6ee0

$> docker images --digests --all | grep busybox
busybox             <none>              sha256:b8946184ce3ad6b4a09ebad2d85e81cfcaadc6897bfae2e9c6e2a4fe6afa6ee0   65a89d0f0344        2 weeks ago         1.4MB
busybox             latest              sha256:a9286defaba7b3a519d585ba0e37d0b2cbee74ebfe590960b0b1d6a5e97d1e1d   f0b02e9d092d        2 weeks ago         1.23MB
busybox             latest              sha256:c9249fdf56138f0d929e2080ae98ee9cb2946f71498fc1484288e6a935b5e5bc   f0b02e9d092d        2 weeks ago         1.23MB
```

which is, I guess, what docker pull does -- it chooses the one for the matching architecture, but it could be hand-twisted to run the one desired (arm64 in this example):

```shell
$> docker run -it --rm busybox@sha256:b8946184ce3ad6b4a09ebad2d85e81cfcaadc6897bfae2e9c6e2a4fe6afa6ee0
standard_init_linux.go:178: exec user process caused "exec format error"
```

so, to support dumping images for various architectures there should also be a layer with <arch>-digest/.
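One way to encode that extra layer could look like the sketch below. The path scheme (images/<namespace>/<repo>/<tag>/<arch>-<short digest>/) and the helper name are hypothetical, chosen only to illustrate the <arch>-digest idea, not what the prototype does:

```python
from pathlib import Path

def image_target(repo, tag, arch, digest):
    """Hypothetical on-disk layout for one architecture of one tag:
    images/<namespace>/<repo>/<tag>/<arch>-<short digest>/"""
    # drop the "sha256:" prefix (if any) and keep a short unique-ish piece
    short = digest.split(":", 1)[-1][:8]
    return Path("images", repo, tag, f"{arch}-{short}")
```

For example, the arm64 busybox image above would land under images/library/busybox/latest/arm64-b8946184/.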

yarikoptic commented 4 years ago

another note (sorry for abusing this issue for that). Summary: so far I have found no "datetime" for a tag's manifest, and a tag's manifest could change while the referenced images stay the same (with the same dates), or change too; so there is in principle no "datetime" for a tag (manifest), only for the images... but the next confusion point: the manifest record lists layers (a single one) -- how come only one? (I bet different architectures' images have different ones... uff)

edit: if I do not request a specific type of record for the manifest, the returned record does match the one described in the API doc and is a record for a specific architecture

details:

```shell
(Pdb) pprint(requests.get(f"{REGISTRY_ENDPOINT}/{repo}/manifests/{tag}", headers=dict(Accept="application/vnd.docker.distribution.manifest.v2+json", **headers)).json())
{'config': {'digest': 'sha256:f0b02e9d092d905d0d87a8455a1ae3e9bb47b4aa3dc125125ca5cd10d6441c9f',
            'mediaType': 'application/vnd.docker.container.image.v1+json',
            'size': 1493},
 'layers': [{'digest': 'sha256:9758c28807f21c13d05c704821fdd56c0b9574912f9b916c65e1df3e6b8bc572',
             'mediaType': 'application/vnd.docker.image.rootfs.diff.tar.gzip',
             'size': 764619}],
 'mediaType': 'application/vnd.docker.distribution.manifest.v2+json',
 'schemaVersion': 2}
(Pdb) pprint(requests.get(f"{REGISTRY_ENDPOINT}/{repo}/manifests/{tag}", headers=dict(**headers)).json())
{'architecture': 'amd64',
 'fsLayers': [{'blobSum': 'sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4'},
              {'blobSum': 'sha256:9758c28807f21c13d05c704821fdd56c0b9574912f9b916c65e1df3e6b8bc572'}],
 'history': [{'v1Compatibility': '{"architecture":"amd64","config":{"Hostname":"","Domainname":"","User":"","AttachStdin":false,"AttachStdout":false,"AttachStderr":false,"Tty":false,"OpenStdin":false,"StdinOnce":false,"Env":["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"],"Cmd":["sh"],"ArgsEscaped":true,"Image":"sha256:11565868e68267a053372359046e1e70ce095538e95ff8398defd49bb66ddfce","Volumes":null,"WorkingDir":"","Entrypoint":null,"OnBuild":null,"Labels":null},"container":"6f1f5d35fed541933daae185eac73e333818ccec0b0760eb4cc8e30ce8d69de6","container_config":{"Hostname":"6f1f5d35fed5","Domainname":"","User":"","AttachStdin":false,"AttachStdout":false,"AttachStderr":false,"Tty":false,"OpenStdin":false,"StdinOnce":false,"Env":["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"],"Cmd":["/bin/sh","-c","#(nop) ","CMD [\\"sh\\"]"],"ArgsEscaped":true,"Image":"sha256:11565868e68267a053372359046e1e70ce095538e95ff8398defd49bb66ddfce","Volumes":null,"WorkingDir":"","Entrypoint":null,"OnBuild":null,"Labels":{}},"created":"2020-10-14T10:07:34.124876277Z","docker_version":"18.09.7","id":"6ed978c75173f577f023843ea61461568332f466c963e1b088d81fe676e8816c","os":"linux","parent":"bf938fec00b8d83c6d28a66dd6aa1cf76384aec8e63c7771648007b0dfce6fd8","throwaway":true}'},
             {'v1Compatibility': '{"id":"bf938fec00b8d83c6d28a66dd6aa1cf76384aec8e63c7771648007b0dfce6fd8","created":"2020-10-14T10:07:33.97009658Z","container_config":{"Cmd":["/bin/sh -c #(nop) ADD file:6098f054f12a3651c41038294c56d4a8c5c5d477259386e75ae2af763e84e683 in / "]}}'}],
 'name': 'library/busybox',
 'schemaVersion': 1,
 'signatures': [{'header': {'alg': 'ES256',
                            'jwk': {'crv': 'P-256',
                                    'kid': '4XJH:HFRU:YB6H:MXPT:C4BI:KMEY:LEEZ:C4XI:6V73:X2JX:MPND:UCMK',
                                    'kty': 'EC',
                                    'x': '0LayaLTMdYehsVNmUXsLFLC35jCGdXtHNcudkUUlf70',
                                    'y': '2B0uAwUNSkbn2DhNtw-pX5Shigm0Lpm-YtcGm1Rzj-w'}},
                 'protected': 'eyJmb3JtYXRMZW5ndGgiOjIxMjgsImZvcm1hdFRhaWwiOiJDbjAiLCJ0aW1lIjoiMjAyMC0xMC0zMFQxNzo1NTozMFoifQ',
                 'signature': 'mxFa64066f0LyZ-Xw5expe2CylulXFBFgZgcyqor6e6Xt7KI2WSrY4nNRWDY3IpjaVUWFbAa2Gv_wPQfV-8JXA'}],
 'tag': 'latest'}
```

edit2: https://docs.docker.com/registry/spec/manifest-v2-2/ came to the rescue. I can request the "fat" list of manifests, and then it lists them for all archs!

```shell
(Pdb) pprint(requests.get(f"{REGISTRY_ENDPOINT}/{repo}/manifests/{tag}", headers=dict(Accept="application/vnd.docker.distribution.manifest.list.v2+json", **headers)).json())
{'manifests': [{'digest': 'sha256:c9249fdf56138f0d929e2080ae98ee9cb2946f71498fc1484288e6a935b5e5bc',
                'mediaType': 'application/vnd.docker.distribution.manifest.v2+json',
                'platform': {'architecture': 'amd64', 'os': 'linux'},
                'size': 527},
               {'digest': 'sha256:a7c572c26ca470b3148d6c1e48ad3db90708a2769fdf836aa44d74b83190496d',
                'mediaType': 'application/vnd.docker.distribution.manifest.v2+json',
                'platform': {'architecture': 'arm', 'os': 'linux', 'variant': 'v5'},
                'size': 527},
               {'digest': 'sha256:ce800872092c37c5f20ef111a5a69c5c8e94d0c5e055f76f530cb5e78a26ec03',
                'mediaType': 'application/vnd.docker.distribution.manifest.v2+json',
                'platform': {'architecture': 'arm', 'os': 'linux', 'variant': 'v6'},
                'size': 527},
               {'digest': 'sha256:6655df04a3df853b029a5fac8836035ac4fab117800c9a6c4b69341bb5306c3d',
                'mediaType': 'application/vnd.docker.distribution.manifest.v2+json',
                'platform': {'architecture': 'arm', 'os': 'linux', 'variant': 'v7'},
                'size': 527},
               {'digest': 'sha256:b8946184ce3ad6b4a09ebad2d85e81cfcaadc6897bfae2e9c6e2a4fe6afa6ee0',
                'mediaType': 'application/vnd.docker.distribution.manifest.v2+json',
                'platform': {'architecture': 'arm64', 'os': 'linux', 'variant': 'v8'},
                'size': 527},
               {'digest': 'sha256:ba65e8d39e89b5c16f036c88c85952756777bf5385bce148bc44be48fac37d94',
                'mediaType': 'application/vnd.docker.distribution.manifest.v2+json',
                'platform': {'architecture': '386', 'os': 'linux'},
                'size': 527},
               {'digest': 'sha256:d7e83316d74e150866d82c45de342e78f662fe0aefbdb822d7d10c8b8e39cc4b',
                'mediaType': 'application/vnd.docker.distribution.manifest.v2+json',
                'platform': {'architecture': 'mips64le', 'os': 'linux'},
                'size': 527},
               {'digest': 'sha256:0a11a95568b680dce6906a015bed88381e28ad17b31a63f7fec057b35573235a',
                'mediaType': 'application/vnd.docker.distribution.manifest.v2+json',
                'platform': {'architecture': 'ppc64le', 'os': 'linux'},
                'size': 528},
               {'digest': 'sha256:426c855775f026d3fe76988b71938f4c9dc6840f09c0f29d8d4c75cc4238503b',
                'mediaType': 'application/vnd.docker.distribution.manifest.v2+json',
                'platform': {'architecture': 's390x', 'os': 'linux'},
                'size': 528}],
 'mediaType': 'application/vnd.docker.distribution.manifest.list.v2+json',
 'schemaVersion': 2}
```

and then request a specific manifest for a specific architecture (the digest of the layer would differ); the skinny one for the tag just corresponds to the first in the list (in this case arch amd64):

```shell
(Pdb) pprint(requests.get(f"{REGISTRY_ENDPOINT}/{repo}/manifests/{tag}", headers=dict(Accept="application/vnd.docker.distribution.manifest.v2+json", **headers)).json())
{'config': {'digest': 'sha256:f0b02e9d092d905d0d87a8455a1ae3e9bb47b4aa3dc125125ca5cd10d6441c9f',
            'mediaType': 'application/vnd.docker.container.image.v1+json',
            'size': 1493},
 'layers': [{'digest': 'sha256:9758c28807f21c13d05c704821fdd56c0b9574912f9b916c65e1df3e6b8bc572',
             'mediaType': 'application/vnd.docker.image.rootfs.diff.tar.gzip',
             'size': 764619}],
 'mediaType': 'application/vnd.docker.distribution.manifest.v2+json',
 'schemaVersion': 2}
(Pdb) pprint(requests.get(f"{REGISTRY_ENDPOINT}/{repo}/manifests/sha256:c9249fdf56138f0d929e2080ae98ee9cb2946f71498fc1484288e6a935b5e5bc", headers=dict(Accept="application/vnd.docker.distribution.manifest.v2+json", **headers)).json())
{'config': {'digest': 'sha256:f0b02e9d092d905d0d87a8455a1ae3e9bb47b4aa3dc125125ca5cd10d6441c9f',
            'mediaType': 'application/vnd.docker.container.image.v1+json',
            'size': 1493},
 'layers': [{'digest': 'sha256:9758c28807f21c13d05c704821fdd56c0b9574912f9b916c65e1df3e6b8bc572',
             'mediaType': 'application/vnd.docker.image.rootfs.diff.tar.gzip',
             'size': 764619}],
 'mediaType': 'application/vnd.docker.distribution.manifest.v2+json',
 'schemaVersion': 2}
(Pdb) pprint(requests.get(f"{REGISTRY_ENDPOINT}/{repo}/manifests/sha256:d7e83316d74e150866d82c45de342e78f662fe0aefbdb822d7d10c8b8e39cc4b", headers=dict(Accept="application/vnd.docker.distribution.manifest.v2+json", **headers)).json())
{'config': {'digest': 'sha256:1e4c9e707e11df98b555522738e99a0ece7f06cd157b53f4a0823de59a9b9478',
            'mediaType': 'application/vnd.docker.container.image.v1+json',
            'size': 1460},
 'layers': [{'digest': 'sha256:3aef78b622fab981bf1fab26571f5bdc024afe8a9a8e2557b20d19243798e620',
             'mediaType': 'application/vnd.docker.image.rootfs.diff.tar.gzip',
             'size': 948824}],
 'mediaType': 'application/vnd.docker.distribution.manifest.v2+json',
```

Ok, I think the mystery is mostly solved. We can get the list of images, use those records' digests to request manifests per specific arch digest (instead of requesting the "default" one for the first architecture), and save that manifest along with the image manifest, which has the dates information.
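Selecting the per-architecture manifest digests from such a "fat" manifest list could be done with a small helper like the sketch below (the function name is mine; the key names follow the manifest-list JSON shown above):

```python
def manifests_by_arch(manifest_list):
    """Map 'arch' or 'arch-variant' -> manifest digest from a "fat"
    manifest list (application/vnd.docker.distribution.manifest.list.v2+json)."""
    out = {}
    for entry in manifest_list.get("manifests", []):
        plat = entry.get("platform", {})
        key = plat.get("architecture", "unknown")
        if plat.get("variant"):
            key = f"{key}-{plat['variant']}"
        out[key] = entry["digest"]
    return out
```

Each of those digests could then be fed back to the /manifests/ endpoint to fetch the architecture-specific manifest, as in the pdb session above.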

yarikoptic commented 4 years ago

now there is https://github.com/datalad/datalad-container/pull/135, which, if merged, should close this issue