Open yarikoptic opened 4 years ago
@kyleam - do you have some easy wrapper around docker that, given a docker hub "repository" (such as https://hub.docker.com/_/neurodebian), would mint all the needed calls to datalad containers-add?
Hmm, does this come down to wanting a containers-add call for each tag in the repository? If so, perhaps something like this:
import requests

endpoint = "https://hub.docker.com/v2"
repo = "neurodebian"
next_page = f"{endpoint}/repositories/library/{repo}/tags"
while next_page:
    print(next_page)
    response = requests.get(next_page)
    response.raise_for_status()
    data = response.json()
    # The listing is paginated: "next" holds the next page URL, or null on the last page.
    next_page = data.get("next")
    for result in data.get("results", []):
        print(result["name"])
        # containers-add call if container isn't already present
I've pushed a rough prototype, which may or may not be a viable start, to kyleam/datalad-container-dhub-tags (obviously not its final home if it ends up being useful). Running python containers_add_dhub_tags.py --help will give a few more details, but basically
echo neurodebian | python containers_add_dhub_tags.py
should lead to a containers-add --url dhub:// ... call for each tag under the neurodebian repository, and
echo repronim/ | python containers_add_dhub_tags.py
should lead to a containers-add --url dhub:// ... call for each tag in each repository under repronim.
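As a rough illustration of the two input forms above (this is not the prototype's actual code), each stdin line could be interpreted along these lines. The helper name parse_spec and the convention of returning None to mean "every repository under the namespace" are assumptions made for this sketch; the "library" namespace for official images comes from the Hub URL used earlier in the thread.

```python
def parse_spec(line):
    """Split one input line into (namespace, repository).

    Hypothetical helper for illustration only:
      "neurodebian" -> ("library", "neurodebian")
      "repronim/"   -> ("repronim", None), i.e. every repository under repronim
    """
    spec = line.strip()
    if not spec:
        raise ValueError("empty repository specification")
    if "/" not in spec:
        # Official images live under the "library" namespace on Docker Hub.
        return "library", spec
    namespace, _, repo = spec.partition("/")
    if not namespace:
        raise ValueError(f"missing namespace in {spec!r}")
    # An empty repository part (trailing slash) means "all repositories".
    return namespace, repo or None
```

Reading one specification per line is presumably what makes stdin convenient here: several repositories or namespaces can be piped in at once.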
awesome -- I will give it a shot now. I am just curious -- why stdin? ;-)
It works! Some observations I would like your input on -- I might not be seeing some possible issues:
- since there is only one version for a tag at a given point in time when we run it, and otherwise information about the tag would be lost, why not create those container names following tags, e.g. instead of neurodebian-f5098f2 for dhub://library/neurodebian:xenial-non-free it would be just neurodebian--xenial-non-free (or have a library- prefix for consistency), or, if we really want to include the image hexsha (they aren't sortable -- humans would get confused which one to use), it could be neurodebian--xenial-non-free--f5098f2 (we cannot use . in container names, I believe, otherwise it could have been a dot).
- even though not immediately usable, it would imho still be useful then to place those images not under flat .datalad/images but to establish a uniform hierarchy of directories reflecting those docker hub repositories etc., so smth like
datalad containers-add neurodebian--xenial-non-free -u dhub://library/neurodebian:xenial-non-free -i dhub/library/neurodebian/xenial-non-free--f5098f2
then it would be easy to navigate/see how many we have without requiring datalad containers-list (or knowing to look/grep .datalad/config).
NB it could even be -i dhub://library/neurodebian/xenial-non-free/f5098f2 thus really reflecting the original URL ;)
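To make the proposed naming concrete, here is a minimal sketch that derives the container name, the image path, and the resulting command line from a namespace, repository, tag, and digest. The helper names, the 7-character digest abbreviation, and the rule that only letters, digits, and "-" are kept in names are assumptions for illustration, not what the prototype script does.

```python
import re


def clean(part):
    # Assumption: only letters, digits and "-" are safe in container names,
    # so anything else (e.g. "." or "_") is replaced with "-".
    return re.sub(r"[^A-Za-z0-9-]", "-", part)


def short_digest(digest, length=7):
    # "sha256:f5098f2..." -> "f5098f2"
    return digest.split(":", 1)[-1][:length]


def container_name(repo, tag, digest=None):
    # The digest is optional because abbreviated hashes are not sortable.
    parts = [clean(repo), clean(tag)]
    if digest:
        parts.append(short_digest(digest))
    return "--".join(parts)


def image_path(namespace, repo, tag, digest):
    # Directory hierarchy mirroring the Docker Hub layout.
    return f"dhub/{namespace}/{repo}/{tag}--{short_digest(digest)}"


def containers_add_command(namespace, repo, tag, digest):
    # Returned as an argument list, suitable for subprocess.run().
    return [
        "datalad", "containers-add", container_name(repo, tag),
        "-u", f"dhub://{namespace}/{repo}:{tag}",
        "-i", image_path(namespace, repo, tag, digest),
    ]
```

With namespace "library", repository "neurodebian", tag "xenial-non-free", and a digest starting with f5098f2, this yields the same arguments as the command line proposed above.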
> - since there is only one version for a tag at a given point in time when we run it, and otherwise information about the tag would be lost, why not to create those container names following tags, e.g. instead of neurodebian-f5098f2 for dhub://library/neurodebian:xenial-non-free it to be just neurodebian--xenial-non-free
> [...]
For the initial pass, the digest gave me an easy thing to use without worrying about cleaning the names. There is (or was, after the latest push) a todo comment about this. I've switched it to cleaning the names now.
> - even though not immediately usable, it would imho still be useful then to place those images not under flat .datalad/images but to establish uniform hierarchy of directories reflecting those docker hub repositories etc., so smth like
I've added namespace/repo subdirectories under the directories for images and manifests.
Please feel free to tweak and to move the script to wherever you'd like it.
@kyleam do you know if there is some date associated with the manifest (I do not see anything in the manifest)?
> do you know if there is some date associated with the manifest (I do not see anything in the manifest)?
I'm not sure I understand your question, but either way I'm confident that I don't know.
FWIW, I see that the response headers for the tag contain only the current Date, not some Last-Modified:
(Pdb) p resp_man.headers
{'Content-Length': '2189',
 'Content-Type': 'application/vnd.docker.distribution.manifest.v2+json',
 'Docker-Content-Digest': 'sha256:9e131ac6f30d682d71cbdbcd98e0c40b0b730e179172535dce4c5a82a2283c26',
 'Docker-Distribution-Api-Version': 'registry/2.0',
 'Etag': '"sha256:9e131ac6f30d682d71cbdbcd98e0c40b0b730e179172535dce4c5a82a2283c26"',
 'Date': 'Fri, 30 Oct 2020 16:47:35 GMT',
 'Strict-Transport-Security': 'max-age=31536000'}
why am I asking -- I thought to add the datetime to the image directory/name so it would provide ordering among images
FWIW, found the dates in the listing of tags:
(Pdb) pprint(data)
{'count': 72,
 'next': 'https://hub.docker.com/v2/repositories/library/neurodebian/tags?page=2',
 'previous': None,
 'results': [{'creator': 2215,
              'full_size': 46057490,
              'id': 13429244,
              'image_id': None,
              'images': [{'architecture': 'amd64',
                          'digest': 'sha256:9e131ac6f30d682d71cbdbcd98e0c40b0b730e179172535dce4c5a82a2283c26',
                          'features': '',
                          'last_pulled': '2020-10-30T14:15:58.27871Z',
                          'last_pushed': '2020-10-23T18:40:17.764065Z',
                          'os': 'linux',
                          'os_features': '',
                          'os_version': None,
                          'size': 46057490,
                          'status': 'active',
                          'variant': None}],
              'last_updated': '2020-10-23T18:40:49.940676Z',
              'last_updater': 1156886,
              'last_updater_username': 'doijanky',
              'name': 'xenial-non-free',
              'repository': 42825,
              'tag_last_pulled': '2020-10-30T14:15:58.27871Z',
              'tag_last_pushed': '2020-10-23T18:40:49.940676Z',
              'tag_status': 'active',
              'v2': True},
so I will take last_pushed for the image, which might work (unless the same image is re-pushed periodically for some reason?). Also, it suggests that, generally speaking, we might want to encode the architecture as well, and maybe even dump that image information record alongside.
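A small sketch of how last_pushed and the architecture could be turned into sortable labels for the images of one tag record from the listing above. The label layout (timestamp, architecture, abbreviated digest) and the function names are assumptions for illustration; only the field names come from the API output shown.

```python
from datetime import datetime, timezone


def parse_hub_time(stamp):
    """Parse a Docker Hub timestamp such as '2020-10-23T18:40:17.764065Z'."""
    # Try with and without fractional seconds.
    for fmt in ("%Y-%m-%dT%H:%M:%S.%fZ", "%Y-%m-%dT%H:%M:%SZ"):
        try:
            return datetime.strptime(stamp, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp: {stamp!r}")


def image_labels(tag_record):
    """Return sortable '<pushed>-<arch>-<digest>' labels for a tag's images."""
    labels = []
    for image in tag_record.get("images") or []:
        pushed = image.get("last_pushed")
        digest = image.get("digest")
        if not pushed or not digest:
            # Skip records that lack the fields needed for a label.
            continue
        when = parse_hub_time(pushed)
        short = digest.split(":", 1)[-1][:7]
        labels.append(f"{when:%Y%m%dT%H%M%SZ}-{image['architecture']}-{short}")
    # Leading timestamps make lexical order match chronological order.
    return sorted(labels)
```

Putting the timestamp first is what provides the ordering among images that abbreviated digests alone cannot give.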
eh, I am confused between those "images" (multiple per tag) and the "manifest" digest being just one, which does correspond to the image ID (but not really to the digest record in the list of images), e.g. if I look at "busybox" (with a bunch of architectures):
So is the "image id" (which is in the manifest) the same across architectures?
yeap -- manifest is for all archs. Then it is possible to pull a specific image by giving that digest from images:
$> docker pull busybox@sha256:b8946184ce3ad6b4a09ebad2d85e81cfcaadc6897bfae2e9c6e2a4fe6afa6ee0
sha256:b8946184ce3ad6b4a09ebad2d85e81cfcaadc6897bfae2e9c6e2a4fe6afa6ee0: Pulling from library/busybox
5dce72bf4214: Pull complete
Digest: sha256:b8946184ce3ad6b4a09ebad2d85e81cfcaadc6897bfae2e9c6e2a4fe6afa6ee0
Status: Downloaded newer image for busybox@sha256:b8946184ce3ad6b4a09ebad2d85e81cfcaadc6897bfae2e9c6e2a4fe6afa6ee0
$> docker images --digests --all | grep busybox
busybox <none> sha256:b8946184ce3ad6b4a09ebad2d85e81cfcaadc6897bfae2e9c6e2a4fe6afa6ee0 65a89d0f0344 2 weeks ago 1.4MB
busybox latest sha256:a9286defaba7b3a519d585ba0e37d0b2cbee74ebfe590960b0b1d6a5e97d1e1d f0b02e9d092d 2 weeks ago 1.23MB
busybox latest sha256:c9249fdf56138f0d929e2080ae98ee9cb2946f71498fc1484288e6a935b5e5bc f0b02e9d092d 2 weeks ago 1.23MB
which is I guess what docker pull does -- chooses the one for the matching architecture, but it could be hand-twisted to run the one desired (arm64 in this example)
$> docker run -it --rm busybox@sha256:b8946184ce3ad6b4a09ebad2d85e81cfcaadc6897bfae2e9c6e2a4fe6afa6ee0
standard_init_linux.go:178: exec user process caused "exec format error"
so, to support dumping images for various architectures there should also be a layer with <arch>-digest/.
another note (sorry for abusing this issue for that) summary: so far I found no "datetime" for the tag manifest, and the tag manifest could change while the referenced images stay the same (with the same dates) or change, so in principle there is no "datetime" for a tag (manifest), only for the images... but the next confusion point -- the manifest record lists layers (a single one) -- how come there is only 1? (I bet images for different architectures have different ones... uff)
edit: if I do not request a specific type of record for the manifest, the returned record does match what is described in the API doc and is the record for a specific architecture
edit2: https://docs.docker.com/registry/spec/manifest-v2-2/ came to the rescue. I can request a "fat" list of manifests and then it lists them for all archs!
Ok, I think the mystery is somewhat solved. We can get a list of images, use the digests from those records to request manifests per specific arch digest (instead of requesting the "default" one for the first architecture), and save that manifest along with the image record, which has the date information.
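For reference, a minimal sketch of the "fat" manifest part: the Accept header that asks the registry for a manifest list, and a helper that maps each platform in such a list to its per-architecture manifest digest. The media type and the manifests/platform/digest layout follow the manifest v2 schema 2 document linked above; the function names are hypothetical, and obtaining a token and sending the actual request are left out.

```python
# Media type of a manifest list ("fat manifest") in the v2 schema 2 spec.
MANIFEST_LIST_TYPE = "application/vnd.docker.distribution.manifest.list.v2+json"


def manifest_list_headers(token=None):
    """Build request headers asking for the multi-arch manifest list."""
    headers = {"Accept": MANIFEST_LIST_TYPE}
    if token:
        # Assumption: the registry expects a bearer token for pulls.
        headers["Authorization"] = f"Bearer {token}"
    return headers


def digests_by_platform(manifest_list):
    """Map 'os/architecture[/variant]' to the manifest digest for that platform."""
    result = {}
    for entry in manifest_list.get("manifests", []):
        platform = entry.get("platform", {})
        fields = (
            platform.get("os"),
            platform.get("architecture"),
            platform.get("variant"),
        )
        # Drop missing fields (variant is often absent).
        key = "/".join(field for field in fields if field)
        result[key] = entry["digest"]
    return result
```

Each digest returned this way could then be used to request the manifest for that specific architecture, which is the per-arch step described above.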
now there is https://github.com/datalad/datalad-container/pull/135 , which should close this issue if merged
e.g. immediate usecase is "backup" of all neurodebian containers from https://hub.docker.com/_/neurodebian . I only wonder if it should all be dumped here or should be a separate dataset (probably separate).