datalad / datalad-container

DataLad extension for containerized environments
http://datalad.org

docker: do not get layers if present in local docker service #199

Open bpinsard opened 1 year ago

bpinsard commented 1 year ago

If one repeatedly clones datasets that contain a docker image on the same machine, each clone needs to get the layers from the remote even though the image may already be pulled in the local docker. For example, for parallel processing with fMRIPrep, one will clone the dataset as a temporary clone, but will need to get the docker layers of the fMRIPrep image, even if these are already present in the docker instance.

For now, datalad containers-run will run the image if it is present in docker but still tries to fetch the layers, performing unnecessary and costly get operations. For example, if the layers were dropped recklessly, it tries to get the layers, fails, but runs the container anyway.

$ datalad  containers-run -n alpine
[INFO   ] Making sure inputs are available (this may take some time) 
get(error): .datalad/environments/alpine/image/31f5b0c484b3651f7c883d1be2ef442c2da71d103bc7ab20cd0b1f825e67e4e7/layer.tar (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(impossible): .datalad/environments/alpine/image (directory) [could not get some content in /home/basile/data/tests/dataset_test/.datalad/environments/alpine/image ['/home/basile/data/tests/dataset_test/.datalad/environments/alpine/image/31f5b0c484b3651f7c883d1be2ef442c2da71d103bc7ab20cd0b1f825e67e4e7/layer.tar']]
[INFO   ] == Command start (output follows) ===== 
/tmp $ 
/tmp $ 
[INFO   ] == Command exit (modification check follows) ===== 
run(ok): /home/basile/data/tests/dataset_test (dataset) [python3 -m datalad_container.adapters.do...]
action summary:                                                                                                                                           
  get (error: 1, impossible: 1)
  run (ok: 1)
  save (notneeded: 1)

I guess it would require changing

https://github.com/datalad/datalad-container/blob/66941a9414973d60926fef42684c4a2282c988f3/datalad_container/containers_run.py#L141

I can see 2 options:

  • add a cmdline flag to not add the image as an input to the run call (at the risk of failure).
  • delegate the get of the image to the docker adapter, which would first check if the image is present in the local docker.

On a side note, would it make sense to add the local docker service where the dataset is installed as a special remote?

yarikoptic commented 1 year ago

Thank you for the issue. (Un)fortunately we do not use docker images much with datalad-container, since before the advent of podman it was impossible to run them on HPCs. For that we typically used singularity containers, and even established https://github.com/ReproNim/containers (AKA ///repronim/containers), where we have singularity containers for all BIDS apps etc. On OSX we run them through docker. Consider using them instead as a workaround.
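For reference, a minimal sketch of obtaining that collection via the datalad Python API (the target path here is arbitrary and just for illustration):

```python
import datalad.api as dl

# clone the ReproNim containers collection; '///' resolves against
# DataLad's default registry (datasets.datalad.org)
dl.clone(source='///repronim/containers', path='repronim-containers')
```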

> On a side note, would it make sense to add the local docker service where the dataset is installed as a special remote?

Sorry, not sure what you mean by "local docker service" where the dataset is installed.

> add a cmdline flag to not add the image as an input to the run call (at the risk of failure).

datalad run (but apparently not containers-run) already has an --assume-ready option, which I guess would satisfy this desire if we exposed it in containers-run. We could perhaps also make it accept extra-inputs (or maybe even an explicit image), not just all inputs, to be more specific, and then not pass those into run as extra-inputs, so that no image would be fetched.
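For illustration, here is a hedged sketch of what that existing datalad core option looks like today via the Python API (the dataset path, command, and input path are hypothetical; `assume_ready` belongs to `run` and is not yet exposed by `containers-run`):

```python
import datalad.api as dl

ds = dl.Dataset('/tmp/test-docker')  # hypothetical dataset

# with assume_ready='inputs', `run` trusts that the declared inputs are
# already present locally and skips the `get` that would otherwise fetch them
ds.run(
    'docker run --rm alpine:latest uname -a',
    inputs=['.datalad/environments/alpine/image'],
    assume_ready='inputs',
)
```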

> delegate the get of the image to the docker adapter, which would first check if the image is present in the local docker.

That is unlikely, since the get is already done in datalad core's run, pretty much to ensure that extra inputs are available. Delegating it all the way to the docker adapter sounds too far from the current implementation architecture.

Another immediately available solution is just to make a reckless clone of such a dataset, which would symlink .git/annex altogether, so nothing would really need to be fetched if it was already obtained in the original clone.

Here is a demo; you would likely use the same approach if the containers were singularity images:

```shell
❯ datalad --dbg containers-run -n alpine-310 uname -a
[INFO ] Making sure inputs are available (this may take some time)
[INFO ] == Command start (output follows) =====
Linux f435385a0d00 6.1.0-3-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.8-1 (2023-01-29) x86_64 Linux
[INFO ] == Command exit (modification check follows) =====
run(ok): /tmp/test-docker (dataset) [python3.11 -m datalad_container.adapters...]
action summary:
  get (notneeded: 1)
  run (ok: 1)
  save (notneeded: 1)

❯ datalad clone $PWD $PWD-fullclone; datalad -C $PWD-fullclone containers-run -n alpine-310 uname -a
install(ok): /tmp/test-docker-fullclone (dataset)
[INFO ] Making sure inputs are available (this may take some time)
get(ok): .datalad/environments/alpine-310/image/d7a1cfafe8443e658337f34911cbdb26304eaf248b182ee2cabd093b5e9edab2/VERSION (file) [from origin...]
get(ok): .datalad/environments/alpine-310/image/d7a1cfafe8443e658337f34911cbdb26304eaf248b182ee2cabd093b5e9edab2/json (file) [from origin...]
get(ok): .datalad/environments/alpine-310/image/d7a1cfafe8443e658337f34911cbdb26304eaf248b182ee2cabd093b5e9edab2/layer.tar (file) [from origin...]
get(ok): .datalad/environments/alpine-310/image/e7b300aee9f9bf3433d32bc9305bfdd22183beb59d933b48d77ab56ba53a197a.json (file) [from origin...]
get(ok): .datalad/environments/alpine-310/image/manifest.json (file) [from origin...]
get(ok): .datalad/environments/alpine-310/image/repositories (file) [from origin...]
get(ok): .datalad/environments/alpine-310/image (directory)
[INFO ] == Command start (output follows) =====
Linux 8b267f7059b8 6.1.0-3-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.8-1 (2023-01-29) x86_64 Linux
[INFO ] == Command exit (modification check follows) =====
run(ok): /tmp/test-docker-fullclone (dataset) [python3.11 -m datalad_container.adapters...]
action summary:
  get (ok: 7)
  run (ok: 1)
  save (notneeded: 1)

❯ datalad clone --reckless ephemeral $PWD $PWD-ephemeral; datalad -C $PWD-ephemeral containers-run -n alpine-310 uname -a
install(ok): /tmp/test-docker-ephemeral (dataset)
[INFO ] Making sure inputs are available (this may take some time)
[INFO ] == Command start (output follows) =====
Linux a04ba1bd2d45 6.1.0-3-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.8-1 (2023-01-29) x86_64 Linux
[INFO ] == Command exit (modification check follows) =====
run(ok): /tmp/test-docker-ephemeral (dataset) [python3.11 -m datalad_container.adapters...]
action summary:
  get (notneeded: 1)
  run (ok: 1)
  save (notneeded: 1)
```

But then care and extra steps might be needed if you are planning to save results into that ephemeral clone, since annexed keys would be saved into the original repo's .git/annex. You would then need to push the results back into the tree, most likely have that throw-away clone announced dead, and, I believe, run git annex fsck on the original one so it realizes that it is actually the original location which contains those keys.
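To make those extra steps concrete, here is a hedged sketch using the datalad Python API (paths follow the demo above; the exact sequence may need adjustment for a given setup):

```python
import datalad.api as dl

clone = dl.Dataset('/tmp/test-docker-ephemeral')  # the throw-away ephemeral clone
origin = dl.Dataset('/tmp/test-docker')           # the original dataset

# push the newly saved results back into the original dataset
clone.push(to='origin')

# announce the throw-away clone dead, so git-annex no longer expects it to
# hold any content once it is deleted, and propagate that state
clone.repo.call_annex(['dead', 'here'])
clone.push(to='origin', data='nothing')

# have the original re-verify which annexed keys it actually holds
origin.repo.call_annex(['fsck', '--fast'])
```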

bpinsard commented 1 year ago

Thanks for the detailed reply. I am aware of and using the singularity containers repo, which is indeed really useful on HPC.

I have more recently been investigating the docker option, as we are pushing datalad dataset creation to a new level: orchestrating all the pipelines with gitlab, through gitlab-runners, for conversion jobs (heudiconv, ...), preparation jobs (defacing, ...), QC jobs (mriqc, ...), processing jobs (*MRIPrep, ...), and testing jobs (bids-validator, ...). In that environment, it is simpler to run docker than singularity-in-docker. However, it requires optimizing data transfer, and containers are a big part of the requirements for individual jobs.

> Sorry, not sure what you mean by "local docker service" where the dataset is installed.

Let's say one of the jobs clones a dataset containing docker images (likely in a subdataset tracked for reproducibility) to a server. It is possible that the docker layers for the images required by the job are already pulled in the server's docker instance (from previous jobs), and since the image is tracked with docker versioning, using those layers rather than pulling/importing from the datalad dataset is more efficient without breaking reproducibility.

The ephemeral option could work (and output data would never end up in the shared containers repo), but as gitlab jobs run within docker containers (so containers-run would likely run docker-in-docker), it might be less obvious to bind the right volumes properly, and it would be less easily portable/deployable to multiple servers.

I will investigate the --assume-ready option, and maybe try to PR what you suggested.

mih commented 11 months ago

@bpinsard thanks for the idea re a docker daemon special remote. It sounds appealing, but I need to think about this more.

Looking at the flow in the code, I think a docker-specific intervention might be less complex. Before containers-run executes run internally (which would trigger the download), we can check whether pieces of the container image are missing locally and, if so, repopulate them from the local docker daemon.

In principle it would be even better to not export them at all and simply skip their download/ingestion into docker entirely, but this would be a much larger break from the annexed-file paradigm.

mih commented 11 months ago

I looked into this further. Pretty much all the necessary functionality is already implemented in the docker adaptor. After this last exploration I am considering the approach drafted below.

mih commented 11 months ago

Here is a draft implementation that shows the basics of the approach.

A helper is added to the docker adaptor (see inline comments):

diff --git a/datalad_container/adapters/docker.py b/datalad_container/adapters/docker.py
index 8949852..1a18146 100644
--- a/datalad_container/adapters/docker.py
+++ b/datalad_container/adapters/docker.py
@@ -13,6 +13,7 @@ import hashlib
 import json
 import os
 import os.path as op
+from pathlib import Path
 import subprocess as sp
 import sys
 import tarfile
@@ -153,6 +154,51 @@ def load(path, repo_tag, config):
     return image_id

+def repopulate_from_daemon(contds, imgpath: Path) -> None:
+    # crude check whether anything at the image location is not
+    # locally present
+    contrepo = contds.repo
+    if contrepo.call_annex(
+        ['find', '--not', '--in', 'here'],
+        files=str(imgpath),
+    ):
+        # a docker image is a collection of files in a directory
+        assert imgpath.is_dir()
+        # we could look into `manifest.json`, but it might also be
+        # annexed and not around. instead look for the config filename
+        imgcfg = [
+            p.name for p in imgpath.iterdir()
+            if len(p.name) == 69 and p.name.endswith('.json')
+        ]
+        # there is only one
+        assert len(imgcfg) == 1
+
+        # look for the employed annex backend
+        backends = set(contrepo.call_annex_oneline([
+            'find',
+            f'--branch=HEAD:{imgpath.relative_to(contds.pathobj)}',
+            '--anything',
+            '--format=${backend} ']).split())
+        # we can only deal with a single homogeneous backend here
+        assert len(backends) == 1
+
+        # ID is filename, minus .json extension
+        img_id = imgcfg[0][:-5]
+        with tempfile.TemporaryDirectory(dir=imgpath) as tmpdir:
+            # try to export the image from a local docker instance
+            save(f'sha256:{img_id}', tmpdir)
+            # the line above will raise an exception when
+            # - this docker does not have the image.
+            # - or there is no docker running at all.
+            contrepo.call_annex(
+                ['reinject', '--known', '--backend', backends.pop()],
+                files=[
+                    str(p) for p in Path(tmpdir).glob('**/*')
+                    if p.is_file()
+                ],
+            )
+
+
 # Command-line

This helper is executed just before datalad-run, on a "best-effort" basis:

diff --git a/datalad_container/containers_run.py b/datalad_container/containers_run.py
index d7b4e79..eabb448 100644
--- a/datalad_container/containers_run.py
+++ b/datalad_container/containers_run.py
@@ -4,6 +4,7 @@ __docformat__ = 'restructuredtext'

 import logging
 import os.path as op
+from pathlib import Path
 import sys

 from datalad.interface.base import Interface
@@ -163,6 +164,26 @@ class ContainersRun(Interface):

         lgr.debug("extra_inputs = %r", extra_inputs)

+        if '-m datalad_container.adapters.docker run' in cmd:
+            # this will use the docker adapter to execute the container.
+            # let's check whether pieces of the container image are missing.
+            # if so, query the local docker, if it already has the respective
+            # image, in order to populate the dataset without having to go
+            # through a download -- pretty much an adhoc docker-special remote
+            from datalad_container.adapters.docker import repopulate_from_daemon
+            contds = require_dataset(
+                container['parentds'], check_installed=True,
+                purpose='check for docker images')
+            try:
+                repopulate_from_daemon(
+                    contds,
+                    # we use the container report here too, and not any of the
+                    # processed variants from above to stay internally consistent
+                    imgpath=Path(container['path']),
+                )
+            except Exception as e:
+                pass
+
         with patch.dict('os.environ',
                         {CONTAINER_NAME_ENVVAR: container['name']}):
             # fire!

Issues:

So it turns out that the manifest depends on how docker save is called.

When sha256:<id> is used (like in the patch above), the manifest.json has "RepoTags":null. Normal operation (like the initial add) would have the container name here, i.e. "RepoTags":["busybox:latest"].

This complicates things a bit. When the manifest is annexed, we would need to match the query to its content (which we do not necessarily have). The generic approach would be to parse docker images (via _list_images() in the docker adaptor).
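As a sketch of the kind of lookup this would require (not the eventual fix): resolve an image ID to the repo tags the local daemon knows, so that docker save can be invoked with a name:tag and produce a manifest.json whose "RepoTags" matches the annexed one. Plain docker image inspect is used here for illustration; in the adaptor this would more naturally go through _list_images():

```python
import json
import subprocess as sp


def repo_tags_for(image_id):
    """Return the RepoTags the local docker daemon knows for an image ID."""
    out = sp.check_output(
        ['docker', 'image', 'inspect',
         '--format', '{{json .RepoTags}}',
         f'sha256:{image_id}'],
        text=True,
    )
    return json.loads(out)


# e.g. repo_tags_for('<name of the 64-char config .json, minus extension>')
# -> ['busybox:latest']
```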

mih commented 11 months ago

I resolved the repo-tag issues. I will spend some time on documenting this, and then open a PR.