Open bpinsard opened 1 year ago
Thank you for the issue. (Un)fortunately we do not use docker images much with datalad-container, since before the advent of podman it was impossible to run them on HPCs. For that we typically used singularity containers, and even established https://github.com/ReproNim/containers (AKA ///repronim/containers), where we have singularity containers for all BIDS apps etc. On OSX we run them through docker. Consider using them instead "as a workaround".
On a side note, would it make sense to add the local docker service where the dataset is installed as a special remote.
sorry, not sure what you mean by "local docker service" so that dataset is installed there.
- add a cmdline flag to not add the image as an input to the run call (at the risks of failure).
datalad run (but apparently not containers-run) already has an --assume-ready option, which I guess would satisfy this desire if we exposed it in containers-run. We could maybe also make it accept extra-inputs (or maybe even an explicit image), not just inputs, to be more specific, and then not pass those into run as extra-inputs, so that no image would be fetched.
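A minimal sketch of the input-splitting this would amount to inside containers-run. The flag name skip_image and the helper are hypothetical illustrations, not existing API; today the image path is always forwarded as an extra input:

```python
def split_run_inputs(image_path, inputs, skip_image=False):
    """Decide what containers-run forwards to datalad run.

    ``skip_image`` is a hypothetical flag: when set, the container
    image is left out of ``extra_inputs``, so ``run`` never tries
    to fetch it (at the risk of failure if it is not present).
    """
    extra_inputs = [] if skip_image else [image_path]
    return list(inputs), extra_inputs
```

With skip_image=True, the image would then have to be provided by other means, e.g. an already-populated local docker daemon.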
delegate the get of the image to the docker adapter, which would first check if the image is present in the local docker.
that is unlikely, since get is done already in datalad core's run, pretty much to ensure that extra inputs are available. Delegating all the way to the docker adapter sounds too far from the current implementation architecture.
another immediately available solution is just to have a reckless clone of such a dataset, which would symlink .git/annex altogether, and thus nothing would really need to be get'ed if already obtained in the original one.
but then care and extra steps might be needed if you are planning to save results into that ephemeral cloned dataset, since annexed keys would be saved into the original repo's .git/annex. You would then need to push the results back into the tree, most likely have that throw-away clone announced dead, and run git annex fsck at the original one, I believe, so it realizes that it is actually the original location which contains those keys.
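The clone / push-back / dead / fsck sequence above could be laid out as the following command plan. This is a sketch only: the command spellings are real datalad and git-annex invocations, but the sequence is untested and origin/workdir are placeholders:

```python
def ephemeral_clone_plan(origin, workdir):
    """Command sequence for a throw-away reckless clone, as a sketch.

    The clone shares the annex object store of ``origin`` (so nothing
    needs to be get'ed), results are pushed back, the clone is
    announced dead, and a fsck at the origin lets annex re-learn
    which location actually holds the keys.
    """
    return [
        # symlinks .git/annex to the origin's, so no get is needed
        ["datalad", "clone", "--reckless", "ephemeral", origin, workdir],
        # ... containers-run / datalad save inside the clone here ...
        ["datalad", "push", "-d", workdir, "--to", "origin"],
        # mark the throw-away clone dead before discarding it
        ["git", "-C", workdir, "annex", "dead", "here"],
        ["datalad", "push", "-d", workdir, "--to", "origin", "--data", "nothing"],
        # let the origin realize it holds the keys saved in the clone
        ["git", "-C", origin, "annex", "fsck", "--fast"],
    ]
```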
Thanks for the detailed reply. I am aware of and using the singularity containers repo, which is indeed really useful on HPC.
More recently I have been investigating the docker option, as we are pushing datalad dataset creation to a new level: orchestrating all the pipelines with gitlab for conversion jobs (heudiconv, ...), preparation jobs (defacing, ...), QC jobs (mriqc, ...), processing jobs (*MRIPrep, ...), and testing jobs (bids-validator, ...) through gitlab-runners. In that environment, it is simpler to run docker than singularity-in-docker. However, it requires optimizing data transfer, and containers are a big part of the requirements for individual jobs.
sorry, not sure what you mean by "local docker service" so that dataset is installed there.
Let's say one of the jobs clones a dataset containing docker images (likely in a subdataset tracked for reproducibility) to a server. It is possible that the docker layers for the images required by the job have already been pulled into the server's docker instance (from previous jobs), and since the image is tracked with docker versioning, using such layers rather than pulling/importing from the datalad dataset is more efficient without breaking reproducibility.
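Checking whether a required image is already present in the server's docker instance could be as simple as matching image IDs against `docker images` output. A sketch, assuming the output is produced with `docker images --no-trunc --format '{{.ID}} {{.Repository}}:{{.Tag}}'`:

```python
def image_in_local_docker(image_id, docker_images_output):
    """Check whether an image ID appears in `docker images` output.

    Expects lines formatted as '<id> <repository>:<tag>', one image
    per line (as produced with --no-trunc and a Go-template --format).
    ``image_id`` may be given with or without the 'sha256:' prefix.
    """
    image_id = image_id.removeprefix("sha256:")
    for line in docker_images_output.splitlines():
        if not line.strip():
            continue
        full_id = line.split()[0].removeprefix("sha256:")
        if full_id == image_id:
            return True
    return False
```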
The ephemeral option could work (and the output data will never be in the shared containers repo), but as gitlab jobs are run within docker containers (and containers-run will then likely run docker-in-docker), it might be less obvious to properly bind the right volumes, and it will be less easily portable/deployable to multiple servers.
I will investigate the --assume-ready option, and maybe try to PR what you suggested.
@bpinsard thanks for the idea re a docker daemon special remote. It sounds appealing, but I need to think about this more.
Looking at the flow in the code, I think a docker-specific intervention might be less complex. Before containers-run executes run internally (which would trigger the download), we can inspect image_path and detect if it is docker-based. In principle it would be even better to not export them at all, but simply skip their download/ingestion to docker entirely; however, this would be a much larger break from the annexed-file paradigm.
I looked into this further. Pretty much all the necessary functionality is already implemented in the docker adaptor. After this last exploration I am considering the following approach:
- inspect cmd and detect if the docker adapter is used
- check image_path for unavailable files
- query docker images for the required image (matching the .json files in image_path)
- export the image from the local docker and git annex reinject --known the files
- run git annex fix on the image_path afterwards, which should be fast

Here is a draft implementation that shows the basics of the approach.
A helper is added to the docker adaptor (see inline comments):
diff --git a/datalad_container/adapters/docker.py b/datalad_container/adapters/docker.py
index 8949852..1a18146 100644
--- a/datalad_container/adapters/docker.py
+++ b/datalad_container/adapters/docker.py
@@ -13,6 +13,7 @@ import hashlib
import json
import os
import os.path as op
+from pathlib import Path
import subprocess as sp
import sys
import tarfile
@@ -153,6 +154,51 @@ def load(path, repo_tag, config):
return image_id
+def repopulate_from_daemon(contds, imgpath: Path) -> None:
+ # crude check whether anything at the image location is not
+ # locally present
+ contrepo = contds.repo
+ if contrepo.call_annex(
+ ['find', '--not', '--in', 'here'],
+ files=str(imgpath),
+ ):
+ # a docker image is a collection of files in a directory
+ assert imgpath.is_dir()
+ # we could look into `manifest.json`, but it might also be
+ # annexed and not around. instead look for the config filename
+ imgcfg = [
+ p.name for p in imgpath.iterdir()
+ if len(p.name) == 69 and p.name.endswith('.json')
+ ]
+ # there is only one
+ assert len(imgcfg) == 1
+
+ # look for the employed annex backend
+ backends = set(contrepo.call_annex_oneline([
+ 'find',
+ f'--branch=HEAD:{imgpath.relative_to(contds.pathobj)}',
+ '--anything',
+ '--format=${backend} ']).split())
+ # we can only deal with a single homogeneous backend here
+ assert len(backends) == 1
+
+ # ID is filename, minus .json extension
+ img_id = imgcfg[0][:-5]
+ with tempfile.TemporaryDirectory(dir=imgpath) as tmpdir:
+ # try to export the image from a local docker instance
+ save(f'sha256:{img_id}', tmpdir)
+ # the line above will raise an exception when
+ # - this docker does not have the image.
+ # - or there is no docker running at all.
+ contrepo.call_annex(
+ ['reinject', '--known', '--backend', backends.pop()],
+ files=[
+ str(p) for p in Path(tmpdir).glob('**/*')
+ if p.is_file()
+ ],
+ )
+
+
# Command-line
This helper is executed just before datalad-run on a "best-effort" basis:
diff --git a/datalad_container/containers_run.py b/datalad_container/containers_run.py
index d7b4e79..eabb448 100644
--- a/datalad_container/containers_run.py
+++ b/datalad_container/containers_run.py
@@ -4,6 +4,7 @@ __docformat__ = 'restructuredtext'
import logging
import os.path as op
+from pathlib import Path
import sys
from datalad.interface.base import Interface
@@ -163,6 +164,26 @@ class ContainersRun(Interface):
lgr.debug("extra_inputs = %r", extra_inputs)
+ if '-m datalad_container.adapters.docker run' in cmd:
+ # this will use the docker adapter to execute the container.
+ # let's check whether pieces of the container image are missing.
+ # if so, query the local docker, if it already has the respective
+ # image, in order to populate the dataset without having to go
+ # through a download -- pretty much an adhoc docker-special remote
+ from datalad_container.adapters.docker import repopulate_from_daemon
+ contds = require_dataset(
+ container['parentds'], check_installed=True,
+ purpose='check for docker images')
+ try:
+ repopulate_from_daemon(
+ contds,
+ # we use the container report here too, and not any of the
+ # processed variants from above to stay internally consistent
+ imgpath=Path(container['path']),
+ )
+ except Exception as e:
+ pass
+
with patch.dict('os.environ',
{CONTAINER_NAME_ENVVAR: container['name']}):
# fire!
Issues:
So it turns out that the manifest depends on how docker save is called. When sha256:<id> is used (like in the patch above), the manifest.json has "RepoTags":null. Normal operation (like the initial add) would have the container name here, i.e. "RepoTags":["busybox:latest"].
This complicates things a bit. When the manifest is annexed, we would need to match the query to its content (which we do not necessarily have). The generic approach would be to parse docker images (via _list_images() in the docker adaptor).
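One way around the null RepoTags is to fall back on the Config filename when matching against locally available images. A sketch, assuming the classic docker-save layout where "Config" is "<id>.json", and assuming something like _list_images() can supply an ID-to-tags mapping:

```python
def resolve_repo_tags(manifest_entry, local_images):
    """Recover repo tags for an image saved via `docker save sha256:<id>`.

    ``manifest_entry``: one entry of docker-save's manifest.json,
    where "RepoTags" may be null. ``local_images``: mapping of image
    IDs (64-char hex) to repo tags, e.g. built from `docker images`.
    """
    tags = manifest_entry.get("RepoTags")
    if tags:
        # normal case, e.g. after the initial add: tags were recorded
        return tags
    # fall back: the Config filename carries the image ID
    img_id = manifest_entry["Config"].removesuffix(".json")
    return local_images.get(img_id, [])
```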
I resolved the repo-tag issues. I will spend some time on documenting this, and then open a PR.
If one repeatedly clones datasets that contain a docker image on the same machine, each clone will need to get the layers from the remote, even though they could already be pulled in the local docker. For example, for parallel processing with fMRIPrep, one will clone the dataset as a temporary clone, but will need to get the docker layers of the fMRIPrep image, even if these are already present in the docker instance.
For now, datalad containers-run will run the image if it is present in docker, but still tries fetching the layers, doing unnecessary costly get operations. For example, if the layers were dropped recklessly, it tries to get the layers, fails, but runs the container anyway. I guess it would require changing
https://github.com/datalad/datalad-container/blob/66941a9414973d60926fef42684c4a2282c988f3/datalad_container/containers_run.py#L141
I can see 2 options:
On a side note, would it make sense to add the local docker service where the dataset is installed as a special remote.