common-workflow-language / cwl-utils

Python utilities for CWL
https://cwl-utils.readthedocs.io/
Apache License 2.0
36 stars 18 forks source link

Do not require docker save #247

Closed jfennick closed 1 year ago

jfennick commented 1 year ago

The primary purpose of cwl-docker-extract is to pull all images in a workflow (and subworkflows, etc) and populate the default docker / --container-engine cache. Additionally copying the images to another directory is often unnecessary and takes up 2X the disk space. For large workflows that have many unoptimized images, this could approach 100GB. This PR makes copying optional (for docker only).

codecov[bot] commented 1 year ago

Codecov Report

Merging #247 (ea43afc) into main (823b687) will decrease coverage by 0.01%. The diff coverage is 68.42%.

@@            Coverage Diff             @@
##             main     #247      +/-   ##
==========================================
- Coverage   32.03%   32.03%   -0.01%     
==========================================
  Files          30       30              
  Lines       30476    30484       +8     
  Branches     9133     9137       +4     
==========================================
+ Hits         9764     9766       +2     
- Misses      18160    18169       +9     
+ Partials     2552     2549       -3     
Files Coverage Δ
cwl_utils/image_puller.py 82.89% <84.61%> (-1.83%) :arrow_down:
cwl_utils/docker_extract.py 87.50% <33.33%> (-5.84%) :arrow_down:

... and 5 files with indirect coverage changes

mr-c commented 1 year ago

The primary purpose of cwl-docker-extract is to pull all images in a workflow (and subworkflows, etc) and populate the default docker / --container-engine cache. Additionally copying the images to another directory is often unnecessary and takes up 2X the disk space. For large workflows that have many unoptimized images, this could approach 100GB. This PR makes copying optional (for docker only).

Thanks for the PR. I'm glad to hear about your use case, though it is not the original use case for this script. There are HPC systems where the compute nodes have no internet access, and thus the docket containers have to be downloaded on a different system and transferred out of band to the compute nodes.

jfennick commented 1 year ago

The primary purpose of cwl-docker-extract is to pull all images in a workflow (and subworkflows, etc) and populate the default docker / --container-engine cache. Additionally copying the images to another directory is often unnecessary and takes up 2X the disk space. For large workflows that have many unoptimized images, this could approach 100GB. This PR makes copying optional (for docker only).

Thanks for the PR. I'm glad to hear about your use case, though it is not the original use case for this script. There are HPC systems where the compute nodes have no internet access, and thus the docket containers have to be downloaded on a different system and transferred out of band to the compute nodes.

Right, I should have phrased that differently. I also have to deal with an out of band HPC machine, and I agree that is one use case.

The use case I mentioned above is a workaround for the fact that cwltool only uses docker run when executing workflows, and by default docker run will only pull an image if one does not exist locally (--pull missing). If an image exists locally, but it is not in fact the latest image, docker run will execute the old image. This can lead to "works for me" situations where someone is developing a docker image on their local machine and periodically pushing, but everyone else gets one of the stale images and thus weird errors. Perhaps I'll make a PR for cwltool and/or Toil as well.

jfennick commented 1 year ago

https://cwltool.readthedocs.io/en/latest/cli.html#cmdoption-cwltool-force-docker-pull

Nevermind, I guess there's already an option for cwltool. It would be nice if --force-docker-pull is enabled by default.