broadinstitute / gatk-sv

A structural variation pipeline for short-read sequencing
BSD 3-Clause "New" or "Revised" License
170 stars 71 forks source link

Various docker build inconsistencies #515

Open MattWellie opened 1 year ago

MattWellie commented 1 year ago

Feature request

Slightly relevant to #513

Module(s) or script(s) involved

scripts/docker/build_docker.py inputs/values/dockers.json

Description

Apologies if this is a somewhat messy issue, I'll aim to revisit and polish.

We are currently trying to adopt GATK-SV into our own workflows, and to absorb all cost for this we want to build local versions of each relevant Docker image from scratch.

I've noticed a few idiosyncrasies in the build_docker.py script (some of which may be caused by my incorrect parsing of the process. If there is relevant documentation to mitigate these points please could you direct me to it!):

  1. It is not clear from the dockers.json file which images are built by this repo, and which images we would need to copy in from a public host
  2. the names of the docker files do not correspond to the image keys in the dockers.json dictionary. This makes it challenging to generate a usable dockers.json without bootstrapping from your existing files:
    • the dependencies matrix sets up the prerequisite builds for each image, as well as assigning each a key (e.g. samtools-cloud)
    • When writing the newly created images into the dockers.json file, the key to use is determined by the method get_target_from_image. This method determines each image's key based on the current docker image path, rather than the key in the dependencies matrix, or the named directory where each Dockerfile is found, e.g.
      
      # dockers.json
      {
      "samtools_cloud_docker": "artifactory/samtools-cloud:tag"
      }

dependency matrix entry

dependencies = { "samtools-cloud": ImageDependencies( git_dependencies="dockerfiles/samtools-cloud/*", docker_dependencies={ "sv-base-mini": "MINIBASE_IMAGE", "samtools-cloud-virtual-env": "VIRTUAL_ENV_IMAGE"} ) }

key definition

get_target_from_image("artifactory/samtools-cloud:tag") == 'samtools_cloud_docker'

  - `dockerfiles/sv-pipeline/Dockerfile` encodes a single image, but it is referred to in both the WDL files and `dockers.json` by all these various names.

"sv_pipeline_base_docker": "us.gcr.io/broad-dsde-methods/gatk-sv/sv-pipeline:2023-03-07-v0.26.10-beta-0c876f63", "sv_pipeline_docker": "us.gcr.io/broad-dsde-methods/gatk-sv/sv-pipeline:2023-03-07-v0.26.10-beta-0c876f63", "sv_pipeline_hail_docker": "us.gcr.io/broad-dsde-methods/gatk-sv/sv-pipeline:2023-03-07-v0.26.10-beta-0c876f63", "sv_pipeline_updates_docker": "us.gcr.io/broad-dsde-methods/gatk-sv/sv-pipeline:2023-03-07-v0.26.10-beta-0c876f63", "sv_pipeline_qc_docker": "us.gcr.io/broad-dsde-methods/gatk-sv/sv-pipeline:2023-03-07-v0.26.10-beta-0c876f63", "sv_pipeline_rdtest_docker": "us.gcr.io/broad-dsde-methods/gatk-sv/sv-pipeline:2023-03-07-v0.26.10-beta-0c876f63"



---

My issue summarised is:

If I use an empty dockers.json and set the `--targets` argument to `all` when building images:

- Many images are not built at all (e.g. wham, gatk) as they are not defined in this repo. I haven't found documentation that explains in advance which images are built here and which are exceptions.
- The `dockers.json` lacks many images by key; the duplication of a single image across multiple keys requires bootstrapping from the `dockers.json` file pointing to your own images, e.g. as above `sv-pipeline` is populated as a single key, but WDL files expect it referenced by all 6 various names.
- The images which are populated in `dockers.json` are not accepted by the WDL files contained here, e.g. `samtools-cloud` is built, but `samtools_cloud_docker` is required (note both `_docker` suffix and potholing instead of hyphenation)

---

Proposal?

- Enforce consistency between the image `key` in the dependency matrix, the name of the created docker image, and the label in the `dockers.json` file
- Remove all many-to-one labels in the `dockers.json`, and ensure that throughout the WDL files each image is only referenced by a single key
- Separate out external and internal docker images based on which are created by this repository and which are not. The image building process should populate one file fully, and that should be joined at runtime with all additional required images.
MattWellie commented 1 year ago

@VJalili I've made a PR in our fork of this codebase, removing the duplication whereby the same image is referred to by multiple names - I'm still testing it but so far so good 😬

https://github.com/populationgenomics/gatk-sv/pull/15

VJalili commented 1 year ago

@MattWellie that is a substantial refactoring! I hope it passes all your tests.

VJalili commented 5 months ago
  • dockerfiles/sv-pipeline/Dockerfile encodes a single image, but it is referred to in both the WDL files and dockers.json by all these various names.

Proposal?

  • Remove all many-to-one labels in the dockers.json, and ensure that throughout the WDL files each image is only referenced by a single key

Agree; it is confusing and adds unneeded complexities for configuring the workflows.

This is addressed in PRs #634, #638, #639, and #645.