broadinstitute / gatk-sv

A structural variation pipeline for short-read sequencing
BSD 3-Clause "New" or "Revised" License
160 stars 71 forks source link

Index file conventions in WDLs #578

Open mwalker174 opened 11 months ago

mwalker174 commented 11 months ago

Many files types such as VCFs, CRAMS, BED, and evidence data files have companion index files that are sometimes required for fast retrieval over specific genomic intervals.

Overview:

Some WDLs require that the file be passed in explicitly. For example:

workflow {
  inputs {
    File vcf
    File vcf_index
  }
}

In some cases, its path is inferred from the file itself. For example,

inputs {
  File vcf
}
…
File vcf_index = vcf + ".tbi"

And in other cases, the index file may not be declared at all.

Considerations:

There are several different scenarios, depending on a number of factors, that affect the decision about how to handle the indexes:

  1. Whether declaring workflow inputs or task inputs.
  2. Whether or not the index is required for at least one tool or command (e.g. GATK requires VCF indexes, but zcat does not)
  3. Whether the file is localized or not. If it's not localized, then a tool will be streaming the data directly from cloud storage. In this case, most tools implicitly assume the presence of the index at the path ${FILE_URI}.tbi, and therefore an explicit declaration of the index is not needed.

Other design factors to consider:

  1. Whether the file is a fixed resource (e.g. a reference annotation intervals list), a user-provided file (e.g. CRAM), or an intermediate file generated by the pipeline.
  2. Passing around index file paths can bloat the WDLs, input JSONs, and workflow metadata.
  3. Explicit index file inputs makes it clear to users that the file must be present (particularly important for user-provided files).
  4. Generating an index in some instance is not possible. For example, if the file is not sorted.
  5. Generating indexes "on the fly" wastes compute, assumes the file is sorted, and may cause problems with certain backends, such as on shared filesystems.
  6. In rare cases (usually for user-provided inputs), the index file will be located in a different path from the main file. In these cases, it is necessary to move both files into the same directory (e.g. using cp, mv, or ln) unless the particular tool being used supports passing in the index path explicitly (most don't). Usually we use cp or ln, as mv can create problems on shared filesystems.

Proposal:

For the sake of development and UX, we should enforce consistent conventions regarding companion index files in the WDLs.

I'd propose the following rules to balance the above issues:

Task-level inputs:

  1. If the index is needed, then the index should be a required input. This is true regardless of whether the file is localized, as file with localization_optional=true will still localize files if unsupported by the backend.
  2. If the index is not needed, then the index should not be required as to support possibly unsorted inputs (note if the task assumes sorted input, requiring the index could be a way of ensuring that's true).
  3. If the index may or may not be needed (e.g. see ConcatVcfs), then the index should be an optional input, again to support the case of possibly unsorted files.

Workflow-level inputs:

  1. If the file is a user-provided input and the index is needed, then the index file should be explicitly required. In this case, tasks consuming the file and index should not assume they are in the same path (see (9) from the Considerations section).
  2. If the file is a user-provided input and the index is not needed, then the index file should be omitted.
  3. If the file is a user-provided input and the index is optional, then the index file should be optional (e.g. File? file_index) and documentation on its usage must be present.
  4. The same rules apply if the file is a fixed resource (e.g. repeatmasker intervals list) as to user-provided inputs above.
  5. If the file is an intermediate, meaning it is produced within the pipeline either within the workflow or by a previous module, then the index input should be omitted, and the presence of the index should be assumed to be ${FILE_URI}.tbi. This will cut down substantially on the number of inputs to keep track of, and shouldn't cause issues for users in the vast majority of use cases. There are less common scenarios where this could cause errors, such as a user manually copying files and forgetting to include indexes, but this is a power-user case and really not needed for running the pipeline as prescribed.

Task calls:

  1. For workflow task calls, index inputs can be inferred (when applicable) from the companion file path, e.g.:
    call MyTask {
    input:
    vcf=input_vcf,
    vcf_index=input_vcf + ".tbi"
    }

    Note that explicitly declared indexes, from the workflow inputs or a previous task call, should be used instead when possible.

Outputs:

  1. Due to proposal (8) above, it will important to always generate index files when possible and explicitly declare them as outputs always at the task-level and at workflow-level when the companion file is also a workflow output. The workflow-level output is particularly important for using the output-copying functionality in Cromwell, and for ensuring the indexes are populated in the Terra datatable, which prevents them from being cleared by various storage cleaning tools like fiss-mop.

Resource storage:

  1. For ease of development, any "fixed resource" indexable file (e.g. reference interval lists) should have its index companion present at the expected storage URL regardless of whether the index is required by the pipeline. This facilitates development in case the index is needed or in case a user wants to download the file for analyses that require the index as well. In addition, this also will help cut down on updating resources hosted in gs://gcp-public-data--broad-references.
  2. All indexes should be explicitly listed in the input value jsons, e.g. inputs/values/resources_hg38.json to facilitate resource mirroring.

Please feel free to discuss so we can finalize these conventions.

VJalili commented 10 months ago

Thank you, @mwalker174, for putting this together; this is extremely useful.

A few thoughts: