broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition between one off use cases to massive scale production environments
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Feature request: If provided intermediate files within workflow, ability to skip previous steps #2949

Closed sooheelee closed 4 years ago

sooheelee commented 6 years ago

This would greatly reduce my need to modify WDL scripts to start where I have data already processed. For example, if a script goes BAM-->coverage-->CNVs, if I have already collected coverage on my BAMs, I would like to be able to provide coverage to the same script and have Cromwell skip the tasks involving the BAM and run the remaining steps in the workflow, e.g. coverage-->CNVs.

I run WDLs using gcloud, within a VM and locally. I don't use FireCloud so my runs do not use call-caching.

I want to use the boilerplate WDL scripts that the GATK4 repo makes available for running processes. I am specifically looking at the latest somatic CNV workflow. If I have already padded my intervals and/or collected counts on the BAMs, I'd still like to use the remaining steps in the workflow by specifying an intermediate file in the inputs JSON.

If the script is thus:

    call CNVTasks.PreprocessIntervals {
        input:
            intervals = intervals,
            ref_fasta_dict = ref_fasta_dict,
            gatk4_jar_override = gatk4_jar_override,
            gatk_docker = gatk_docker
    }

    if (select_first([do_explicit_gc_correction, false])) {
        call CNVTasks.AnnotateIntervals {
            input:
                intervals = PreprocessIntervals.preprocessed_intervals,
                ref_fasta = ref_fasta,
                ref_fasta_fai = ref_fasta_fai,
                ref_fasta_dict = ref_fasta_dict,
                gatk4_jar_override = gatk4_jar_override,
                gatk_docker = gatk_docker
        }
    }

In the inputs, instead of defining:

"CNVSomaticPanelWorkflow.intervals": "File",

I would like to be able to instead provide:

"CNVSomaticPanelWorkflow.PreprocessIntervals.preprocessed_intervals": "File",

The run should then not fail because the CNVSomaticPanelWorkflow.intervals file is missing.

I would really appreciate such a feature as it saves me the time of having to rewrite WDL scripts for each tweaked subset workflow. Thanks.
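For comparison, the behavior requested above can be approximated inside the WDL itself, though this is exactly the kind of script modification the request aims to avoid. The following is a minimal draft-2 sketch, not the GATK4 workflow: the workflow name `SkipIfProvided`, the input `preprocessed_intervals_override`, and both task bodies are hypothetical stand-ins. The upstream call runs only when no intermediate file is supplied, and `select_first` picks whichever file exists.

```wdl
# Hypothetical stand-in for CNVTasks.PreprocessIntervals (the real task lives in the GATK4 repo).
task PreprocessIntervals {
    File intervals

    command {
        cp ${intervals} preprocessed.interval_list
    }
    output {
        File preprocessed_intervals = "preprocessed.interval_list"
    }
}

# Hypothetical downstream task that consumes the preprocessed intervals.
task UseIntervals {
    File intervals

    command {
        cp ${intervals} downstream_result.txt
    }
    output {
        File result = "downstream_result.txt"
    }
}

workflow SkipIfProvided {
    # Both inputs are optional: supply either the raw intervals
    # or an already-preprocessed file (assumed name, not part of the GATK4 WDL).
    File? intervals
    File? preprocessed_intervals_override

    # Run the upstream step only when no intermediate file was provided.
    if (!defined(preprocessed_intervals_override)) {
        call PreprocessIntervals {
            input:
                # select_first fails at runtime if `intervals` is also missing.
                intervals = select_first([intervals])
        }
    }

    # Outside the `if` block the call output is optional,
    # so select_first picks the override first, else the computed file.
    File intervals_to_use = select_first([preprocessed_intervals_override, PreprocessIntervals.preprocessed_intervals])

    call UseIntervals {
        input:
            intervals = intervals_to_use
    }
}
```

With this sketch, an inputs JSON containing only "SkipIfProvided.preprocessed_intervals_override" would skip PreprocessIntervals, while one containing only "SkipIfProvided.intervals" would run the full chain. The cost is one optional input and one `if` block per skippable step.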

geoffjentry commented 4 years ago

We believe that using a persistent database would solve this via call caching, even in a local environment.