broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition between one off use cases to massive scale production environments
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

How to call samtools index as its own task? #6182

Open ghost opened 3 years ago

ghost commented 3 years ago

Hi, I am trying to create two tasks: MarkDuplicates, followed by samtools index on the MarkDuplicates output. Each runs in a separate Docker image. The issue, I think, is that when samtools indexes a BAM file it writes the index into the same directory as the BAM. So if the input to samtools is the output from MarkDuplicates, I get this output error:

java.io.FileNotFoundException: Could not process output, file not found: /home/coyote/cromwell/WGS/cromwell-executions/AlignBwaMem/87cfebcc-b103-4e15-8313-0f39a8a959d5/call-md/shard-1/execution/SRR13481471.md.bam.bai

because for some reason both SRR13481471.md.bam and SRR13481471.md.bam.bai are in the inputs folder of the samtools task, i.e. samtools indexes the BAM in the inputs folder and the resulting index goes to that inputs folder, not the execution folder. How do we handle this? There must be a way, since this must happen a lot. Why does Cromwell think the output index file should be in the MarkDuplicates folder ("md" in my case)?

Here are my calls and tasks:

call md_bqsr.markdupsIndiv as md {
    input :
        bam = samsort.outputBam,
        outputPrefix = s.outputPrefix,
        runtime_params = standard_runtime_gatk
}

call samtools.index as samindex {
    input :
        bam = md.bamMD,
        runtime_params = standard_runtime_samtools
}

task markdupsIndiv {
    input {
        File bam
        String outputPrefix

        # runtime
        RuntimeGATK runtime_params
    }

    command {
        set -e -o pipefail
        export GATK_LOCAL_JAR=~{default="/root/gatk.jar" runtime_params.gatk_override}

        gatk --java-options "-Xmx~{runtime_params.memory}m" MarkDuplicates \
            -I ~{bam} \
            -O ~{outputPrefix}.md.bam \
            -M ~{outputPrefix}.md.metrics
    }

    runtime {
        maxRetries: runtime_params.max_retries
        memory: runtime_params.memory + " MB"
        cpu: runtime_params.cpu
        docker: runtime_params.gatk_docker
    }

    output {
        File bamMD = "~{outputPrefix}.md.bam"
        File metrics = "~{outputPrefix}.md.metrics"
    }   
}

task index {
    input {
        File bam

        # runtime
        RuntimeSamtools runtime_params
    }

    command {
        set -e -o pipefail

        /opt/samtools/bin/samtools index ~{bam} ~{bam}.bai
    }

    runtime {
        maxRetries: runtime_params.max_retries
        memory: runtime_params.memory + " MB"
        cpu: runtime_params.cpu
        docker: runtime_params.samtools_docker
    }

    output {
        File indexedBam = "~{bam}.bai"
    } 

}

I didn't have this issue when I called samtools index locally inside the MarkDuplicates task, without Docker.

EDIT: It seems that CWL has the same issue, and that it is resolved there by copying the index file from the input directory to the output directory. How do we do this in WDL?

Arguments: [
    "/opt/samtools/bin/samtools", "index", "$(runtime.outdir)/$(inputs.bam.basename)", "$(runtime.outdir)/$(inputs.bam.basename).bai",
    { valueFrom: " && ", shellQuote: false },
    "cp", "$(inputs.bam.basename).bai", "$(runtime.outdir)/$(inputs.bam.nameroot).bai"
]
requirements:
    - class: ShellCommandRequirement
    - class: DockerRequirement
      dockerPull: "mgibio/samtools-cwl:1.0.0"
    - class: ResourceRequirement
      ramMin: 4000
    - class: InitialWorkDirRequirement
      listing:
        - ${ var f = inputs.bam; delete f.secondaryFiles; return f }
    - class: InlineJavascriptRequirement

illusional commented 3 years ago

Hey @bolton-lab, just FYI, this is probably more of a question for the WDL forum. But generally, as you've noted, you don't want to write outputs into the directory your inputs are localised to. If you need to mutate or reuse an input, copy it into your execution folder by adding a cp to your command block, together with the basename function, e.g.:

# task index {
    command {
        set -e -o pipefail
        cp ~{bam} ~{basename(bam)}
        /opt/samtools/bin/samtools index ~{basename(bam)} ~{basename(bam)}.bai
    }
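
The command fragment above leaves the output block unchanged. For the fix to work, the output most likely has to change too, because "~{bam}.bai" is built from the full path of the input rather than from a path in the execution directory. Below is a minimal sketch of the whole task with both changes, not a tested implementation. The field types in the RuntimeSamtools struct are assumptions (the thread never shows its definition), and the private declaration bamName is a name introduced here for illustration.

```wdl
version 1.0

# Assumed shape of the poster's struct; the field types are guesses,
# since the original definition is not shown in the thread.
struct RuntimeSamtools {
    Int max_retries
    Int memory
    Int cpu
    String samtools_docker
}

task index {
    input {
        File bam

        # runtime
        RuntimeSamtools runtime_params
    }

    # File name without its directory, e.g. SRR13481471.md.bam
    # (bamName is a hypothetical helper name, not from the original task)
    String bamName = basename(bam)

    command {
        set -e -o pipefail

        # Copy the localized input into the task's working (execution) directory
        cp ~{bam} ~{bamName}

        # Index the copy, so the .bai is written next to it in the execution directory
        /opt/samtools/bin/samtools index ~{bamName} ~{bamName}.bai
    }

    runtime {
        maxRetries: runtime_params.max_retries
        memory: runtime_params.memory + " MB"
        cpu: runtime_params.cpu
        docker: runtime_params.samtools_docker
    }

    output {
        # Relative path, resolved against the execution directory
        File indexedBam = "~{bamName}.bai"
    }
}
```

With this version the existing call (samtools.index as samindex, with bam = md.bamMD) should work as written, since the task's input and output names are unchanged. The trade-off is one extra copy of the BAM per shard.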