broadinstitute / cromwell

Scientific workflow engine designed for simplicity and scalability. Trivially transition from one-off use cases to massive-scale production environments.
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Backend resources allocation on local runs #6966

Open Overcraft90 opened 1 year ago

Overcraft90 commented 1 year ago

Are you seeing something that looks like a bug? Please attach as much information as possible. No. The job is killed because of memory limitations, e.g. /bin/bash: line 1: 172 Killed

Which backend are you running? I'm running Cromwell on a local machine. I should have enough memory to run the process, but somehow the amount visible/accessible to Cromwell is limited.

Paste/Attach your workflow if possible:

version 1.0

workflow step2 {
    input {
        String PANGENIE_CONTAINER = "overcraft90/eblerjana_pangenie:2.1.2"

        File FORWARD_FASTQ # compressed R1
        File REVERSE_FASTQ # compressed R2
        String NAME = "sample" # how to loop over sample names in numerical order (maybe grab the name prefix)!?

        File PANGENOME_VCF # input vcf with variants to be genotyped
        File REF_GENOME # reference for variant calling
        String VCF_PREFIX = "genotype" # string to attach to a sample's genotype
        String EXE_PATH = "/app/pangenie/build/src/PanGenie" # path to PanGenie executable in Docker

        Int CORES = 24 # number of cores to allocate for PanGenie execution
        Int DISK = 300 # disk space in GB for output files
        Int MEM = 100 # RAM in GB to allocate
    }

    call reads_extraction_and_merging {
        input:
        in_container_pangenie=PANGENIE_CONTAINER,
        in_forward_fastq=FORWARD_FASTQ,
        in_reverse_fastq=REVERSE_FASTQ,
        in_label=NAME, #later can be plural
        in_cores=CORES,
        in_disk=DISK,
        in_mem=MEM
    }

    call genome_inference {
        input:
        in_container_pangenie=PANGENIE_CONTAINER, # not sure whether Docker needs to be re-run
        in_pangenome_vcf=PANGENOME_VCF,
        in_reference_genome=REF_GENOME,
        in_executable=EXE_PATH,
        in_fastq_file=reads_extraction_and_merging.fastq_file, # how to feed a task output to another one!!!
        prefix_vcf=VCF_PREFIX,
        in_cores=CORES,
        in_disk=DISK,
        in_mem=MEM
    }

    output {
        File sample = reads_extraction_and_merging.fastq_file
        File genotype = genome_inference.vcf_file
    }
}

task reads_extraction_and_merging {
    input {
        String in_container_pangenie
        File in_forward_fastq
        File in_reverse_fastq
        String in_label
        Int in_cores
        Int in_disk
        Int in_mem
    }
    command <<<
        # Concatenate gzipped R1/R2 and decompress with pigz across ~{in_cores} threads
        cat ~{in_forward_fastq} ~{in_reverse_fastq} | pigz -dcp ~{in_cores} > ~{in_label}.fastq
    >>>
    output {
        File fastq_file = "~{in_label}.fastq"
    }
    runtime {
        docker: in_container_pangenie
        memory: in_mem + " GB"
        cpu: in_cores
        disks: "local-disk " + in_disk + " SSD"
    }
}

task genome_inference {
    input {
        String in_container_pangenie
        File in_reference_genome
        File in_pangenome_vcf
        String in_executable
        File in_fastq_file
        String prefix_vcf
        Int in_cores
        Int in_disk
        Int in_mem
    }
    command <<<
        echo "vcf: ~{in_pangenome_vcf}" > /app/pangenie/pipelines/run-from-callset/config.yaml
        echo "reference: ~{in_reference_genome}" >> /app/pangenie/pipelines/run-from-callset/config.yaml
        echo $'reads:\n sample: ~{in_fastq_file}' >> /app/pangenie/pipelines/run-from-callset/config.yaml
        echo "pangenie: ~{in_executable}" >> /app/pangenie/pipelines/run-from-callset/config.yaml
        echo "outdir: /app/pangenie" >> /app/pangenie/pipelines/run-from-callset/config.yaml
        cd /app/pangenie/pipelines/run-from-callset
        snakemake --cores ~{in_cores}
    >>>
    output {
        File vcf_file = "~{prefix_vcf}.vcf"
    }
    runtime {
        docker: in_container_pangenie
        memory: in_mem + " GB"
        cpu: in_cores
        disks: "local-disk " + in_disk + " SSD"
        preemptible: 1 # can be useful for tools which execute sequential steps in a pipeline generating intermediate outputs
    }
}

Paste your configuration if possible, MAKE SURE TO OMIT PASSWORDS, TOKENS AND OTHER SENSITIVE MATERIAL: [screenshot attached: Screenshot from 2022-12-09 10-52-16]

Please help me understand how to set the resources used by Cromwell when running locally: what file do I need to create/modify, or how should I change my code? Thanks in advance!
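For anyone looking for the concrete file: Cromwell takes a HOCON configuration file at startup, and resource behavior for local runs lives under the Local backend provider. Below is a minimal sketch, assuming the config-based local backend described in the Cromwell docs; the file name local.conf, the job limit, and the cpu/memory defaults are illustrative, not taken from this issue:

# local.conf -- a sketch, adjust limits for your machine
include required(classpath("application"))

backend {
  default = Local
  providers {
    Local {
      actor-factory = "cromwell.backend.impl.sfs.config.ConfigBackendLifecycleActorFactory"
      config {
        # Run one task at a time so a single heavy job can use most of the host RAM
        concurrent-job-limit = 1

        # Declare cpu/memory so they can be forwarded below; the stock local
        # backend only declares docker, which is why other runtime keys are ignored
        runtime-attributes = """
        String? docker
        Int cpu = 1
        Float memory_gb = 2.0
        """

        submit = "/usr/bin/env bash ${script}"

        # Forward the WDL runtime cpu/memory to docker run (illustrative flags)
        submit-docker = """
        docker run --rm -i \
          --cpus ${cpu} --memory ${memory_gb}g \
          --entrypoint /bin/bash \
          -v ${cwd}:${docker_cwd} \
          ${docker} ${docker_script}
        """
      }
    }
  }
}

It would then be launched with something like:

java -Dconfig.file=local.conf -jar cromwell.jar run step2.wdl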

aofarrel commented 1 year ago

I'm not a Cromwell dev, but I've dealt with this quite a lot, so I have some experience here...

When resource issues happen on local Cromwell, it is usually due to scattered tasks either all running at once (which is the default behavior) or, if they're running one at a time, getting stuck. But none of your tasks are scattered, so the usual easy fixes don't apply.
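For reference, the knob that caps how many jobs run at once is concurrent-job-limit. A one-line sketch in HOCON dotted-path form (the value is illustrative; it belongs in the same kind of backend configuration file sketched earlier in the thread):

backend.providers.Local.config.concurrent-job-limit = 1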

Unfortunately, Cromwell ignores most of your runtime arguments when running in "local mode", including memory, cpu, and disk size. This isn't something you can configure out of the box; the default local backend simply doesn't know how to handle them. You'll see warnings to that effect when the tasks launch, e.g.:

[2022-12-13 12:11:22,26] [warn] LocalExample [5aba40a5]: Key/s [preemptible, disks, cpu, memory] is/are not supported by backend. Unsupported attributes will not be part of job executions.

One thing you can try to get around this is to make sure Docker is getting as much memory as you can give it. If you're using Docker Desktop, you can do this under Preferences > Resources by cranking the memory slider as far to the right as you feel comfortable. But I notice you're using a Linux machine, so if this issue with the Dockstore CLI (which uses Cromwell to launch workflows) is any indication, it's probably a good idea to use Docker Engine instead of Docker Desktop; Docker Engine has a different way of configuring resources.
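A quick way to sanity-check what the Docker daemon can actually see (the image name and limits below just reuse this thread's values for illustration):

# Total memory visible to the Docker daemon, in bytes
docker info --format '{{.MemTotal}}'

# On Linux Docker Engine there is no global memory slider; containers are
# uncapped by default and limited per container instead, e.g.:
docker run --rm --memory 100g --cpus 24 overcraft90/eblerjana_pangenie:2.1.2 <command>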

If you're still having issues, please post a followup -- and others, please chime in too if you have ideas. Resource usage on local runs is a bit of a persistent issue with Cromwell.

Overcraft90 commented 1 year ago

@aofarrel Thanks a lot! This actually worked just fine until I hit a memory wall at 200 GB of RAM. In fact, my Docker container launches a tool that invokes a Snakemake pipeline for genome inference, the fourth step of which requires 200 GB of memory.

Prior to your suggestion, my Docker Engine was running with 20 GB of RAM; I then pushed it to 120 GB. This got me through the 3rd step of the Snakemake pipeline, which requires 100 GB of memory and is where my WDL run was previously terminated, but the entire script still cannot complete its execution due to the memory requirement.

With that said, I might keep this issue open for a bit longer; maybe someone can relate to this, and, most importantly, someone whose tool doesn't have such a high memory demand might actually get the job done with this simple workaround.