broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Scatter with globs of large output arrays taking too long in "gather" virtual task #820

Closed: jsotobroad closed this issue 8 years ago

jsotobroad commented 8 years ago

When running the following example WDL task:

task SplitGvcfTouch {
  File interval_list
  String sample_name

  command <<<
    # cut -f1-3 returns <chromosome> <start> <stop>
    cat ${interval_list} | grep -v "@" | cut -f1-3 > regions.txt
    mkdir split_gvcfs
    piece=0
    while read -r chrom start stop; do
      OUT_GVCF="printf ${sample_name}.%04d.g.vcf.gz $piece"
      OUT_GVCF_INDEX="printf ${sample_name}.%04d.g.vcf.gz.tbi $piece"
      touch split_gvcfs/$($OUT_GVCF)
      touch split_gvcfs/$($OUT_GVCF_INDEX)
      piece=$(($piece+1))
    done < regions.txt

  >>>
  runtime {
    docker: "broadinstitute/genomes-in-the-cloud:1.1044_with_gatk4"
    memory: "3 GB"
    cpu: "1"
    disks: "local-disk 50 HDD"
    #preemptible: 3
  }
  output {
    Array[File] gvcf_list = glob("split_gvcfs/*.gz")
    Array[File] gvcf_index_list = glob("split_gvcfs/*.tbi")
  }
}

where SplitGvcfTouch is called like:

  scatter (idx in indexing_list) {
    call SplitGvcfTouch {
      input:
        sample_name = sub(sub(gvcf_list[idx], "gs://.*/",""), ".g.vcf.gz$", ""),
        interval_list = split_interval_list
    }
  }

indexing_list is an array of the integers 0-94, sample_name can be any string, and interval_list is the attached file wgs_split_10000000_tiledb.intervalist.txt
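For reference, here is a minimal sketch of how these inputs might be declared at the workflow level; the workflow name and the exact declarations are assumptions for illustration, not taken from the original pipeline:

workflow SplitGvcfsExample {
  # Hypothetical wrapper workflow; only the task call mirrors the snippet above.
  Array[Int] indexing_list        # the integers 0-94, one entry per input GVCF
  Array[String] gvcf_list         # gs:// paths from which sample_name is derived
  File split_interval_list        # e.g. the attached wgs_split_10000000_tiledb.intervalist.txt

  scatter (idx in indexing_list) {
    call SplitGvcfTouch {
      input:
        sample_name = sub(sub(gvcf_list[idx], "gs://.*/", ""), ".g.vcf.gz$", ""),
        interval_list = split_interval_list
    }
  }
}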

With these inputs, each scattered task should be globbing an array of 901 elements for both gvcf_list and gvcf_index_list.
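The gather that is slow here is Cromwell's implicit collection of those per-shard outputs when they are referenced outside the scatter block. A minimal sketch of what that has to produce (the declaration names below are assumptions, placed after the scatter in the wrapping workflow):

  # Each shard emits Array[File] outputs, so outside the scatter they are
  # implicitly gathered into nested arrays: roughly 95 shards x 901 files
  # each, i.e. about 85,595 File entries per output array.
  Array[Array[File]] all_gvcfs        = SplitGvcfTouch.gvcf_list
  Array[Array[File]] all_gvcf_indexes = SplitGvcfTouch.gvcf_index_list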

When this is run on the JES backend, the timing diagram shows 25-30 minutes of "cromwell final overhead", which is much longer than anything seen previously. Once all of the scatter tasks have completed, the implicit gather task starts but never finishes (at least I haven't seen it finish yet). This task also causes issues when trying to call-cache previous results.

kcibul commented 8 years ago

I confirmed with EB that the file used here does not need to be protected, so that should make things easier.

Whoever takes this ticket: if you find it still takes too long to run, let me know and we can work together to slim down the use case even further (though it may be just fine the way it is).