broadinstitute / cromwell

Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments
http://cromwell.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Scatter with globs of large output arrays taking too long in "gather" virtual task #820

Closed: jsotobroad closed this issue 8 years ago

jsotobroad commented 8 years ago

When running the following example WDL task:

task SplitGvcfTouch {
  File interval_list
  String sample_name

  command <<<
    # cut -f1-3 returns <chromosome> <start> <stop>
    cat ${interval_list} | grep -v "@" | cut -f1-3 > regions.txt
    mkdir split_gvcfs
    piece=0
    while read -r chrom start stop; do
      OUT_GVCF="printf ${sample_name}.%04d.g.vcf.gz $piece"
      OUT_GVCF_INDEX="printf ${sample_name}.%04d.g.vcf.gz.tbi $piece"
      touch split_gvcfs/$($OUT_GVCF)
      touch split_gvcfs/$($OUT_GVCF_INDEX)
      piece=$(($piece+1))
    done < regions.txt

  >>>
  runtime {
    docker: "broadinstitute/genomes-in-the-cloud:1.1044_with_gatk4"
    memory: "3 GB"
    cpu: "1"
    disks: "local-disk 50 HDD"
    #preemptible: 3
  }
  output {
    Array[File] gvcf_list = glob("split_gvcfs/*.gz")
    Array[File] gvcf_index_list = glob("split_gvcfs/*.tbi")
  }
}

where SplitGvcfTouch is called like:

  scatter (idx in indexing_list) {
    call SplitGvcfTouch {
      input:
        sample_name = sub(sub(gvcf_list[idx], "gs://.*/",""), ".g.vcf.gz$", ""),
        interval_list = split_interval_list
    }
  }

indexing_list is an array of the integers 0-94, sample_name can be any string, and interval_list is the attached file wgs_split_10000000_tiledb.intervalist.txt
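For reference, here is a minimal sketch of how these inputs might be declared at the workflow level; the workflow name and the exact declarations are assumptions for illustration, not taken from the original pipeline:

workflow SplitGvcfsExample {
  # Hypothetical wrapper workflow; only the task call mirrors the snippet above.
  Array[Int] indexing_list        # the integers 0-94, one entry per input GVCF
  Array[String] gvcf_list         # gs:// paths from which sample_name is derived
  File split_interval_list        # e.g. the attached wgs_split_10000000_tiledb.intervalist.txt

  scatter (idx in indexing_list) {
    call SplitGvcfTouch {
      input:
        sample_name = sub(sub(gvcf_list[idx], "gs://.*/", ""), ".g.vcf.gz$", ""),
        interval_list = split_interval_list
    }
  }
}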

With these inputs, each scattered task should be globbing an array of 901 elements for both gvcf_list and gvcf_index_list.
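The gather that is slow here is Cromwell's implicit collection of those per-shard outputs when they are referenced outside the scatter block. A minimal sketch of what that has to produce (the declaration names below are assumptions, placed after the scatter in the wrapping workflow):

  # Each shard emits Array[File] outputs, so outside the scatter they are
  # implicitly gathered into nested arrays: roughly 95 shards x 901 files
  # each, i.e. about 85,595 File entries per output array.
  Array[Array[File]] all_gvcfs        = SplitGvcfTouch.gvcf_list
  Array[Array[File]] all_gvcf_indexes = SplitGvcfTouch.gvcf_index_list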

When this is run on the JES backend, the timing diagram shows 25-30 minutes of "cromwell final overhead", which is much longer than anything seen previously. Once all of the scatter tasks have completed, the implicit gather task starts but never finishes (at least I haven't seen it finish yet). This task also causes issues when trying to call-cache previous results.

kcibul commented 8 years ago

I confirmed with EB that the file used here does not need to be protected, so that should make things easier.

Whoever takes this ticket: if you find it still takes too long to run, let me know and we can work together to slim down the use case even further (though it may be just fine the way it is).