DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

WDL glob order seems arbitrary and breaks workflows that rely on it being consistent #5009

Closed: adamnovak closed 1 month ago

adamnovak commented 2 months ago

The VG WDL workflow has a task that splits a BAM up by chromosome. Then it globs all the resulting BAMs into one array, and all the BAM index files into another array, and expects the two arrays to have the same chromosomes' files in the same order so that corresponding files match:

https://github.com/vgteam/vg_wdl/blob/4f470247d7d8ae2e9e8e50ff5701ef6fab25f19e/tasks/bioinfo_utils.wdl#L490-L491

When I ran the workflow on Toil, that assumption did not hold: the two arrays came back in arbitrary order relative to each other, so corresponding files did not match. The mismatched order made it all the way to the end of my workflow, which called 0 variants because it was never using the right index file for its BAMs.

{
  "GiraffeDeepVariant.output_calling_bam_indexes": [
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr8.bam.bai",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr3.bam.bai",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr19.bam.bai",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr16.bam.bai",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr22.bam.bai",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr21.bam.bai",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr13.bam.bai",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr7.bam.bai",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr17.bam.bai",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chrY.bam.bai",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr5.bam.bai",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr11.bam.bai",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr14.bam.bai",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chrX.bam.bai",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr1.bam.bai",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr20.bam.bai",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr10.bam.bai",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr18.bam.bai",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr4.bam.bai",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr12.bam.bai",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr9.bam.bai",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr15.bam.bai",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr6.bam.bai",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr2.bam.bai"
  ],
  "GiraffeDeepVariant.output_calling_bams": [
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr5.bam",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr18.bam",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chrY.bam",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr12.bam",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr8.bam",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr10.bam",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr11.bam",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr16.bam",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr9.bam",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr21.bam",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr19.bam",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr7.bam",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr4.bam",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr22.bam",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chrX.bam",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr6.bam",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr17.bam",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr15.bam",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr2.bam",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr3.bam",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr14.bam",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr1.bam",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr13.bam",
    "output/wdl-calls/045a7e47-d56d-4fcc-809f-c710bb9dee26/HG002.chr20.bam"
  ],
...

We should probably make sure that glob order is consistent. I think we're meant to work exactly like Bash here, and I think Bash alphabetizes glob results.
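To illustrate what "consistent" would buy here, below is a minimal Python sketch, not Toil's actual code. The helper names `consistent_glob` and `misaligned` are hypothetical. It assumes each index is named exactly `<bam>.bai`, as in the output above, and it sorts by plain string comparison, which may differ from Bash's locale-dependent ordering.

```python
import glob
import os


def consistent_glob(pattern, root="."):
    """Return matches for `pattern` under `root` in a stable, sorted order.

    glob.glob() makes no promise about result order, so we sort explicitly.
    Note: `root` is joined into the pattern, so it should not itself
    contain glob metacharacters.
    """
    return sorted(glob.glob(os.path.join(root, pattern)))


def misaligned(bams, indexes):
    """Return the positions i where indexes[i] is not bams[i] + '.bai'."""
    return [
        i
        for i, (bam, bai) in enumerate(zip(bams, indexes))
        if bai != bam + ".bai"
    ]
```

Calling `misaligned(consistent_glob("*.bam", d), consistent_glob("*.bam.bai", d))` on a directory `d` of per-chromosome outputs would return an empty list when the two arrays line up. Run against the two arrays in the JSON above, it would flag nearly every position.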

┆Issue is synchronized with this Jira Story ┆Issue Number: TOIL-1613

adamnovak commented 2 months ago

Maybe we need to find a way to actually invoke the contained bash to do the glob, like the spec says we need to do? I have no idea how they think implementations are supposed to be able to accomplish that.

adamnovak commented 2 months ago

OK, the problem is this: we do invoke Bash (though not in the container) to do the globbing, but the spec says the glob should be interpreted as if it were part of echo <glob>. We instead use compgen -G, which lets us quote the glob and get newline-delimited (instead of space-delimited) results. The two ways of getting Bash to do the glob don't produce results in the same order.
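For anyone who wants to compare the two approaches side by side, here is a rough Python sketch that runs both through Bash. It is not Toil's implementation. The function names are hypothetical, it assumes `bash` is on the PATH, and it pins `LC_ALL=C` so the expansion order does not depend on the caller's locale. The `echo` variant splits on whitespace, so it only works for file names without spaces.

```python
import os
import subprocess


def _run_bash(script, pattern, cwd):
    """Run `script` in Bash with `pattern` as $1, from directory `cwd`."""
    env = dict(os.environ, LC_ALL="C")  # pin collation for reproducibility
    return subprocess.run(
        ["bash", "-c", script, "_", pattern],
        cwd=cwd, env=env, capture_output=True, text=True,
    )


def glob_via_echo(pattern, cwd):
    """Expand the glob as part of `echo <glob>` (space-delimited output)."""
    # $1 is deliberately unquoted so Bash performs pathname expansion on it.
    result = _run_bash("echo $1", pattern, cwd)
    return result.stdout.split()


def glob_via_compgen(pattern, cwd):
    """Expand the glob with `compgen -G` (newline-delimited output)."""
    # compgen exits non-zero when nothing matches; treat that as no results.
    result = _run_bash('compgen -G "$1"', pattern, cwd)
    return result.stdout.splitlines()
```

Comparing `glob_via_echo("*.bam", d)` with `glob_via_compgen("*.bam", d)` on the same directory should give the same set of names. Whether the order matches is exactly what is in question here.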