genome / analysis-workflows

Open workflow definitions for genomic analysis from MGI at WUSM.
MIT License

Revisit BAM indexing in the workflows #738

Open tmooney opened 5 years ago

tmooney commented 5 years ago

As mentioned in comments on #698, a problem can arise when the input step already yields an indexed file. There's no reason to run the indexer again when the input is already indexed, so we should really remove the indexing step entirely in those cases.

CWL Conditional support, when it is released, may make worrying about this less complicated, but it'd still be more efficient to not do anything when we know an index exists 😄

chrisamiller commented 4 years ago

Am I understanding correctly here that this could be fixed with the following approach?

a) pass optional secondary files into the index step

b) add a little bash script to the index step that says "run samtools index if the index doesn't exist"

The index step will still get called, which is inefficient, but at least it won't duplicate work
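A minimal sketch of the check described in (b), assuming the index (if any) is staged next to the BAM as either `sample.bam.bai` or `sample.bai`. The function name `index_if_missing` and the `INDEX_CMD` override are hypothetical names introduced here for illustration; they are not part of the existing workflows. This only covers the shell logic, not how the step would be wired up in CWL.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: run the indexer only when no index sits next to the BAM.
# INDEX_CMD is an assumed override for illustration; it defaults to `samtools index`.
index_if_missing() {
    # Check both common naming conventions: sample.bam.bai and sample.bai
    if [ -e "$1.bai" ] || [ -e "${1%.bam}.bai" ]; then
        echo "index found for $1, skipping"
    else
        echo "no index for $1, indexing"
        ${INDEX_CMD:-samtools index} "$1"
    fi
}
```

Calling `index_if_missing sample.bam` would then be a no-op (apart from the message) whenever an index is already present, and would fall back to `samtools index sample.bam` otherwise.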

This same approach could be considered for VCF indexing steps.

chrisamiller commented 4 years ago

I spent some time looking into this, and it turns out to be difficult to accomplish without conditionals, because the tool needs to know whether secondary files are being passed in or not (they can't be optional, or at least Cromwell doesn't support that). I'm removing this from the 2.0 milestone, and we can revisit if Cromwell is updated.