Prioritize MSAs for gene families used in species tree inference

austinhpatton commented 1 year ago

This is a small change to force prioritization of MSA inference for the gene families we'll use to infer the species tree. The main reason is that (particularly when deployed on Tower) I would like us to first make headway towards species tree inference, since that rooted tree will be used in gene tree-species tree reconciliation on the "remainder" set of gene families, which are, by design, quite large and time-consuming.

I found that because the two subsets run largely asynchronously, the remainder set could hold up MSA and gene tree inference for the families used for the species tree, creating a counterproductive bottleneck.

The way I forced that priority was to use the collect() operator on the species tree gene family MSAs. Because I don't want to stage these (potentially thousands of) files, I then use, for purely practical reasons, the count operator and pass the result as an optional variable to the MAFFT module.

Happy to hear if that's a dumb way to do it or if there's a more intuitive solution! I just figured passing a single number would be easier and not particularly impactful, even if the variable is effectively meaningless.
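For reference, the gating described above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the process names ALIGN_SPP and ALIGN_REMAINING, the parameters params.spp_fastas and params.rem_fastas, and the input named n_spp_msas are all hypothetical. It assumes DSL2 and calls count directly on the output channel, which emits a single value only once every species tree alignment has finished.

```nextflow
nextflow.enable.dsl = 2

// Hypothetical input globs; the real pipeline defines its own inputs.
params.spp_fastas = 'spp_tree_families/*.fa'
params.rem_fastas = 'remaining_families/*.fa'

// Stand-in for the MSA step on species tree gene families.
process ALIGN_SPP {
    input:
    path fasta

    output:
    path "${fasta.baseName}.aln"

    script:
    """
    mafft --auto ${fasta} > ${fasta.baseName}.aln
    """
}

// Stand-in for the MSA step on the "remainder" gene families.
process ALIGN_REMAINING {
    input:
    path fasta
    val n_spp_msas   // never used in the script; it only delays task submission

    output:
    path "${fasta.baseName}.aln"

    script:
    """
    mafft --auto ${fasta} > ${fasta.baseName}.aln
    """
}

workflow {
    spp_fastas = Channel.fromPath(params.spp_fastas)
    rem_fastas = Channel.fromPath(params.rem_fastas)

    ALIGN_SPP(spp_fastas)

    // count emits one number after the upstream channel completes, so no
    // alignment files are staged into the downstream tasks.
    n_done = ALIGN_SPP.out.count()

    // No ALIGN_REMAINING task can start until n_done is available.
    ALIGN_REMAINING(rem_fastas, n_done)
}
```

Passing a single number rather than the collected files keeps each downstream task's work directory free of extra inputs, at the cost of an input that exists only to create the dependency.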

mertcelebi commented 1 year ago

I am a little confused about this PR. Are you trying to prioritize the MAFFT calls over any of the MAFFT_REMAINING calls, i.e., "finish all the MAFFT processes, then finish all the MAFFT_REMAINING processes"?

If so, there is probably a less hacky way of doing this, which may be worth googling.

austinhpatton commented 1 year ago

> I am a little confused about this PR. Are you trying to prioritize the MAFFT calls over any of the MAFFT_REMAINING calls, i.e., "finish all the MAFFT processes, then finish all the MAFFT_REMAINING processes"?
>
> If so, there is probably a less hacky way of doing this, which may be worth googling.

Yeah, I realize this is probably a bit out of left field as we haven't discussed it, but it is something I've been wanting to do for a while now (in some form or another).

Happy to chat about it in more detail, but your description is exactly right. Basically, the thinking is that we are just trying to give the orthogroups used in species tree inference (MAFFT -> IQTREE -> ASTEROID -> SPECIESRAX) a leg up over those involved only in gene tree-species tree reconciliation (MAFFT_REMAINING -> IQTREE_REMAINING -> GENERAX). That's because the latter set is composed of larger (and more time-consuming) gene families, and its last module, GENERAX, requires the species tree inferred by SPECIESRAX.

So, I just want to be sure that we start chugging along efficiently with the species tree "track" before investing effort into the larger gene families, which need to wait for the species tree regardless. From my understanding, Nextflow doesn't have any inherent way to give "priority" to different asynchronous tasks, so my solution was just to force the remainder process to wait. There may certainly still be better alternatives!

Arcadia-Science / noveltree

Prioritize MSAs for gene families used in species tree inference #39