davidemms / OrthoFinder

Phylogenetic orthology inference for comparative genomics
https://davidemms.github.io/
GNU General Public License v3.0

Splitting up reconciliation step into multiple runs #241

Open brendane opened 5 years ago

brendane commented 5 years ago

I'm running OrthoFinder 2.2.7 on about 400 bacterial genomes. It has been working well, but the final step (starting from a species tree and orthogroups using the -fg and -s options) gets through only about 1,000 of 30,000 orthogroups in 96 hours when running on 4 cores (-a 4) with 62 GB of memory.

96 hours is the longest time allowed for the standard queue on the HPC system I'm using. While there are options for longer run times, I'm reluctant to commit to a potentially very long batch job if there is a way to break the work up into multiple jobs.

So, I'm wondering if there is a way to break the orthogroups into several sets and run the reconciliation algorithm on each set separately?

I'm using the DendroBLAST method rather than multiple sequence alignments, and gene tree construction seems to take less than 24 hours. I am also using the binary version of OrthoFinder and running it on CentOS 7.5.

Thank you.

davidemms commented 5 years ago

Hi Brendan

It's a good question, and I've had trouble with this one myself in the past. I've not tackled it yet because, for your 400 genomes, OrthoFinder will be writing out 400 x 400 = 160,000 orthologue files. At the same time, each orthogroup (the unit that would be parallelised over) could contain pairs of orthologues from each species pair, meaning each task could have to write to every file! It's not insurmountable, but it will involve careful management of inter-dependent parallel tasks. It's a good reminder though; I'll have a look and see how much work it'd be.
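One common way around the "every task writes to every file" problem is a scatter/gather layout: each job writes all of its orthologue pairs to a single private shard file, and a final merge step regroups the rows by species pair. The sketch below only illustrates that pattern. It is not OrthoFinder code, and the function names, the four-column TSV layout (species A, species B, gene A, gene B) and the shard file names are all assumptions made for the example.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def split_into_batches(items, n_batches):
    """Split a list into n_batches contiguous, nearly equal chunks."""
    size, extra = divmod(len(items), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        end = start + size + (1 if i < extra else 0)
        batches.append(items[start:end])
        start = end
    return batches


def write_shard(batch_id, ortholog_pairs, out_dir):
    """Scatter step: one job writes all of its rows to its own file.

    ortholog_pairs is an iterable of (species_a, species_b, gene_a, gene_b)
    tuples (a hypothetical layout), so no two jobs share an output file.
    """
    path = Path(out_dir) / f"shard_{batch_id}.tsv"
    with open(path, "w", newline="") as fh:
        csv.writer(fh, delimiter="\t").writerows(ortholog_pairs)
    return path


def merge_shards(shard_paths):
    """Gather step: regroup rows from every shard by species pair."""
    by_pair = defaultdict(list)
    for path in shard_paths:
        with open(path, newline="") as fh:
            for sp_a, sp_b, gene_a, gene_b in csv.reader(fh, delimiter="\t"):
                by_pair[(sp_a, sp_b)].append((gene_a, gene_b))
    return dict(by_pair)


if __name__ == "__main__":
    # Hypothetical orthogroup IDs, split across 4 independent jobs.
    orthogroups = [f"OG{i:07d}" for i in range(10)]
    for batch_id, batch in enumerate(split_into_batches(orthogroups, 4)):
        print(batch_id, batch)

    # Two toy shards merged into per-species-pair lists.
    with tempfile.TemporaryDirectory() as tmp:
        shards = [
            write_shard(0, [("sp1", "sp2", "g1", "g2")], tmp),
            write_shard(1, [("sp1", "sp2", "g3", "g4")], tmp),
        ]
        print(merge_shards(shards))
```

With this layout the number of files written during the parallel phase equals the number of jobs, and the per-species-pair files are only produced once, in the merge step. Whether that fits OrthoFinder's internals is an open question; the merge here also holds everything in memory, which may not scale to 400 genomes without streaming.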

All the best,
David