flass / pantagruel

a pipeline for reconciliation of phylogenetic histories within a bacterial pangenome
GNU General Public License v3.0

MrBayes (step 06) extremely slow? #38

Closed MartinezRuiz-Carlos closed 4 years ago

MartinezRuiz-Carlos commented 4 years ago

I am running Pantagruel on my local computer, as the HPC install unfortunately proved quite fiddly. All steps ran without problems up to step 06. I am aware that this step is recommended to be run on an HPC, so I assumed it would take a while on a local machine. It has now run for a week, and the directory 06.gene_trees/fullgenetree_mrbayes_trees/nocollapse contains 370 directories. Do I understand correctly that step 06 should generate a gene tree for all genes in cdsfams_minsize4? In that case MrBayes needs to go through 21,477 genes, and at the current rate that is not really practical, even if I were to parallelise it over several cores on an HPC. Is there a way I can accelerate this step? Perhaps by being more stringent with the set of core genes? I am currently running it on 298 genomes, with 89 core genes. As recommended in the manual, I am running step 06 with the --collapse option enabled. Thanks!
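To put "the current rate" into numbers, here is a minimal sketch of the extrapolation. It assumes that each directory corresponds to one finished gene tree and that the rate stays constant, neither of which is guaranteed (large families will take longer). The function name `eta_days` is only illustrative.

```shell
# Rough remaining run time in whole days, assuming a constant rate.
# All arithmetic is integer arithmetic, so the result is rounded down.
eta_days() {
    done_count=$1   # gene trees finished so far
    total=$2        # gene trees to compute in total
    elapsed=$3      # days elapsed so far
    echo $(( (total - done_count) * elapsed / done_count ))
}

# Figures from this report: 370 of 21,477 done after 7 days.
eta_days 370 21477 7   # prints 399
```

At this rate the remaining families would take over a year on this machine.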

flass commented 4 years ago

Hi Carlos,

Thank you for your interest in Pantagruel!

Indeed, the MrBayes step can be very heavy, especially if you have many diverse taxa in your species tree. In my experience, it remains quite fast (a few minutes per gene tree) up to ~100 taxa from the same bacterial species or genus; beyond that, computing gene trees becomes quite intensive, at least for the largest gene families. That is where the HPC implementation is favoured: each family runs as an independent subjob (within an array job), so that large families won't delay the computation of the rest of the dataset. With the default local implementation, families are run in parallel, with one family per thread. One possible problem is that if you have, say, 8 large families running in parallel and not much memory available, the concurrent MrBayes processes may exhaust your machine's memory, causing it to swap and thus slow down dramatically. Even without that, you can imagine that it may be slow compared to computing the gene trees on an HPC cluster, where each family would typically run on an 8-core node. In general, running Pantagruel on a laptop or desktop can be problematic because MrBayes runs are not lightweight computations, and the following step, ALE reconciliation, can be even more demanding (long, and potentially very memory-intensive).
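As a rough illustration of the memory point, the sketch below works out how many MrBayes processes could run side by side before memory, rather than the number of cores, becomes the limit. The function name `max_jobs` and the per-process figure of 4000 MB are hypothetical; the real footprint depends on the gene family and should be measured on your own data.

```shell
# Number of concurrent jobs that fit in memory, capped by the core count.
# Always returns at least 1 so that something can run.
max_jobs() {
    cores=$1        # threads available
    avail_mb=$2     # memory available, in MB
    per_job_mb=$3   # assumed peak memory of one MrBayes process, in MB
    by_mem=$(( avail_mb / per_job_mb ))
    if [ "$by_mem" -lt 1 ]; then
        by_mem=1
    fi
    if [ "$by_mem" -lt "$cores" ]; then
        echo "$by_mem"
    else
        echo "$cores"
    fi
}

# 8 cores, 16000 MB free, 4000 MB per process (hypothetical figure).
max_jobs 8 16000 4000   # prints 4
```

If the result is lower than the number of threads in use, reducing the parallelism should help avoid swapping.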

But a solution is out there! I recently integrated a new tool called GeneRax into Pantagruel, in the usingGeneRax branch. GeneRax does the combined job of both MrBayes and ALE, i.e. gene tree reconstruction and reconciliation. GeneRax is based on the same model as ALE, but uses an efficient maximum-likelihood inference of scenarios (instead of Bayesian inference), and importantly it co-estimates the gene tree topology and the reconciliation scenarios. In short, it is efficient and fast, and it is also free of the systemic bugs present in ALE (see https://github.com/ssolo/ALE/issues/16). For these reasons I aim to make the usingGeneRax branch the main branch of Pantagruel. This is not yet the case because I am still working on some of the post-processing scripts for tasks 08 and 09, but it is already possible to run the pipeline up to task 07, i.e. to compute the reconciled gene trees. I'll soon be pushing the code for the downstream analyses, and should make a first release on that occasion.

To switch to the usingGeneRax pipeline, you first need to install GeneRax, which is easy. I suggest compiling it using the provided install.sh script (it has always worked fine for me), or you can use the GeneRax bioconda package. Alternatively, there is the Pantagruel dependency Docker image, which provides all the supporting software required by Pantagruel and is available from Docker Hub with:

docker pull flass/pantagruel-dep:usingGeneRax-latest

Then you need to switch the branch of the Pantagruel pipeline code, and update your Pantagruel database configuration file to reflect this change:

# update pantagruel
cd /your/local/repo/pantagruel
git checkout usingGeneRax
git pull --recurse-submodules

# update your database
pantagruel -i /your/PantagruelDatabase/environ_pantagruel_PantagruelDatabase.sh --refresh -e GeneRax init
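If you want to confirm that the working copy really is on the intended branch, a minimal sketch using plain git is shown here. The helper name `on_branch` is only illustrative and is not part of Pantagruel.

```shell
# Succeeds only if the repository at $1 has branch $2 checked out.
# Fails on a detached HEAD, where no branch name is available.
on_branch() {
    repo=$1
    expected=$2
    current=$(git -C "$repo" symbolic-ref --short HEAD 2>/dev/null) || return 1
    [ "$current" = "$expected" ]
}

# Usage, with the placeholder path from above:
#   on_branch /your/local/repo/pantagruel usingGeneRax && echo "branch OK"
```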

Then, if you have already computed the full RAxML gene trees and their collapsed versions, you should be able to run task 07 directly:

pantagruel -i /your/PantagruelDatabase/environ_pantagruel_PantagruelDatabase.sh 07

But if you want to be sure everything is in order at the level of gene trees (task 06), you can first run:

pantagruel -i /your/PantagruelDatabase/environ_pantagruel_PantagruelDatabase.sh -R 06

I hope this helps! Please let me know if you encounter any issues.

Florent