amkozlov / raxml-ng

RAxML Next Generation: faster, easier-to-use and more flexible
GNU Affero General Public License v3.0

enhancement suggestion: use 'rapidnj' for starting trees #89

Open roblanf opened 4 years ago

roblanf commented 4 years ago

This is just a suggestion based on some observations of estimating large (>10K sequences) trees. I note that it takes raxml-ng a long time to estimate the starting parsimony tree (I let it run for an hour then killed it). Of course, I could use a random tree but that probably makes the later optimisation impractical.

On the same data, I was able to estimate a surprisingly good tree with rapidnj (https://github.com/johnlees/rapidnj) in ~3 minutes on one not-very-fast CPU. More details are here: https://github.com/roblanf/sarscov2phylo/blob/master/tree_estimation.md

So I thought I'd mention it here in case it is useful. Perhaps for datasets above a certain number of tips one could switch to rapidnj starting trees. The 10 starting trees could be either 10 rapidnj bootstrap trees, or the rapidnj tree plus nine others that are each one SPR move away from it.
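As a rough illustration of the size-based switch suggested above, here is a minimal sketch that picks the value for raxml-ng's `--tree` option. The 10,000-tip threshold, the function names and the file names are all hypothetical; nothing like this exists in raxml-ng itself, and the NJ tree is assumed to have been written to a Newick file beforehand by a separate rapidnj run.

```python
import shlex


def starting_tree_arg(n_tips, nj_tree_path, tip_threshold=10_000, n_pars=10):
    """Return the value to pass to raxml-ng's --tree option.

    tip_threshold is an arbitrary, hypothetical cut-off: above it we use
    a precomputed NJ tree file, below it we ask for parsimony trees.
    """
    if n_tips > tip_threshold:
        return nj_tree_path           # user-supplied starting tree file
    return f"pars{{{n_pars}}}"        # e.g. "pars{10}"


def raxml_ng_command(msa, model, n_tips, nj_tree_path):
    """Build (but do not run) a raxml-ng command line as a list of arguments."""
    return [
        "raxml-ng",
        "--msa", msa,
        "--model", model,
        "--tree", starting_tree_arg(n_tips, nj_tree_path),
    ]


# Hypothetical file names, for illustration only.
print(shlex.join(raxml_ng_command("aln.fasta", "GTR+G", 12_000, "rapidnj.nwk")))
```

The command is only built and printed here, so the sketch can be tried without either program installed.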

And thanks for the excellent software!

amkozlov commented 4 years ago

Hi Rob, thanks for your suggestion. We will definitely consider adding NJ starting trees as an option! Parsimony is not parallelized in raxml-ng, which is one of the reasons it is quite slow on large datasets.

Just a side note: we are also working with this virus data (surprise :) ), so it was very interesting to read about your experiments. I'm wondering, however: why do you want to keep duplicate sequences in your analyses? After removing identical sequences (and doing some filtering), we end up with a dataset of <5K sequences, on which we can run a full raxml-ng tree search in "just" a few hours.
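For readers wondering what removing identical sequences involves, here is a minimal sketch. It is not the procedure used by either poster: it treats two sequences as duplicates only when their aligned strings match exactly (ignoring case), and the function names are made up for illustration.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def drop_identical(records):
    """Keep the first record of each distinct sequence.

    Returns (unique_records, duplicates), where duplicates maps each kept
    name to the names of the later records that had the same sequence.
    """
    first_seen = {}   # upper-cased sequence -> (name, original sequence)
    duplicates = {}   # kept name -> list of dropped names
    for name, seq in records:
        key = seq.upper()
        if key in first_seen:
            duplicates[first_seen[key][0]].append(name)
        else:
            first_seen[key] = (name, seq)
            duplicates[name] = []
    return list(first_seen.values()), duplicates


# Tiny made-up alignment: "b" duplicates "a" once case is ignored.
fasta = [">a", "ACGT", ">b", "acgt", ">c", "AC", "GA"]
unique, dups = drop_identical(read_fasta(fasta))
```

Keeping the `duplicates` map makes it possible to add the dropped names back next to their representative after the tree search. Sequences that differ only in gaps or ambiguity codes are not merged by this exact-match approach.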

roblanf commented 4 years ago

That's a good question. I only keep the identical sequences in IQ-TREE, because the code for removing them in IQ-TREE is not well optimised.

I don't keep them in raxml-ng though. Having said that, I still end up with ~7K sequences (using the latest data from GISAID, and trimming as in my repo). What other filtering are you doing?

raxml-ng is running fine (using the fasttree tree, which I first made into a bifurcating tree, as the starting tree), but it is still fairly slow (7 hours so far, and on iteration 5).
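The comment above does not say how the tree was made bifurcating. As one possible sketch, a multifurcating node can be resolved by repeatedly grouping its first two children under a new internal node (which would carry a zero-length branch in a real tree). The nested-tuple tree representation and the function name are assumptions chosen to keep the example self-contained; this is topology only, with no branch lengths or Newick parsing.

```python
def resolve_polytomies(node, max_children=2):
    """Return a copy of `node` in which no node has more than the allowed children.

    A tree is a leaf name (str) or a tuple of subtrees. Nodes below the
    top level are always made binary; `max_children` applies only to the
    node passed in, so an unrooted tree can keep three children at the top.
    """
    if isinstance(node, str):
        return node
    children = [resolve_polytomies(child) for child in node]
    while len(children) > max_children:
        # Group the first two children under a new internal node.
        children = [(children[0], children[1])] + children[2:]
    return tuple(children)


# A four-way polytomy, resolved as a rooted binary tree and as an
# unrooted tree with a top-level trifurcation.
rooted = resolve_polytomies(("A", "B", "C", "D"))
unrooted = resolve_polytomies(("A", "B", "C", "D"), max_children=3)
```

The order in which children are grouped is arbitrary here, so the resolution itself carries no phylogenetic signal; it only gives the tree search a fully resolved topology to start from.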

My CPUs aren't great though (2.4 GHz, I think), so maybe that's part of the problem!