Cibiv / IQ-TREE

Efficient phylogenomic software by maximum likelihood
http://www.iqtree.org
GNU General Public License v2.0
183 stars 44 forks source link

Is it possible to insert new sequences into existing phylogenetic tree? #147

Open boxiangliu opened 4 years ago

boxiangliu commented 4 years ago

I am wondering if it is possible to insert a new sequence into an existing phylogenetic tree.

For instance, I have an existing tree with 500 nodes. Now I have 4 more sequences. It seems very inefficient to rebuild the entire tree. Rather, there should be an option to insert these new sequences into the existing tree.

I searched on BioSTAR for relevant answers. Several packages like pplacer and treebest have this insertion functionality (https://www.biostars.org/p/3490/#3499). I am wondering if this is already implemented in iqtree?

Boxiang

boxiangliu commented 4 years ago

Based on the reply from @roblanf, here is a way to add new sequences to existing alignment.

Let’s say you have a tree called tree1.tree with 96 sequences from aln1.fa. Now you’ve added your four new sequences to that alignment to get a 100 sequence alignment aln2.fa.

To add the new sequences to the tree, just do this:

iqtree -s aln2.fa -g tree1.tree

This will use tree1.tree as a constraint tree, and just add the 4 new sequences in the best place possible without changing any of the relationships in tree1.tree, giving you aln2.tree. This will often be good enough.

But let’s say you’re concerned that adding the 4 new sequences may change some of the relationships in tree1.tree. Simply re-optimise aln2.tree to account for this:

iqtree -s aln2.fa -t aln2.tree

That’s it!

As @roblanf suggested in our email chain, it might be useful to combine these two options into a simpler one-command option. I'd be keen to discuss this!

roblanf commented 4 years ago

Here's my suggestion for what would seem natural from a users perspective.

A single --update command which just adds to -t (starting tree) as follows:

iqtree -s aln.fa -t start.tree --update

There are a few possibilities for taxon sets from the alignment and the tree:

  1. Taxa are in the alignment and the tree - keep them (of course)
  2. Taxa are in the tree but not the alignment - prune them from the tree
  3. Taxa are in the alignment but not the tree - add them using start.tree as a constraint

Following these steps, the run could proceed according to the rest of the commandline, just using the tree passed via -t (possibly with the added new taxa) as the starting tree for the tree search.

In one case where the taxa in the alignment and the tree match perfectly, the --update would be superfluous, i.e. it's just the same as specifying a starting tree via -t. I don't think this matters.

Thoughts?

roblanf commented 4 years ago

We had a meeting about this yesterday, so I wanted to take some notes here.

Given an input tree that could have multifurcations, and an alignment that has all of the sequences in the input tree plus some additional sequences, there are three stages to the algorithm we need.

  1. Placement
  2. Local refinement
  3. Global refinement

Here are some options for these steps.

  1. Placement (i.e. finding a good branch on which to place new sequences) IQ-TREE already contains the code to do placement with Maximum Parsimony. But this could also be done with ML following the Evolutionary Placement Algorithm. Currently IQ-TREE does random sequential placement. This could be replaced by parallel placement with the placements combined in a final step.

  2. Local refinement Obviously given a branch placement, one needs to then figure out the ML position of a sequence on that branch. But in addition James Barbetti suggested that following each new sequence around using local NNIs seems to be very effective. I.e. one doesn't change the underlying tree, but you keep trying neighbouring branches for the local sequence in ML space until it gets to a local optimum. For this stage we figure one can keep the model parameters identical to those that were estimated previously.

  3. Global refinement After adding sequences, it seems sensible to then do some global refinements. IQ-TREE of course has all the machinery for this. A sensible default might be to simply do do whatever IQ-TREE usually does as the default when you have a starting tree, i.e. to do the stochastic algorithm. But users should also have the options to speed things up e.g. by using the -fast option to just do a couple of local rearrangements. This part should, of course, also involve refining the model parameters.

roblanf commented 4 years ago

Another point from another meeting. James has a strong intuition that if we add too many new sequences at once, we'll make life difficult for ourselves. Specifically, he thought that an upper limit of around the square root of the number of sequences currently in the tree might be sensible.

Of course, this can be tested empirically for any given dataset. Something like this.

Start with a tree of N taxa, with another N taxa to add to it.

Use the following algorithm, for different values of K:

  1. Place K sequences
  2. Local refinement
  3. Global refinement
  4. Go to 1 until all N sequences are placed

E.g. if we start with 10K sequences, we could pick K to be 1, 10, 100, 1000, 2000, 5000, 10000.

We then compare all of those trees to each other, and to the tree that we optimise by doing all 20K sequences (N+N) from scratch with standard IQ-TREE parameters.

It seems likely that at the lower end the 20K sequence tree could be a lot better than the one-shot 20K sequence tree, simply because we'll end up doing a lot of global NNIs. But it's an interesting and open question to see if this is true, and if it is true, where the tradeoff is.

JamesBarbetti commented 4 years ago

I haven't experimented with adding sequences to taxa trees as such, but back in May, before I started work, I experimented with adding sequences to unrestricted Steiner trees (and bifurcated trees) in euclidean 2-space. As nodes were added to the tree, they quickly "messed up" local subtrees; so new nodes landing those "messed up" subtrees weren't well-placed. Additional tidy-up processing after the addition of each node helped, but didn't prevent the problem (I tried "no" tidy-up, local NNI, global SPR, and even global TBR moves, and combinations). global SPR moves (and even more so) global TBR moves really helped a lot but the cost of doing those (that frequently) in a proper taxa tree would be prohibitive! I didn't "have the machinery" to try "localised" SPR moves or "localised" TBR moves. Those are options that might be worth considering.

roblanf commented 4 years ago

One option in IQ-TREE (since it's not built for SPR or TBR moves) would be addition via MP (already implemented) followed by local NNI moves that follow each sequence around until it finds an ML optimum position. I.e. where we move the sequence along an NNI chain, but never change the underlying tree to which is is being added.

I could imagine (but not code!! hah) doing this in parallel for N sequences, then combining them all into a single tree and doing another set of local NNIs on any disrupted branches in that combined tree.

Separating out the local (by which I mean NNI moves that don't change the background tree to which the N sequences are being added) from the global NNI moves would be really useful, because sometimes there's good reason not to want global moves at all (e.g. if the sequences we are adding are a bit dodgy, and we worry they might mess up the global tree all together).

roblanf commented 4 years ago

A question - @minh and @JamesBarbetti, how long do you think it would take to implement each of these steps:

  1. Local NNI moves for added sequences
  2. An ML version of the sequence-addition algorithm
  3. Adding N new sequences in parallel (either with the MP+local NNI, or with the ML sequence addition), followed by bringing them all into a single tree and doing local NNIs on all disrupted branches

?