agormp / phylotreelib

Library for analyzing and manipulating phylogenetic trees
GNU General Public License v3.0
11 stars 1 forks source link

spr() from bifurcating tree leads to non-bifurcating tree #2

Closed EDohmen closed 1 year ago

EDohmen commented 1 year ago

Hi, thanks for the work and the lib. I have a problem with the spr (subtree pruning and regrafting) function, which I would like to use to create randomly different topologies of a strictly bifurcating tree. If I take as the source node a leaf and not an internal node this leads to internal nodes being non-bifurcating (just one additional internal node between the original internal node and the leaf) and this can lead in newick representation even to unmatching opening and closing parentheses. Is it not possible to prune leaves or is there a workaround how I can remove these introduced non-bifurcating internal nodes?

Also, I find it quite difficult/verbose to define all conditions to filter possible regrafting nodes after I picked (randomly) a pruning node from the tree (which doesn't lead to an Error due to being a (remote - also internal) child of the same node or a parent or root or the other children of the parent node...). Is there maybe a function (planned) which makes this easier and gives potential options to do random spr moves on a tree?

Thanks a lot in advance and again thanks for your work with this library! Elias

P.S.: I wrote for this some helper functions like a recursive "_get_all_remote_children()" function to get all remote child nodes, not just remote child leaves or an "_is_rotation_variant()" to check for rotation variants of two trees (or is this already solved by comparing the (sorted) newick representations?) - if these are ideas for future additions to the lib and you are interested, I'm happy to share.

agormp commented 1 year ago

Thanks again for the interest and sorry about the delay!

Regarding bug in spr() method when using on single leaf: Thanks for pointing this out! This should now be fixed in version 1.18.3 (available on PyPi and GitHub). Please let me know if there are still issues. (For one thing, I am not currently dealing intelligently with branch lengths, but topology aspects should be ok).

Regarding possible regrafting nodes: Hmm - good point! If I understand correctly the issue is that, after choosing a node where you want to prune, you need a list of the possible locations (nodes) where the pruned subtree can subsequently be regrafted? (This means excluding all nodes that are on the pruned subtree, and also excluding the internal node where the pruned subtree was originally attached). Alternatively (or additionally) you would like a function that automates this and does a random SPR move? Let me just think about this and get back.

Also thanks for offering to contribute your functions! Regarding "_is_rotation_variant()": This should already be covered by existing methods: The __eq__() method, for instance, allows to check for tree equality using:

tree1 == tree2

This checks that the two trees have the same unrooted topology (regardless of the Newick representation) but also checks that all branch lengths are (approximately, within typical precision) the same. An alternative is to explicitly check for equality of just the tree topologies (so ignoring branch lengths):

tree1.topology() == tree2.topology()

The .topology() method returns a set-based representation of the topology: Names of leaves on one side of a branch are represented as an immutable set. A bipartition (= internal branch in the unrooted tree) is represented as an immutable set of two such (complementary) sets. The entire tree topology is represented as a set of bipartitions. Thus, two trees with the same sets of bipartitions will have the same output from .topology(), regardless of how their original tree strings looked.

agormp commented 1 year ago

Regarding random spr(): I have now updated the library to allow random spr() moves (version 1.19.1, available on PyPi and GitHub). You can choose between the following ways to run the method (assuming you have a Tree object named "mytree"):

  1. mytree.spr(): when neither prune_node (previously called "subtree_node") nor regraft_node are specified, the method will choose both randomly
  2. mytree.spr(prune_node = "Bonobo"): when only the prune_node is specified, the method will choose the regraft_node randomly. (Note that the prune_node can be either a leaf or an internal node)
  3. mytree.spr(prune_node=5, regraft_node="Gorilla"): it is still possible to specify both prune_node and regraft_node. In this case the method will check if regraft_node is compatible with prune_node (i.e., whether it is on the part not removed after pruning the subtree), and raise a TreeError exception if not.

If you want to manually find the set of possible regraft_nodes corresponding to a given prune_node, then the recipe below should work. It is here assumed you have a Tree object named "mytree" and that "prune_node" has been assigned some value (either an internal node ID, or the name of a leaf):

remote_nodes = mytree.remote_nodes(subtree_node)
possible_regraft_nodes = mytree.nodes - remote_nodes - {mytree.root}
subtree_parent = mytree.parent(subtree_node)
if mytree.is_bifurcation(subtree_parent):
    possible_regraft_nodes = possible_regraft_nodes - {subtree_parent}

(EDIT: that recipe doesn't work. Also working on removing final bugs. Stay tuned...)

Kind regards, Anders

agormp commented 1 year ago

I fixed one remaining bug in the spr() method, which should now be working as explained above. The updated library is available as version 1.20.2 on PyPi and GitHub.

As mentioned you can use the spr() method in three different ways:

  1. Let the method choose random values for prune_node and regraft_node: mytree.spr()
  2. Specify prune_node and let the method choose a random value for the regraft_node: mytree.spr(prune_node=n1)
  3. Specify both prune_node and regraft_node: mytree.spr(prune_node=n1, regraft_node=n2). A TreeError exception will be raised if the regraft_node is incompatible with the prune_node.

If you want to manually choose a regraft_node among those that are compatible with a specific prune_node (perhaps an unlikely scenario now...), then you can use the new method here to get a list of possible values:

mytree.possible_spr_regraft_nodes(prune_node)

Closing for now, but let me know if there are any issues.

Kind regards, Anders

EDohmen commented 1 year ago

Thanks a lot for all the work and the quick fixes, I will try the new spr() method(s), but I like a lot already the clarity and consistency in how to use the function for the different purposes! Cheers, Elias