PoonLab / Kaphi

Kernel-embedded ABC-SMC for phylodynamic inference
GNU Affero General Public License v3.0
4 stars 2 forks source link

MAST is slow #128

Closed ArtPoon closed 6 years ago

ArtPoon commented 6 years ago

I'm pretty certain we can use dynamic programming to speed up calculation of this recursive function

mavino commented 6 years ago

yes and the problem was tested with the same trees as #129

mavino commented 6 years ago

the must.py script from Kuhner and Yamato is much faster than the one in Kaphi and gives different results

ArtPoon commented 6 years ago

Here's the mast.py script from Kuhner and Yamato:

import ete2
from ete2 import Tree
from ete2 import TreeNode
import copy

# this is a function for use elsewhere, not stand-alone
def mast(intree1,intree2):
  # copy the trees--this function DESTROYS its working copy!
  tree1 = copy.deepcopy(intree1)
  tree2 = copy.deepcopy(intree2)
  # count the tips
  numtips = len(tree1.get_leaves())
  assert numtips == len(tree2.get_leaves())  # trees must be same size!!

  ### label nodes

  # first, relabel tips as 0..numtips-1.  This has to be done all at
  # once as there might already have been tips with those names

  tipnames = tree1.get_leaf_names()
  tipdict = {}
  tipno = 0
  for tip in tipnames:
    tipdict[tip] = tipno
    tipno += 1
  for leaf in tree1.get_leaves():
    leaf.name = tipdict[leaf.name]
  for leaf in tree2.get_leaves():
    leaf.name = tipdict[leaf.name]

  # then, label internal nodes, starting with tips+1
  nodeno = numtips
  for node in tree1.traverse("postorder"):
    if not node.is_leaf():  
      node.name = nodeno
      nodeno += 1
  nodeno = numtips
  for node in tree2.traverse("postorder"):
    if not node.is_leaf():
      node.name = nodeno
      nodeno += 1

  # create the 2D array
  numnodes = 2 * numtips - 1
  vals = []
  inner = []
  for i in range(0,numnodes):
    inner.append(None)
  for i in range(0,numnodes):
    vals.append(inner[:])     # make a copy!

  # the mast score is the number of tips in the MAST of the given subtrees
  def rmast(a,w):   # recusively calculate mast score
    if vals[a.name][w.name] != None:
      return vals[a.name][w.name]    # already calculated
    else:
      # leaf branch
      if a.is_leaf():
        if a.name in w.get_leaf_names():
          vals[a.name][w.name] = 1
          return 1
        else:
          vals[a.name][w.name] = 0
          return 0
      if w.is_leaf():
        if w.name in a.get_leaf_names():
          vals[a.name][w.name] = 1
          return 1
        else:
          vals[a.name][w.name] = 0
          return 0
      # non-leaf branch
      b = a.get_children()[0]
      c = a.get_children()[1]
      x = w.get_children()[0]
      y = w.get_children()[1]
      results = []
      results.append(rmast(b,x)+rmast(c,y))
      results.append(rmast(b,y)+rmast(c,x))
      results.append(rmast(a,x))
      results.append(rmast(a,y))
      results.append(rmast(b,w))
      results.append(rmast(c,w))
      vals[a.name][w.name] = max(results)
      return vals[a.name][w.name]

  biggest = 0
  for node1 in tree1.traverse("postorder"):
    for node2 in tree2.traverse("postorder"):
      rmast(node1,node2)

  biggest = 0
  for i in range(0,numnodes):
    for j in range(0,numnodes):
      if vals[i][j] > biggest:
        biggest = vals[i][j]
  return biggest

class Mast:
  # NOTE that we make this a distance when it is natively a similarity!
  def __init__(self,storage):
    self.results = copy.deepcopy(storage)
  def doMetric(self,tree1,tree2,treesize):
    return treesize -mast(tree1,tree2)
  color = "m-"
  name = "MAST"
mavino commented 6 years ago

when a tree is compared with itself, MAST from Kaphi gives me value=0, MAST from the phyton script will give me value=17

mavino commented 6 years ago

The last experiment was made with a tree of 17 tips

ArtPoon commented 6 years ago
ArtPoon commented 6 years ago

@helenhe96, you can find documentation about ete2 here: http://etetoolkit.org/docs/2.3/reference/reference_tree.html

helenhe96 commented 6 years ago

MAST in Kaphi and the Python script are producing the same results. I think @mavino might be missing the part of the code (in the Python script) where mast(tree1,tree2) is being subtracted from treesize.

mavino commented 6 years ago

that's right, I missed the last part, thanks and sorry

ArtPoon commented 6 years ago

@helenhe96 identified a potential problem - there are nested loops in our implementation of MAST that is iterating over nodes in the default ordering of ape::phylo objects. We should change this to postorder traversal as in the Python implementation. See this example.

ArtPoon commented 6 years ago

See get.tip.labels() function in this thread

helenhe96 commented 6 years ago

MAST benchmarking (in seconds)

5-tip-tree 20-tip-tree 50-tip-tree 100-tip-tree
Python 0.00131988525391 0.0170969963074 0.107938051224 0.447192907333
R, old version (2b6245b) 0.275 1355.131 / /
R, new version (3eb64e8) 0.059 1.503 10.675 55.743
R, newer version (4879323) 0.008 0.069 0.314 1.001