PoonLab / Kaphi

Kernel-embedded ABC-SMC for phylodynamic inference
GNU Affero General Public License v3.0

Incorporate tree shape statistics into SMC-ABC #41

Closed ArtPoon closed 7 years ago

ArtPoon commented 7 years ago

We need to evaluate the kernel method against tree shape statistics, and to give the user the option of using any combination of such statistics as a similarity measure. It would be nice to use R's model formula syntax (as in glm) for this, e.g.: ~ colless + cherries + kernel
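
As a rough illustration of the formula idea, base R can already recover the term names from a one-sided formula. This is only a sketch and not Kaphi code: `stat.terms` is a hypothetical helper name, and how the terms would then be mapped to functions is left open.

```r
# Hypothetical helper: list the statistics named in a one-sided formula.
stat.terms <- function(f) {
  stopifnot(inherits(f, "formula"))
  # terms() processes the formula symbolically (no data needed);
  # its "term.labels" attribute holds one string per term
  attr(terms(f), "term.labels")
}

stat.terms(~ colless + cherries + kernel)
```

The names come back as plain strings, so each could be checked against the loaded packages before any tree is processed.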

ArtPoon commented 7 years ago

@rmcclosk already implemented several of these statistics in treestats.c. To avoid further duplication of functionality, we should bring in another R package like phyloTop.

ArtPoon commented 7 years ago

Discussed this at dev meeting. I think the way forward is to require the user to specify an R expression as a string that represents a linear combination of functions for processing two trees x and y:

"kernel(x, y, gauss=1.0, sigma=1.0) + 2.7 * (sackin(x) - sackin(y)) + 0.0017 * (tree.width(x) - tree.width(y))"

where this expression would be evaluated in a generic distance function. This function would have to throw an exception if one or more of the functions in the expression are not available (for example, if the corresponding R library was not loaded). The expression should be specified as an argument to an smc.config object under keyword dist.

The advantage of this approach is that the user is able to use whatever tree shape metrics or distances they want to. The disadvantage is that the user can use whatever metrics and distances they want to.
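
The generic distance function described above might look roughly like the following. This is a minimal sketch under stated assumptions, not the Kaphi implementation: `make.distance` is a hypothetical name, the "trees" in the usage lines are plain numeric vectors, and the toy `sackin` is a stand-in for a real statistic. The toy expression wraps the difference in abs() so the result is non-negative; that is a choice made here, not something the thread specifies.

```r
# Hypothetical constructor: turn a distance string into a function of (x, y).
make.distance <- function(dist.string, env = parent.frame()) {
  force(env)  # fix the lookup environment at construction time
  expr <- parse(text = dist.string)[[1]]
  # names used as functions (operators included) = all names minus variables
  called <- setdiff(all.names(expr, unique = TRUE), all.vars(expr))
  found <- vapply(called, exists, logical(1), envir = env, mode = "function")
  absent <- called[!found]
  if (length(absent) > 0) {
    stop("functions not available: ", paste(absent, collapse = ", "))
  }
  # x and y are supplied per call; everything else resolves through env
  function(x, y) eval(expr, envir = list(x = x, y = y), enclos = env)
}

# Toy stand-in so the sketch runs without any phylogenetics package:
# a "tree" here is just a numeric vector.
sackin <- function(tree) sum(tree)

d <- make.distance("2.7 * abs(sackin(x) - sackin(y))")
d(c(1, 2, 2), c(1, 1))

# A function that is not available is reported before any tree is compared
# (this assumes no attached package happens to define colless):
tryCatch(make.distance("colless(x) - colless(y)"),
         error = function(e) conditionMessage(e))
```

Checking the function names once, when the distance is built, means a missing library surfaces immediately rather than partway through an SMC run.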

gtng92 commented 7 years ago

Tree shape / distance functions in some R packages

Package: tree statistics

ape
  • balance
  • cherry
  • cophenetic.phylo
  • dist.nodes
  • dist.topo
  • gammaStat

phyloTop
  • avgLadder
  • cherries
  • colless.phylo
  • getDepths
  • ILnumber
  • maxHeight
  • nodeDepth
  • nodeImb
  • pitchforks
  • sackin.phylo
  • stairs
  • widths

phangorn
  • cophenetic.networx

apTreeshape
  • colless
  • sackin
  • shape.statistic

MathiasRenaud commented 7 years ago

For the task of writing a function that takes two trees and applies the dist expression parsed by Tammy's function, I am going to update the existing function, distance(t1, t2, config), to use dist rather than kernel. I believe this would fulfill the requirements of the function and also work toward stripping out the hard-coded kernel distance.

ArtPoon commented 7 years ago

Provide two methods for specifying a distance function:

1. Write a detailed yaml entry as per @gtng92's implementation, which is parsed and converted into an R expression.
2. Write an R expression to be evaluated with eval(parse()). The expression should define a linear combination of R functions where each function returns a distance.

The first method would be more useful for batch processing, but the second should be more user-friendly.

Bear in mind that the kernel function is a similarity measure, so we should take the complement (1-k) to get a distance.

MathiasRenaud commented 7 years ago

@ArtPoon, is the user expected to write "kernel(x,y) + (sackin(x) - sackin(y))" explicitly, or just to specify which functions to use, as in "kernel(x,y) + sackin(x)"? The second case would require parsing the function out of the expression and returning the expression with (sackin(x) - sackin(y)) substituted in.

ArtPoon commented 7 years ago

The user is expected to write out the full expression:

kernel(x,y) + (sackin(x) - sackin(y))

because there are potentially many approaches to using sackin other than the difference.
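
To make the point concrete, here are a few ways one statistic could enter a distance expression. This is an illustrative sketch only: the four forms are examples chosen here, not options taken from Kaphi, and `sackin` is a toy stand-in operating on numeric vectors rather than trees.

```r
# Toy stand-in for a tree statistic; a "tree" is just a numeric vector here.
sackin <- function(tree) sum(tree)

# Illustrative ways to turn one statistic into a term (assumptions, not Kaphi options)
forms <- c(
  signed   = "sackin(x) - sackin(y)",
  absolute = "abs(sackin(x) - sackin(y))",
  squared  = "(sackin(x) - sackin(y))^2",
  logratio = "abs(log(sackin(x) / sackin(y)))"
)

# Evaluate every form for the same pair of inputs
eval.forms <- function(x, y) {
  vapply(forms, function(f) eval(parse(text = f)), numeric(1))
}

eval.forms(c(1, 2, 2), c(1, 1))
```

Only the user knows which of these suits their model, which is why the full expression is written out rather than inferred.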

ArtPoon commented 7 years ago

The limitation of yaml is that we can't specify arithmetic operations among tree shape statistics as we could with an R expression. We have to assume that if a statistic/function takes a single tree argument, then its distance term is the difference between its values for the x and y tree arguments.
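
One way the conversion from a parsed yaml entry to an R expression could work is sketched below. Everything here is an assumption for illustration: the field names (`name`, `weight`, `two.tree`, `similarity`), the helper `build.dist.string`, and the use of a plain R list in place of real parsed yaml. abs() is added so that each single-tree term is non-negative; the thread itself only says "difference". Similarity measures such as the kernel are complemented (1 - k) as noted earlier in the thread.

```r
# Hypothetical spec, shaped like what a parsed yaml entry might yield
spec <- list(
  list(name = "kernel", weight = 1.0, two.tree = TRUE,  similarity = TRUE),
  list(name = "sackin", weight = 2.7, two.tree = FALSE, similarity = FALSE)
)

# Hypothetical helper: build the distance string from the spec
build.dist.string <- function(spec) {
  terms <- vapply(spec, function(s) {
    core <- if (isTRUE(s$two.tree)) {
      sprintf("%s(x, y)", s$name)
    } else {
      # single-tree statistic: assume the term is the difference between x and y
      sprintf("abs(%s(x) - %s(y))", s$name, s$name)
    }
    # a similarity is turned into a distance by taking its complement
    if (isTRUE(s$similarity)) core <- sprintf("(1 - %s)", core)
    sprintf("%s * %s", format(s$weight), core)
  }, character(1))
  paste(terms, collapse = " + ")
}

build.dist.string(spec)
```

The resulting string has the same form as a hand-written expression, so both specification methods could feed the same evaluation step.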

MathiasRenaud commented 7 years ago

@gtng92 and I have implemented functions so that users can specify their choice of tree shape statistics. The main problem we still have is that certain tree stats functions are not compatible with this approach.

For example, the output of ape::cophenetic.phylo is a matrix of the pairwise distances between the tips of the tree. If the two trees being compared do not have the same number of tips, it causes an error. If the trees do have the same number of tips, the entire output of the distance function is coerced to a matrix.

Another case is that some functions (e.g. ape::gammaStat, Kaphi::pybus.gamma) return NA or NaN if the given trees do not meet some criteria (such as having enough tips), which causes the final distance to become NA. This should be easy enough to fix by checking for NA values and removing them from the distance measure.

At this point, it works with most of the Kaphi tree stats functions, since most take one tree and return a numeric value.
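
The two problems described above could be guarded against along these lines. This is a hedged sketch, not the fix that was eventually applied: `safe.distance` and `gamma.toy` are hypothetical names, the "trees" are numeric vectors, and evaluating the terms one at a time is an assumption made so that an unusable term can be identified.

```r
# Hypothetical wrapper: evaluate each term separately and keep only usable values.
safe.distance <- function(term.strings, x, y, env = parent.frame()) {
  force(env)
  values <- lapply(term.strings, function(s) {
    eval(parse(text = s), envir = list(x = x, y = y), enclos = env)
  })
  # each term must be a single number (or a single NA);
  # matrix-valued output such as a cophenetic matrix is rejected
  ok.shape <- vapply(values, function(v) {
    length(v) == 1 && (is.numeric(v) || is.na(v))
  }, logical(1))
  if (!all(ok.shape)) {
    stop("terms must return a single number: ",
         paste(term.strings[!ok.shape], collapse = "; "))
  }
  values <- as.numeric(unlist(values))
  dropped <- is.na(values)  # TRUE for both NA and NaN
  if (all(dropped)) return(NA_real_)
  if (any(dropped)) {
    warning("dropped NA terms: ", paste(term.strings[dropped], collapse = "; "))
  }
  sum(values[!dropped])
}

# Toy statistic that is undefined for small inputs, mimicking the NaN case
gamma.toy <- function(tree) if (length(tree) < 3) NaN else mean(tree)

terms <- c("abs(sum(x) - sum(y))", "abs(gamma.toy(x) - gamma.toy(y))")
suppressWarnings(safe.distance(terms, c(1, 2, 2), c(1, 1)))
```

Note that dropping a term means distances for different tree pairs may be built from different numbers of terms; whether that is acceptable for the SMC is left open here.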

ArtPoon commented 7 years ago

Okay, please record these problems as separate issues. Thanks!

(Sent by email in reply to Mathias Renaud's comment above, dated Jul 14, 2017, 11:36 AM.)

MathiasRenaud commented 7 years ago

Closing, since the remaining bugs have their own issues.