PoonLab / Kaphi

Kernel-embedded ABC-SMC for phylodynamic inference
GNU Affero General Public License v3.0
4 stars 2 forks source link

Some tree statisitic functions have incompatible outputs #87

Closed MathiasRenaud closed 7 years ago

MathiasRenaud commented 7 years ago

For example, the output of ape::cophenetic.phylo is a matrix of the pairwise distances between each tip in the tree. If the two trees being compared do not have the same number of tips it causes an error. If the trees do have the same number of tips, the entire output of the distance function is coerced to a matrix.

ArtPoon commented 7 years ago

Cophenetic is not going to work as a distance metric unless the trees have the same labels (including the number of tips). The reason is that it is otherwise not clear how to align the two matrices (shuffle rows and columns). I think a reasonable approach for comparing two matrices given this condition is met is to use a matrix correlation coefficient.

ArtPoon commented 7 years ago

To elaborate on the second problem, some tree shape functions implemented in other packages do not return a scalar value, but instead a structured value such as a list. As a result, the number we want would be accessed by result$estimate, for example.

Options:

  1. Include these other packages as dependencies and write wrapper functions for each one.
  2. Require user to specify inline function to extract distance measure, e.g., ...+sackin(t1)$est +...
  3. Make our own implementations (as we have in treekernel)

All three options suck.

ArtPoon commented 7 years ago

Another possibility is to parse the R expression (distance specification string) and add the inline operations required to extract scalar quantities from each term. For example:

expr <- "sackin(x)-sackin(y)"  # original expression
expr2 <- "sackin(x)$val-sackin(y)$val"  # parsed expression

assuming that sackin returns a list including label val.

This would also provide the opportunity to simplify the distance specification string, so the user only has to name sackin and we would add the absolute difference on trees x and y.

ArtPoon commented 7 years ago

Add packages as prereqs.

MathiasRenaud commented 7 years ago

Updated Tree Stats Functions:

Valid Distance Metrics

Tree Statistic Package Description
tree.kernel* Kaphi The kernel distance between two trees.
nLTT Kaphi Does not currently work
sackin Kaphi Sackin index.
colless Kaphi Colless imbalance number.
cophenetic Kaphi Does not currently work
ladder.length Kaphi Max ladder length.
IL.nodes Kaphi Number of internal nodes with one leaf.
tree.width Kaphi Max width divided by max depth.
max.delta.width Kaphi Max difference in in width between two levels.
n.cherries Kaphi Number of cherries(node with two leaves).
prop.unbalanced Kaphi Proportion of unbalanced subtrees.
avg.unbalance Kaphi Average ratio of unbalanced subtrees.
pybus.gamma Kaphi Pybus' gamma statistic
internal.terminal.ratio Kaphi Ratio of internal to terminal branches.
balance** ape For each node of the tree, the numbers of tips on each of its daughter-branches.
cophenetic.phylo** ape Pairwise distances between the pairs of tips from a phylogenetic tree using its branch lengths.
dist.nodes** ape Like cophenetic.phylo, but includes internal nodes.
dist.topo* ape Topological distance between two trees using the method from Penny & Hendy (1985).
gammaStat ape (pybus.gamma in Kaphi)
avgladder phyloTop Mean size of ladders in the tree.
cherries yloTop (n.cherries in Kaphi)
colless.phylo phyloTop (colless in Kaphi)
getDepths** phyloTop Returns a list of two vectors: tipDepths and nodeDepths.
ILnumber phyloTop (IL.nodes in Kaphi)
maxHeight phyloTop (ladder.length in Kaphi)
pitchforks phyloTop Number of clades with three tips.
sackin.phylo phyloTop (sackin in Kaphi)

* requires two trees. ** output is non-scalar.

ArtPoon commented 7 years ago

Our decision was to have the user specify a distance by referring to functions in other packages by name, but this string would be parsed into a valid R expression (including the extraction of numerical values) by our package.
For example, if there are functions foo and bar in one of the packages, then the user should be able to specify a distance as foo+bar, which would be parsed and converted to an R expression: abs(pkg::foo(x)-pkg::foo(y)) + abs(pkg::bar(x)-pkg::bar(y)).

ArtPoon commented 7 years ago

Let's just focus on ape and phyloTop for now, and skip the functions that are already implemented in Kaphi.

MathiasRenaud commented 7 years ago

Instead of having the user specify "foo+bar", it will have to be something like "foo(param1=0.1, param2=TRUE)+bar(param1='example', param2=FALSE)" in order to be able to include any additional arguments (e.g. decay.factor, rbf.variance, set.control for kernel).

ArtPoon commented 7 years ago

Yep that's the idea. There should also be sensible defaults -- however, should we require parentheses or not? I'm inclined to dropping parentheses for default behaviour.

MathiasRenaud commented 7 years ago

If we drop the parentheses, how should I have the user indicate which function the parameters belong to? Would commas suffice? "foo,param1=0.1,param2=TRUE+..."

Also, I'm working on this issue on the issue41 branch.

ArtPoon commented 7 years ago

If the user wants to use default parameters, then they would either use empty parentheses: Sackin() or nothing at all: sackin

I'm asking your opinions on which one would be better.

Sent from my iPhone

On Aug 3, 2017, at 11:36 PM, Mathias Renaud notifications@github.com wrote:

If we drop the parentheses, how should I have the user indicate which function the parameters belong to?

Also, I'm working on this issue on the issue41 branch.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

MathiasRenaud commented 7 years ago

I understand now! I think that dropping parentheses would be a good way of indicating that default parameters are to be used.

MathiasRenaud commented 7 years ago

Example:

'0.8*kernel.dist(decay.factor=0.2,rbf.variance=100.0,sst.control=1.0)+0.1*sackin(use.branch.lengths=TRUE)+0.3*colless'
ArtPoon commented 7 years ago

Hm, now that I see it that way, I have to say it looks kind of odd... What if we treat this expression as a sort of lambda function with a variable x? For instance, your example would be written:

0.8*kernel.dist(x, decay.factor=0.2, rbf.variance=100.0, sst.control=1.0) + 
  0.1*sackin(x, use.branch.lengths=TRUE) + 
  0.3*colless(x)
MathiasRenaud commented 7 years ago

Would kernel.dist then be: 0.8*kernel.dist(x, y, decay.factor=0.2,...) ?

MathiasRenaud commented 7 years ago

I've updated the parsing for the new format and am working on the wrapper functions for the functions that return non-scalar outputs. For ape::balance(x), the return value is a matrix with a row for each node and two columns, giving the number of decedents from each branch of each node:

> t5 <- read.tree(text="(((A:0.1,B:0.1):0.15,C:0.25):0.05,D:0.3):0;")
> balance(t5)
  [,1] [,2]
5    3    1
6    2    1
7    1    1

To represent the data in the matrix as a scalar, I wrote a function to take the difference between the two columns for each row, then take the average of the differences.

> balance.met(t5)
[1] 1
ArtPoon commented 7 years ago

Yeah I've been thinking about this... I had wanted to spare the user from having to write out stat(x)-stat(y) over and over, and requiring the notation kernel.dist(x,y) would be consistent with that approach.

Based on our approach of arithmetic differences being the assumed method for summary statistics (although we can add another parameter to get around this...), I think the consistent notation would be x only.

Sent from my iPhone

On Aug 4, 2017, at 6:54 PM, Mathias Renaud notifications@github.com wrote:

Would kernel.dist then be: 0.8*kernel.dist(x, y, decay.factor=0.2,...) ?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

ArtPoon commented 7 years ago

After discussion at meeting, we decided to emulate the R package ergm model specification scheme where default model terms do not have parentheses, and there is no placeholder variable for the object, i.e.,, tree x and y.

ArtPoon commented 7 years ago

I'm going to peel off some of the content of this issue to a separate issue, and return this issue to the original problem of working with tree metrics that have non-scalar outputs.

ArtPoon commented 7 years ago

Cophenetic function implemented in Kaphi appears to be cophenetic index, which is different than implementation in ape that returns a pairwise (patristic) distance matrix.

MathiasRenaud commented 7 years ago

cophenetic in Kaphi has been updated to cophenetic.index

This will help differentiate this function from cophenetic.phylo in ape and cophenetic in stats.

MathiasRenaud commented 7 years ago

Also, I looked into the function balance in ape and it essentially gives the same information as Colless' index. I'll remove the implementation of balance since colless in Kaphi will measure tree imbalance.