Closed MathiasRenaud closed 7 years ago
Cophenetic is not going to work as a distance metric unless the trees have the same labels (including the number of tips). The reason is that it is otherwise not clear how to align the two matrices (shuffle rows and columns). I think a reasonable approach for comparing two matrices given this condition is met is to use a matrix correlation coefficient.
To elaborate on the second problem, some tree shape functions implemented in other packages do not return a scalar value, but instead a structured value such as a list. As a result, the number we want would be accessed by result$estimate
, for example.
Options:
...+sackin(t1)$est +...
treekernel
)All three options suck.
Another possibility is to parse the R expression (distance specification string) and add the inline operations required to extract scalar quantities from each term. For example:
expr <- "sackin(x)-sackin(y)" # original expression
expr2 <- "sackin(x)$val-sackin(y)$val" # parsed expression
assuming that sackin
returns a list including label val
.
This would also provide the opportunity to simplify the distance specification string, so the user only has to name sackin
and we would add the absolute difference on trees x
and y
.
Add packages as prereqs.
Updated Tree Stats Functions:
Tree Statistic | Package | Description |
---|---|---|
tree.kernel* | Kaphi | The kernel distance between two trees. |
Does not currently work | ||
sackin | Kaphi | Sackin index. |
colless | Kaphi | Colless imbalance number. |
Does not currently work | ||
ladder.length | Kaphi | Max ladder length. |
IL.nodes | Kaphi | Number of internal nodes with one leaf. |
tree.width | Kaphi | Max width divided by max depth. |
max.delta.width | Kaphi | Max difference in in width between two levels. |
n.cherries | Kaphi | Number of cherries(node with two leaves). |
prop.unbalanced | Kaphi | Proportion of unbalanced subtrees. |
avg.unbalance | Kaphi | Average ratio of unbalanced subtrees. |
pybus.gamma | Kaphi | Pybus' gamma statistic |
internal.terminal.ratio | Kaphi | Ratio of internal to terminal branches. |
balance** | ape | For each node of the tree, the numbers of tips on each of its daughter-branches. |
cophenetic.phylo** | ape | Pairwise distances between the pairs of tips from a phylogenetic tree using its branch lengths. |
dist.nodes** | ape | Like cophenetic.phylo , but includes internal nodes. |
dist.topo* | ape | Topological distance between two trees using the method from Penny & Hendy (1985). |
(pybus.gamma in Kaphi) | ||
avgladder | phyloTop | Mean size of ladders in the tree. |
(n.cherries in Kaphi) | ||
(colless in Kaphi) | ||
getDepths** | phyloTop | Returns a list of two vectors: tipDepths and nodeDepths . |
(IL.nodes in Kaphi) | ||
(ladder.length in Kaphi) | ||
pitchforks | phyloTop | Number of clades with three tips. |
(sackin in Kaphi) |
* requires two trees. ** output is non-scalar.
Our decision was to have the user specify a distance by referring to functions in other packages by name, but this string would be parsed into a valid R expression (including the extraction of numerical values) by our package.
For example, if there are functions foo
and bar
in one of the packages, then the user should be able to specify a distance as foo+bar
, which would be parsed and converted to an R expression: abs(pkg::foo(x)-pkg::foo(y)) + abs(pkg::bar(x)-pkg::bar(y))
.
Let's just focus on ape
and phyloTop
for now, and skip the functions that are already implemented in Kaphi.
Instead of having the user specify "foo+bar"
, it will have to be something like "foo(param1=0.1, param2=TRUE)+bar(param1='example', param2=FALSE)"
in order to be able to include any additional arguments (e.g. decay.factor, rbf.variance, set.control for kernel).
Yep that's the idea. There should also be sensible defaults -- however, should we require parentheses or not? I'm inclined to dropping parentheses for default behaviour.
If we drop the parentheses, how should I have the user indicate which function the parameters belong to? Would commas suffice? "foo,param1=0.1,param2=TRUE+..."
Also, I'm working on this issue on the issue41 branch.
If the user wants to use default parameters, then they would either use empty parentheses: Sackin() or nothing at all: sackin
I'm asking your opinions on which one would be better.
Sent from my iPhone
On Aug 3, 2017, at 11:36 PM, Mathias Renaud notifications@github.com wrote:
If we drop the parentheses, how should I have the user indicate which function the parameters belong to?
Also, I'm working on this issue on the issue41 branch.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.
I understand now! I think that dropping parentheses would be a good way of indicating that default parameters are to be used.
Example:
'0.8*kernel.dist(decay.factor=0.2,rbf.variance=100.0,sst.control=1.0)+0.1*sackin(use.branch.lengths=TRUE)+0.3*colless'
Hm, now that I see it that way, I have to say it looks kind of odd... What if we treat this expression as a sort of lambda function with a variable x
? For instance, your example would be written:
0.8*kernel.dist(x, decay.factor=0.2, rbf.variance=100.0, sst.control=1.0) +
0.1*sackin(x, use.branch.lengths=TRUE) +
0.3*colless(x)
Would kernel.dist then be: 0.8*kernel.dist(x, y, decay.factor=0.2,...)
?
I've updated the parsing for the new format and am working on the wrapper functions for the functions that return non-scalar outputs.
For ape::balance(x)
, the return value is a matrix with a row for each node and two columns, giving the number of decedents from each branch of each node:
> t5 <- read.tree(text="(((A:0.1,B:0.1):0.15,C:0.25):0.05,D:0.3):0;")
> balance(t5)
[,1] [,2]
5 3 1
6 2 1
7 1 1
To represent the data in the matrix as a scalar, I wrote a function to take the difference between the two columns for each row, then take the average of the differences.
> balance.met(t5)
[1] 1
Yeah I've been thinking about this... I had wanted to spare the user from having to write out stat(x)-stat(y)
over and over, and requiring the notation kernel.dist(x,y)
would be consistent with that approach.
Based on our approach of arithmetic differences being the assumed method for summary statistics (although we can add another parameter to get around this...), I think the consistent notation would be x
only.
Sent from my iPhone
On Aug 4, 2017, at 6:54 PM, Mathias Renaud notifications@github.com wrote:
Would kernel.dist then be: 0.8*kernel.dist(x, y, decay.factor=0.2,...) ?
— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.
After discussion at meeting, we decided to emulate the R package ergm
model specification scheme where default model terms do not have parentheses, and there is no placeholder variable for the object, i.e.,, tree x
and y
.
I'm going to peel off some of the content of this issue to a separate issue, and return this issue to the original problem of working with tree metrics that have non-scalar outputs.
Cophenetic function implemented in Kaphi appears to be cophenetic index, which is different than implementation in ape that returns a pairwise (patristic) distance matrix.
cophenetic
in Kaphi has been updated to cophenetic.index
This will help differentiate this function from cophenetic.phylo
in ape and cophenetic
in stats.
Also, I looked into the function balance
in ape and it essentially gives the same information as Colless' index. I'll remove the implementation of balance
since colless
in Kaphi will measure tree imbalance.
For example, the output of ape::cophenetic.phylo is a matrix of the pairwise distances between each tip in the tree. If the two trees being compared do not have the same number of tips it causes an error. If the trees do have the same number of tips, the entire output of the distance function is coerced to a matrix.