Closed ArtPoon closed 7 years ago
@rmcclosk already implemented several of these statistics in `treestats.c`. To avoid further duplication of functionality, we should bring in another R package like phyloTop.
Discussed this at dev meeting. I think the way forward is to require the user to specify an R expression as a string that represents a linear combination of functions for processing two trees `x` and `y`:
"kernel(x, y, gauss=1.0, sigma=1.0) + 2.7 * (sackin(x) - sackin(y)) + 0.0017 * (tree.width(x) - tree.width(y))"
where this expression would be evaluated in a generic distance function. This function would have to throw an exception if one or more of the functions in the expression are not available (for example, if the corresponding R library was not loaded). The expression should be specified as an argument to an `smc.config` object under keyword `dist`.
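A minimal sketch of the parse-and-validate step described above, using only base R. The names `called.functions` and `make.distance` are hypothetical, and `sackin` and `tree.width` here are toy stand-ins that operate on numeric vectors rather than real trees. Package-prefixed calls such as `Kaphi::sackin(x)` are not checked in this sketch.

```r
# Toy stand-ins (assumptions): a "tree" is a numeric vector of tip depths,
# so the sketch runs without any phylogenetics package installed.
sackin <- function(tree) sum(tree)
tree.width <- function(tree) max(table(tree))

# Recursively collect the names of all functions called in an expression.
called.functions <- function(expr) {
  if (!is.call(expr)) return(character(0))
  head <- expr[[1]]
  name <- if (is.name(head)) as.character(head) else character(0)
  args <- as.list(expr)[-1]
  unique(c(name, unlist(lapply(args, called.functions))))
}

# Parse the 'dist' string, fail early if any function is unavailable,
# and return the distance computation as a function of two trees.
make.distance <- function(dist.string, envir = parent.frame()) {
  force(envir)
  parsed <- parse(text = dist.string)
  if (length(parsed) != 1) stop("'dist' must contain exactly one expression")
  expr <- parsed[[1]]
  fns <- called.functions(expr)
  found <- vapply(fns, exists, logical(1), envir = envir, mode = "function")
  absent <- fns[!found]
  if (length(absent) > 0) {
    stop("functions not available: ", paste(absent, collapse = ", "))
  }
  function(x, y) eval(expr, list(x = x, y = y), envir)
}

d <- make.distance(
  "2.7 * (sackin(x) - sackin(y)) + 0.0017 * (tree.width(x) - tree.width(y))"
)
d(c(1, 2, 2, 3), c(1, 1, 2))
```

Validating once when the config is built means a missing library is reported before any simulation runs, rather than partway through.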
The advantage of this approach is that the user is able to use whatever tree shape metrics or distances they want to. The disadvantage is that the user can use whatever metrics and distances they want to.
[x] strip out functions where kernel distance is hard-coded
[x] write function that takes two trees as argument and applies `dist` expression to calculate the composite distance, returning a number -Mathias
[x] write function to parse and validate expression, checking for functions that are not present, store distance computation as a function -Tammy
[x] identify existing R packages that have tree shape / distance functions -Tammy
let's not expose `dist` expression specification for the Shiny app for now
in case of namespace collision (multiple packages with a `sackin` function, for example), we might be able to use a package prefix to specify a specific implementation, e.g., `Kaphi::sackin`
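A rough sketch of how a package prefix could be checked during validation. The helper name `resolve.function` is hypothetical; it relies only on base R's `requireNamespace`, `getNamespaceExports` and `getExportedValue`, and uses `stats::median` as a stand-in because that package is always available.

```r
# Resolve a function given either as "name" or as "package::name".
# Stops with an informative error if the package or export is missing.
resolve.function <- function(spec, envir = parent.frame()) {
  force(envir)
  parts <- strsplit(spec, "::", fixed = TRUE)[[1]]
  if (length(parts) == 2) {
    pkg <- parts[1]
    fn <- parts[2]
    if (!requireNamespace(pkg, quietly = TRUE)) {
      stop("package '", pkg, "' is not installed")
    }
    if (!(fn %in% getNamespaceExports(pkg))) {
      stop("'", fn, "' is not exported by package '", pkg, "'")
    }
    return(getExportedValue(pkg, fn))
  }
  # Unprefixed name: use whatever is first on the search path.
  get(spec, envir = envir, mode = "function")
}

f <- resolve.function("stats::median")
f(c(1, 3, 5))
```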
Package | Tree Stats |
---|---|
ape | |
phyloTop | |
phangorn | |
apTreeshape | |
For the task of writing a function that takes 2 trees and applies the `dist` expression that was parsed in Tammy's function, I am going to use the existing function, `distance(t1, t2, config)`, and update it to use `dist` rather than `kernel`. I believe that doing so would fulfill the requirements of the function and also work toward stripping out the hard-coded kernel distance.
Provide two methods for specifying a distance function:
`eval(parse())` - the expression should define a linear combination of R functions where each function returns a distance
The first method would be more useful for batch processing, but the second should be more user-friendly.
Bear in mind that the kernel function is a similarity measure and we should take the complement (1-k) to get a distance.
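To illustrate the complement, here is a small sketch. It assumes the kernel score is first normalized so that k(x, x) = 1, which the comment above does not state; without some normalization, 1-k need not fall in [0, 1]. `toy.kernel` is a hypothetical stand-in (a dot product on numeric vectors), not the tree kernel itself.

```r
# Hypothetical stand-in for a tree kernel: dot product of feature vectors.
toy.kernel <- function(x, y) sum(x * y)

# Convert a similarity into a distance via the complement 1 - k,
# after normalizing so that an object compared with itself scores 1
# (the normalization is an assumption of this sketch).
kernel.distance <- function(x, y, kernel = toy.kernel) {
  k <- kernel(x, y) / sqrt(kernel(x, x) * kernel(y, y))
  1 - k
}

kernel.distance(c(1, 2, 3), c(1, 2, 3))  # identical inputs
kernel.distance(c(1, 0), c(0, 1))        # no shared features
```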
@ArtPoon, is the user expected to explicitly write "kernel(x,y) + (sackin(x) - sackin(y))" or just specify which functions: "kernel(x,y) + sackin(x)"? The second case would then require parsing the function from the expression and returning the expression with `(sackin(x) - sackin(y))`.
The user is expected to write out the full expression:
kernel(x,y) + (sackin(x) - sackin(y))
because there are potentially many approaches to utilizing `sackin` other than the difference.
The limitation of YAML is that we can't specify arithmetic operations among tree shape statistics like we could with an R expression. We have to assume that if the statistic/function takes a single tree argument, then the distance term is the difference between the `x` and `y` tree arguments.
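A sketch of that assumption, with the YAML content represented as a plain named list of weights so that no YAML package is needed. The names `composite.distance`, `toy.kernel.dist` and the list layout are hypothetical, and counting formal arguments is a simplification: a statistic with optional extra arguments would be misclassified.

```r
# Toy stand-ins (assumptions): trees are numeric vectors.
sackin <- function(tree) sum(tree)                             # one-tree statistic
toy.kernel.dist <- function(x, y) abs(length(x) - length(y))   # two-tree distance

# What a parsed YAML block might look like: statistic name -> weight.
spec <- list(sackin = 2.7, toy.kernel.dist = 1.0)

# Weighted sum of terms. A function with a single formal argument is
# treated as a per-tree statistic and contributes f(x) - f(y);
# anything else is called directly as f(x, y).
composite.distance <- function(x, y, spec, envir = parent.frame()) {
  force(envir)
  terms <- vapply(names(spec), function(name) {
    f <- get(name, envir = envir, mode = "function")
    value <- if (length(formals(f)) == 1) f(x) - f(y) else f(x, y)
    spec[[name]] * value
  }, numeric(1))
  sum(terms)
}

composite.distance(c(1, 2, 3, 4), c(1, 2), spec)
```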
@gtng92 and I have implemented functions so that users can specify their choice of tree shape statistics. The main problem we still have is that certain tree stats functions are not compatible.
For example, the output of `ape::cophenetic.phylo` is a matrix of the pairwise distances between each tip in the tree. If the two trees being compared do not have the same number of tips, it causes an error. If the trees do have the same number of tips, the entire output of the distance function is coerced to a matrix.
Another case is that some functions (e.g. ape::gammaStat, Kaphi::pybus.gamma) return `NA` or `NaN` if the given trees do not meet some criteria (such as not having enough tips), which causes the final distance to become `NA`. This should be an easy enough fix: check for instances of `NA` and remove them from the distance measure.
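One possible shape for that fix, sketched in base R. The helper `combine.terms` is hypothetical: it takes the already-evaluated terms as a named list, rejects anything that is not a single value (which would catch the matrix returned by a cophenetic-style function), and drops `NA`/`NaN` terms with a warning rather than silently.

```r
# Sum the terms of a composite distance, guarding against the two
# problems described above: non-scalar terms and NA/NaN terms.
combine.terms <- function(terms) {
  scalar <- vapply(terms, function(term) {
    length(term) == 1 && (is.numeric(term) || is.na(term))
  }, logical(1))
  if (!all(scalar)) {
    stop("terms must be single numbers: ",
         paste(names(terms)[!scalar], collapse = ", "))
  }
  values <- unlist(terms)
  dropped <- names(values)[is.na(values)]  # is.na() is TRUE for NaN as well
  if (length(dropped) > 0) {
    warning("dropping NA terms: ", paste(dropped, collapse = ", "))
  }
  sum(values, na.rm = TRUE)
}

suppressWarnings(combine.terms(list(sackin = 2, gamma = NaN, colless = NA)))
```

Dropping a term changes the scale of the distance for that pair of trees, so whether to drop, impute, or reject is a design choice this sketch does not settle.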
At this point, it works with most of the Kaphi tree stats functions since most take one tree and return a numeric value.
Okay, please record these problems as separate issues. Thanks!
(Sent by email in reply to Mathias Renaud's comment above, Jul 14, 2017, 11:36 AM.)
Closing since remaining bugs have their own issues.
We need to evaluate the kernel method against tree shape statistics and to provide the option to use any combination of such statistics as a similarity measure. It would be nice to use R's model specification format as in glm for this, i.e.:
~ colless + cherries + kernel
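For reference, a one-sided formula like this can be taken apart with base R's `terms()`. The sketch below only extracts the term labels and looks up matching functions; how the terms would then be weighted and combined is not specified in this issue. `stats.from.formula` and the three stand-in functions are hypothetical.

```r
# Toy stand-ins (assumptions). Note that base R already ships a
# stats::kernel function, so a bare name like 'kernel' can collide.
colless <- function(tree) sum(abs(diff(tree)))
cherries <- function(tree) sum(tree == max(tree)) %/% 2
kernel <- function(x, y) sum(x * y)

# Extract the statistic names from a one-sided formula and return the
# matching functions, looked up from the formula's own environment.
stats.from.formula <- function(formula) {
  labels <- attr(stats::terms(formula), "term.labels")
  envir <- environment(formula)
  found <- vapply(labels, exists, logical(1), envir = envir, mode = "function")
  if (!all(found)) {
    stop("functions not available: ", paste(labels[!found], collapse = ", "))
  }
  mget(labels, envir = envir, mode = "function", inherits = TRUE)
}

names(stats.from.formula(~ colless + cherries + kernel))
```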