Incorporating binary labels into kernel distance

ArtPoon commented 6 years ago

Make a new branch
Move tree-processing functions from tree-kernel.R to a new file
tree.kernel() should take regular expressions as label arguments instead of expecting character vectors
The regexes should be applied within tree.kernel() to classify the tip labels from each tree into a finite number of categories, which can be represented by an integer-valued vector. These two integer vectors will be passed to the C-level kernel computation.
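A rough sketch of the classification step, just to pin down what "integer-valued vector" could mean here. `classify.tips` is a hypothetical helper (not something in Kaphi), and the tip labels and pattern are made up; it assumes a single capture group picks out the state.

```r
# Hypothetical helper (not part of Kaphi): extract the state substring from
# each tip label using a regex with one capture group.
classify.tips <- function(tip.labels, pattern, replacement = "\\1") {
  states <- gsub(pattern, replacement, tip.labels)
  # gsub() leaves non-matching labels unchanged, so flag those as NA
  states[!grepl(pattern, tip.labels)] <- NA
  states
}

# made-up tip labels for two trees
tips1 <- c("A_001", "A_002", "B_003")
tips2 <- c("B_101", "C_102")
pattern <- "^([A-Z])_.*$"

s1 <- classify.tips(tips1, pattern)
s2 <- classify.tips(tips2, pattern)

# use one shared set of categories so integers mean the same thing in both trees
categories <- sort(unique(c(s1, s2)))
int1 <- match(s1, categories)   # 1 1 2
int2 <- match(s2, categories)   # 2 3
```

Building the categories from both trees together keeps the integer codes comparable between them.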

ArtPoon commented 6 years ago

@gtng92 pointed out that the kernel distance can be called on trees x and y as k(x,y) or k(y,x), and that if we define two regular expressions then these trees could potentially be processed differently. After discussion we decided to use just one regex for kernel distances.

ArtPoon commented 6 years ago

Please write unit tests to check that the labeled kernel function behaves properly before closing this issue.

ArtPoon commented 6 years ago

On branch issue133, we presently have this in treekernel.R (dropping commented lines):

tree.kernel <- function(tree1, tree2,
                        lambda,        # decay factor
                        sigma,         # RBF variance parameter
                        rho=1.0,         # SST control parameter; 0 = subtree kernel, 1 = subset tree kernel
                        normalize=0,   # normalize kernel score by sqrt(k(t1,t1) * k(t2,t2))
                        regexPattern="",     # arguments for labeled tree kernel
                        regexReplacement="",
                        gamma=0        # label factor
                        ) {
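  # make labels
  # label1/label2 are not arguments in this signature; assuming they are
  # meant to be the tip labels of each tree
  label1 <- tree1$tip.label
  label2 <- tree2$tip.label
  use.label <- if (any(is.na(label1)) || any(is.na(label2)) || is.null(label1) || is.null(label2)) {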
    FALSE
  } else {
    new_label1 <- gsub(regexPattern, regexReplacement, tree1$tip.label)
    new_label2 <- gsub(regexPattern, regexReplacement, tree2$tip.label)
    TRUE
  }

  nwk1 <- .to.newick(tree1)
  nwk2 <- .to.newick(tree2)

  res <- .Call("R_Kaphi_kernel",
                 nwk1, nwk2, lambda, sigma, as.double(rho), use.label, gamma, normalize,
                 PACKAGE="Kaphi")
  return (res)
}

We want to make these changes:

The user provides regular expressions that determine how the substrings that define states are extracted from tip labels. Tip labels have to be unique, but must also share a common substring that tells us whether two tips share the same state, e.g., were sampled from the same compartment.
Instead of gamma, the user should pass a matrix of weights that includes row and column names. These names should correspond to the substrings that the regular expression extracts from the tip labels.
This function should use both arguments to convert the tip labels in each tree into integer-valued vectors, where the integers are indices into the weight matrix. The two integer vectors and the weight matrix (without row/column names) are passed to the C function as vectors (for the matrix, the number of rows and columns is given by the maximum integer values in the respective integer vectors).
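Something like the following is what I have in mind for the R side, as a sketch only. The state names, weights, tip labels, and the helper `to.index` are all invented for illustration; nothing here is the actual Kaphi interface.

```r
# Illustrative weight matrix; state names and values are made up
W <- matrix(c(1.0, 0.2,
              0.2, 1.0),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("blood", "csf"), c("blood", "csf")))

tips1 <- c("pt1_blood_01", "pt1_csf_02", "pt1_blood_03")
tips2 <- c("pt2_csf_01", "pt2_csf_02")
pattern <- "^[^_]+_([^_]+)_.*$"   # assumed label layout: patient_state_id

# Hypothetical helper: map tip labels to indices into the weight matrix
to.index <- function(tip.labels, pattern, states) {
  idx <- match(gsub(pattern, "\\1", tip.labels), states)
  if (any(is.na(idx))) stop("tip label does not map onto the weight matrix")
  idx
}

idx1 <- to.index(tips1, pattern, rownames(W))   # 1 2 1
idx2 <- to.index(tips2, pattern, colnames(W))   # 2 2

# drop the names and flatten; R stores matrices in column-major order
w.vec <- as.vector(unname(W))

# weight for tip i of tree 1 against tip j of tree 2, looked up in the flat vector
lookup <- function(i, j) w.vec[idx1[i] + (idx2[j] - 1L) * nrow(W)]
```

Note that these indices are 1-based, so the C side would need to subtract 1 before indexing.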

ArtPoon commented 6 years ago

regexReplacement should be "\\1" by default (return the first capture group). There may be situations where we want to concatenate two or more groups, so I guess we can let the user define a more complex replacement like "\\1\\2".
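For reference, this is how the two replacement strings behave with gsub(). The tip labels and pattern are made up for the example.

```r
# made-up tip labels of the form patient_compartment_year
tips <- c("patient1_blood_2010", "patient2_csf_2011")
pattern <- "^[^_]+_([a-z]+)_([0-9]+)$"

# default: keep only the first capture group
single <- gsub(pattern, "\\1", tips)        # "blood" "csf"

# concatenate two capture groups into a composite label
combined <- gsub(pattern, "\\1\\2", tips)   # "blood2010" "csf2011"
```

The backslash has to be doubled inside an R string literal so that the regex engine receives `\1`.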

gtng92 commented 6 years ago

On branch issue133, the implementation has changed so that the weight matrix is no longer necessary.

The user provides a regex to extract the substrings from the tip labels
The user provides a character vector of all possible states
Each tip label is assigned a binary-encoded integer value reflecting the state(s) that the tip label matches (5fe35e1)
The integer vectors are passed down to the C level (24aed96), where each integer value is decoded and the different states are matched or mismatched
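To illustrate the idea (this is not the code in 5fe35e1 or 24aed96), here is a minimal sketch of a bit-flag encoding. The state names, the helpers `encode.label` and `shares.state`, and the rule that two labels match when they share a bit are all assumptions for the example.

```r
states <- c("blood", "csf", "lymph")   # hypothetical vector of all possible states

# state k is assigned bit 2^(k-1); a label matching several states sets several bits
encode.label <- function(label, states) {
  hits <- vapply(states, function(s) grepl(s, label, fixed = TRUE), logical(1))
  as.integer(sum(2^(which(hits) - 1)))
}

# one possible decoding rule: two labels match if they have any state in common
shares.state <- function(a, b) bitwAnd(a, b) != 0L

x <- encode.label("pt1_blood_01", states)       # 1
y <- encode.label("pt2_csf_lymph_07", states)   # 2 + 4 = 6
z <- encode.label("pt3_lymph_02", states)       # 4

shares.state(x, y)   # FALSE
shares.state(y, z)   # TRUE
```

With 32-bit integers this scheme would cap out at around 30 states, which seems fine for compartment-style labels.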

ArtPoon commented 6 years ago

We want to refactor the kernel to encode labels in each node's production. Whereas productions could previously take only one of four values (0 for a terminal node, 1 for a node with two non-terminal descendants, etc.), we now want each internal node to have a tuple (pair) of integers as its production, reserving the integer value -1 for a descendant that is an internal node.
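A toy version of the pair-valued production, to make the -1 convention concrete. The nested-list tree representation and the `production` function are stand-ins for illustration only, not the kernel's actual data structures.

```r
# Toy representation (an assumption for this sketch): a tip is an integer
# label, and an internal node is a list of its two children.
production <- function(node) {
  stopifnot(is.list(node), length(node) == 2)
  # each child contributes its label if it is a tip, or -1 if it is internal
  vapply(node,
         function(child) if (is.list(child)) -1L else as.integer(child),
         integer(1))
}

# the tree ((1,2),3): a cherry of tips labeled 1 and 2, joined with tip 3
cherry <- list(1L, 2L)
root <- list(cherry, 3L)

p.cherry <- production(cherry)   # 1 2
p.root <- production(root)       # -1 3
```

Whether the pair should be sorted so that the production does not depend on child order is left open here.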

ArtPoon commented 6 years ago

New labeled kernel is being prototyped in Python, see PoonLab/coevolution phyloK3.py

ArtPoon commented 5 years ago

Need to port the Python implementation to R

PoonLab / Kaphi

Incorporating binary labels into kernel distance #133