emmanuelparadis / ape

Analysis of Phylogenetics and Evolution
https://emmanuelparadis.github.io/
GNU General Public License v2.0
53 stars 11 forks source link

version of keep.tip which considers internal nodes as well? #48

Closed HedvigS closed 2 years ago

HedvigS commented 2 years ago

As it is now, ape::keep.tip() only operates on terminal nodes, i.e. tips, which is all well and good. But sometimes, what I want is something like "keep.node()" where I specify a list of names/locations that apply to internal nodes and/or tips. I could do some iterative keep.tip() applications, but I've found that to be prone to errors and a bit clunky in the application.

There is a way of doing this in python with the newick package, I've made a script which exemplifies this here. The method in question is prune_by_names(inverse = True).

This is mostly useful when you have a structure with for example species nodes and subspecies tips where many/all nodes are even named. In my case as a linguist, this comes in handy with language trees with tips that are dialects and internal nodes that are classified as languages.

I've noticed that this is a functionality other people need as well, see for example @erichround 's solution here.

Personally, I'd really appreciate it if this was a functionality that the ape package would incorporate directly into a function. I like ape and I trust that it's to be maintained and checked for bugs long term, and that it's also one of the first places many other people will go to look for such a function (i.e. it will be scrutinized for bugs etc regularly).

So, I'd humbly like to request that this be considered. I understand that this is an enhancement and addition rather than a bug fix or issue per se. Thank you for your time and work.

emmanuelparadis commented 2 years ago

It seems subtrees does what you are looking for. Maybe a more efficient implementation would use prop.part and keep.tip. Of course, this would work with rooted trees only. In one line, this could be:

keep.tip(prop.part(phy)[[node - Ntip(phy)]])

phy is the tree and node the node number. This can be improved using optionally a node label, etc.

HedvigS commented 2 years ago

Ah I see, that's a way of going about it I hadn't thought about. I thought I had tried the subtreesapproach but that something wasn't working. Based on the documentation I wouldn't have thought this would do what I needed, but I could be misunderstanding.

I'll try the other approach with keep.tip andprop.part, that looks neat!

Would it be too greedy and lazy to suggest a function in ape called something like keep.node() which is essentially just a wrapper for the solution with keep.tip and prop.part? Especially one that handles node labels neatly instead of numbers.

emmanuelparadis commented 2 years ago

I forgot about extract.clade: it seems to do exactly what you want.

HedvigS commented 2 years ago

Oh? Doesn't extract clade keep all the tips that are descendants of a certain node? Would that just keep any tips that are for example dialects? What I want is sort of a reverse extract.clade, which keeps all the nodes that I list including all of their ancestors and drops any nodes or tips below the nodes in the list.

HedvigS commented 2 years ago

I can make a toy example and show you what I mean, maybe that's easiest :)

emmanuelparadis commented 2 years ago

So what you want is:

drop.tip(prop.part(phy)[[node - Ntip(phy)]])

?

HedvigS commented 2 years ago

I'm not sure, I don't think so. At least I can't get that to work.

Let's say I have this tree:

library(ape)
library(tidyverse)

phy <- read.tree(text = "(A,B,(C,D)E)F;") %>% as.phylo()

and I'd like to prune it such that A, B and E are the tips. I.e. in this particular case, I just want to remove C and D. For the moment let's forget about the approach of just dropping C and D, because I want the more general approach that takes any node and makes that the terminal.

#figuring out what the node number is of E

mrca_node <- getMRCA(phy, tip = c("C", "D"))

phy_trimmed <- drop.tip(prop.part(phy)[[mrca_node - Ntip(phy)]])

> Error in drop.tip(prop.part(phy)[[mrca_node - Ntip(phy)]]) : 
  object "phy" is not of class "phylo"
emmanuelparadis commented 2 years ago

read.tree returns an object of class "phylo", so no need to send it to as.phylo().

What about this: drop.tip(phy, prop.part(phy)[[mrca_node - Ntip(phy)]], trim.internal = FALSE)? You may write it on several lines for better clarity (and more efficient if you need to repeat this with the same tree):

pp <- prop.part(phy)
n <- Ntip(phy)
thenode <- pp[[mrca_node - n]]
drop.tip(phy, thenode, trim.internal = FALSE)`
HedvigS commented 2 years ago

read.tree returns an object of class "phylo", so no need to send it to as.phylo().

Yes that's what I thought as well, but then I got the error message object "phy" is not of class "phylo" so i got a bit confused.

What about this: drop.tip(phy, prop.part(phy)[[mrca_node - Ntip(phy)]], trim.internal = FALSE)? You may write it on several lines for better clarity (and more efficient if you need to repeat this with the same tree):

pp <- prop.part(phy)
n <- Ntip(phy)
thenode <- pp[[mrca_node - n]]
drop.tip(phy, thenode, trim.internal = FALSE)`

Okay I'll try that!

HedvigS commented 2 years ago

I think that did the trick, I'll try it with some more complex cases. Thanks!

HedvigS commented 2 years ago

Okay, I've got a more complex use case here for my original problem. Here is a tree with node and tip labels and a number of nodes and tips that I want to keep:

library(ape)

mayan_tree_nwk_string <- "(((((agua1252:1,ixil1251:1)ixil1250:1,(mamm1241:1,tekt1235:1)mame1240:1)grea1277:1,(((kaqc1270:1,tzut1248:1)cakc1244:1,(achi1256:1,kich1262:1)quic1275:1,saca1238:1,sipa1247:1)core1251:1,kekc1242:1,(poqo1253:1,poqo1254:1)poco1241:1,uspa1245:1)grea1276:1)quic1274:1,((((chol1282:1,(buen1245:1,mira1253:1,tamu1247:1)taba1266:1)chol1281:1,(chol1283:1,chor1273:1)chor1272:1,epig1241:1)chol1287:1,((chan1320:1,tena1239:1)tzel1254:1,tzot1259:1)tzel1253:1)chol1286:1,((chuj1250:1,tojo1241:1)chuj1249:1,((west2635:1,popt1235:1,qanj1241:1)kanj1263:1,(moto1243:1,tuza1238:1)moch1257:1)kanj1262:1)kanj1261:1)west2865:1,((itza1241:1,mopa1243:1)mopa1242:1,((laca1244:1,naja1242:1)laca1243:1,yuca1254:1)yuca1253:1)yuca1252:1)core1254:1,(chic1271:1,huas1242:1)huas1241:1)maya1287:1;"

mayan_tree <- ape::read.tree(text = mayan_tree_nwk_string)

tip_and_nodes_to_keep <- c("agua1252", "ixil1251", "mamm1241", "tekt1235", "kaqc1270", "tzut1248" ,"achi1256", "kich1262", "saca1238", "sipa1247" ,"kekc1242", "poqo1253", "poqo1254", "uspa1245","chol1282","chol1283" ,"chor1273" ,"epig1241", "tzot1259",  "west2635" ,"popt1235" ,"qanj1241" ,"itza1241" ,"mopa1243", "yuca1254" ,"chic1271", "huas1242","chuj1249", "mira1253")

Any node or tip not mentioned in the vector should be dropped and the branches above should be trimmed until they reach a lineage that ends in a node/tip which is in the vector.

I could work out some solution with the above approach with a for-loop, but. Hm. I'm just thinking there should be an easier way.

HedvigS commented 2 years ago

As always I'm very grateful for your time. It's a great package you've made!

HedvigS commented 2 years ago

Now... I am by no means saying this is pretty, elegant or robust.. but this does work...

library(ape)
library(tidyverse)
library(assertthat)

mayan_tree_nwk_string <- "(((((agua1252:1,ixil1251:1)ixil1250:1,(mamm1241:1,tekt1235:1)mame1240:1)grea1277:1,(((kaqc1270:1,tzut1248:1)cakc1244:1,(achi1256:1,kich1262:1)quic1275:1,saca1238:1,sipa1247:1)core1251:1,kekc1242:1,(poqo1253:1,poqo1254:1)poco1241:1,uspa1245:1)grea1276:1)quic1274:1,((((chol1282:1,(buen1245:1,mira1253:1,tamu1247:1)taba1266:1)chol1281:1,(chol1283:1,chor1273:1)chor1272:1,epig1241:1)chol1287:1,((chan1320:1,tena1239:1)tzel1254:1,tzot1259:1)tzel1253:1)chol1286:1,((chuj1250:1,tojo1241:1)chuj1249:1,((west2635:1,popt1235:1,qanj1241:1)kanj1263:1,(moto1243:1,tuza1238:1)moch1257:1)kanj1262:1)kanj1261:1)west2865:1,((itza1241:1,mopa1243:1)mopa1242:1,((laca1244:1,naja1242:1)laca1243:1,yuca1254:1)yuca1253:1)yuca1252:1)core1254:1,(chic1271:1,huas1242:1)huas1241:1)maya1287:1;"

mayan_tree <- ape::read.tree(text = mayan_tree_nwk_string)

tips_and_nodes_to_keep <- c("agua1252", "ixil1251", "mamm1241", "tekt1235", "kaqc1270", "tzut1248" ,"achi1256", "kich1262", "saca1238", "sipa1247" ,"kekc1242", "poqo1253", "poqo1254", "uspa1245","chol1282","chol1283" ,"chor1273" ,"epig1241", "tzot1259",  "west2635" ,"popt1235" ,"qanj1241" ,"itza1241" ,"mopa1243", "yuca1254" ,"chic1271", "huas1242","chuj1249", "mira1253")

keep.nodes_and_tips<- function(tree, tips_and_nodes_to_keep){

#for bug running example
#  tree <- mayan_tree
#  tips_and_nodes_to_keep <- tips_and_nodes_to_keep

#making a vector of the items in tips_and_nodes_to_keep that are indeed nodes
nodes_to_keep <-  tree$node.label %>% 
  as.data.frame() %>% 
  rename(tips_and_nodes_to_keep = ".") %>% 
  inner_join(as.data.frame(tips_and_nodes_to_keep), by = "tips_and_nodes_to_keep")

#making a table with the node number of each node.label and also it's associated children
node_labels_in_edge <- tree$node.label[tree$edge[,1]-Ntip(tree)]
tips_nodes <- tree$edge[,2]

select.tip.or.node <- function(element, tree) {
  ifelse(element < Ntip(tree)+1, tree$tip.label[element], tree$node.label[element-Ntip(tree)])
}

edge_table <- data.frame(
  "parent" = tree$edge[,1],
  "parent.name" = sapply(tree$edge[,1], select.tip.or.node, tree = tree),
  "child" = tree$edge[,2],
  "child.name" = sapply(tree$edge[,2], select.tip.or.node, tree = tree)
)

#make a vector of the children of nodes that we are keeping. This will be used to give to ape::drop.tip()
children_to_trim <- nodes_to_keep %>% 
  rename(parent.name = tips_and_nodes_to_keep) %>% 
  inner_join(edge_table, by = "parent.name") %>% 
  dplyr::select(child.name) %>% 
  .[,1]

#dropping children of nodes we are keeping, not trimming internally so that we do keep the node
tree_trimmed_for_children <- ape::drop.tip(tree, children_to_trim, trim.internal = F)

#use keep.tip to trim away any other tips we don't want
tree_trimmed_for_children_and_tips <- ape::keep.tip(tree_trimmed_for_children, tips_and_nodes_to_keep )

#double checking that we are indeed keeping what we intended to keep, not more or less
n_to_keep <- tips_and_nodes_to_keep %>% length()

joined <- tree_trimmed_for_children_and_tips$tip.label %>% 
  as.data.frame() %>%
  rename(tips_and_nodes_to_keep = ".") %>% 
  inner_join(as.data.frame(tips_and_nodes_to_keep), by = "tips_and_nodes_to_keep")

x <- assertthat::assert_that(nrow(joined) == n_to_keep, msg = "The resulting tree does not match the list of the nodes and tips to keep.")

#outputting the trimmed tree
tree_trimmed_for_children_and_tips
}

trimmed_tree <- keep.nodes_and_tips(tree = mayan_tree, tips_and_nodes_to_keep = tips_and_nodes_to_keep)

This example has nodes to keep which are just one level above the tips, I'd have to tweak it some more for accomodating nodes higher up.

***EDIT I updated it to use assert_that()

HedvigS commented 2 years ago

I can un-tidyversify it. I made it like this with a lot of unnecessary data frames to make it easier to step through and sanity check for myself. I know that there are neater base-ways of doing these matches etc.

HedvigS commented 2 years ago

There are two things I'd like to address with improvements to my solution above:

1) make it applicable to any level of the internal nodes to keep 2) have a sanity check that the list of tips and nodes to keep doesn't contain tips that are children of a node that is to be kept

HedvigS commented 2 years ago

Alright, I've made an improved version that addresses point (1) above :)

library(ape)
library(tidyverse)
library(assertthat)
library(phangorn)

#demo data
mayan_tree_nwk_string <- "(((((agua1252:1,ixil1251:1)ixil1250:1,(mamm1241:1,tekt1235:1)mame1240:1)grea1277:1,(((kaqc1270:1,tzut1248:1)cakc1244:1,(achi1256:1,kich1262:1)quic1275:1,saca1238:1,sipa1247:1)core1251:1,kekc1242:1,(poqo1253:1,poqo1254:1)poco1241:1,uspa1245:1)grea1276:1)quic1274:1,((((chol1282:1,(buen1245:1,mira1253:1,tamu1247:1)taba1266:1)chol1281:1,(chol1283:1,chor1273:1)chor1272:1,epig1241:1)chol1287:1,((chan1320:1,tena1239:1)tzel1254:1,tzot1259:1)tzel1253:1)chol1286:1,((chuj1250:1,tojo1241:1)chuj1249:1,((west2635:1,popt1235:1,qanj1241:1)kanj1263:1,(moto1243:1,tuza1238:1)moch1257:1)kanj1262:1)kanj1261:1)west2865:1,((itza1241:1,mopa1243:1)mopa1242:1,((laca1244:1,naja1242:1)laca1243:1,yuca1254:1)yuca1253:1)yuca1252:1)core1254:1,(chic1271:1,huas1242:1)huas1241:1)maya1287:1;"

mayan_tree <- ape::read.tree(text = mayan_tree_nwk_string)

clade_tree <- ape::extract.clade(mayan_tree, 50)

tips_and_nodes_to_keep <- c("kanj1261", "tzel1253", "epig1241")

##FUNS
fun_Descendants_node_labels <- function(tree, node, type = "tips"){

  #making a table with the node number of each node.label and also it's associated children
  node_labels_in_edge <- tree$node.label[tree$edge[,1]-Ntip(tree)]
  tips_nodes <- tree$edge[,2]

  select.tip.or.node <- function(element, tree) {
    ifelse(element < Ntip(tree)+1, tree$tip.label[element], tree$node.label[element-Ntip(tree)])
  }

  edge_table <- data.frame(
    "parent" = tree$edge[,1], #node number of the parent
    "parent.name" = sapply(tree$edge[,1], select.tip.or.node, tree = tree),
    "child" = tree$edge[,2], #node number of the child
    "child.name" = sapply(tree$edge[,2], select.tip.or.node, tree = tree)
  )

  nodes_to_keep_number <- node %>% 
    as.data.frame() %>% 
    rename(parent.name = ".") %>% 
    inner_join(edge_table, by = "parent.name") %>% 
    dplyr::select("parent") %>% 
    .[,1] %>% 
    unique()

descendants_node_numbers <- phangorn::Descendants(tree, node = nodes_to_keep_number, type = type)  %>%  
  unlist() %>% 
  as.numeric()

descendants_tips_labels <- descendants_node_numbers %>% 
  as.data.frame() %>% 
  rename(child = ".") %>% 
  inner_join(edge_table, by = "child") %>% 
  dplyr::select("child.name") %>% 
  .[,1] %>% 
  unique()

descendants_tips_labels
}

keep.nodes_and_tips<- function(tree, tips_and_nodes_to_keep){

#for bug running example
#  tree <- clade_tree

#making a vector of the items in tips_and_nodes_to_keep that are indeed nodes

nodes_df <- tree$node.label %>% 
  as.data.frame()   

nodes_to_keep <-  nodes_df %>% 
  rename(tips_and_nodes_to_keep = ".") %>% 
  inner_join(as.data.frame(tips_and_nodes_to_keep), by = "tips_and_nodes_to_keep") %>% 
  .[,1]

for (node_name in nodes_to_keep) {
#node_name <- nodes_to_keep[2]
    tips_to_drop <- fun_Descendants_node_labels(tree, node = node_name, type = "tips") %>% 
    unlist()

  while (length(tips_to_drop) !=0){

    tree <- ape::drop.tip(tree, tip = tips_to_drop, trim.internal = F, collapse.singles = F) 
    tips_to_drop <- fun_Descendants_node_labels(tree, node = node_name, type = "tips") %>% 
      unlist()
      }
}

tree <- keep.tip(tree, tip = tips_and_nodes_to_keep)

#double checking that we are indeed keeping what we intended to keep, not more or less
n_to_keep <- tips_and_nodes_to_keep %>% length()

joined <- tree$tip.label %>% 
  as.data.frame() %>%
  rename(tips_and_nodes_to_keep = ".") %>% 
  inner_join(as.data.frame(tips_and_nodes_to_keep), by = "tips_and_nodes_to_keep")

x <- assertthat::assert_that(nrow(joined) == n_to_keep, msg = "The resulting tree does not match the list of the nodes and tips to keep.")

#outputting the trimmed tree
tree
}

trimmed_tree <- keep.nodes_and_tips(tree = clade_tree, tips_and_nodes_to_keep = tips_and_nodes_to_keep)

trimmed_tree %>% plot.phylo(show.node.label = T)
HedvigS commented 2 years ago

I improved a little bit again on my solution and uploaded it here.

For the moment being, this solves my current problems. If you want to consider adopting something like this in your package, we can continue discussing but otherwise we could close this now.

HedvigS commented 2 years ago

I improved a bit on my solution once again, and wrote a short explainer of the three methods I've encountered so far for doing this.