YuLab-SMU / treeio

:seedling: Base Classes and Functions for Phylogenetic Tree Input and Output
https://yulab-smu.top/treedata-book/

update full_join method #92

Closed xiangpin closed 1 year ago

xiangpin commented 1 year ago

Description

update full_join method

Related Issue

full_join() on a treedata object does not work with the standard dplyr UI of by=c('columnX'='columnY'). The related issue is https://github.com/YuLab-SMU/tidytree/issues/32.
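For reference, the named-vector form of `by` in plain dplyr looks like the sketch below. It uses ordinary data frames as a stand-in for the tree's node table and does not call any treeio method; the `nodes` table and the `taxon` column name are assumptions made up for illustration.

```r
library(dplyr)

# Hypothetical stand-in for a tree's node table (not produced by treeio here).
nodes <- data.frame(node = 1:4, label = c("t2", "t1", "t4", "t3"))

# External data whose key column has a different name ('taxon' is assumed).
da <- data.frame(taxon = c("t1", "t2"), values = c(10, 20))

# Named-vector form of `by`: match nodes$label against da$taxon.
joined <- full_join(nodes, da, by = c("label" = "taxon"))
print(joined)
```

The key column keeps the name from the left-hand table (`label`), which is the behaviour users would expect the treedata method to mirror.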

In addition, the original full_join generates errors if the external data.frame contains labels that are not present in the tree.

If da contains duplicated node rows, the original phylo tree structure is damaged.

> library(treeio)
> tr <- rtree(4)
> da <- data.frame(label=c('t1', 't2', 't8'), values=c(10, 20, 80))
> tr %>% full_join(da, by='label') %>% ggtree::ggtree()
> tr %>% full_join(da, by='label')
'treedata' S4 object'.

...@ phylo:

Phylogenetic tree with 4 tips and 4 internal nodes.

Tip labels:
  t2, t3, t1, t4
Node labels:
  NA, NA, NA, t8

Rooted; includes branch lengths.

with the following features available:
  'values'.

# The associated data tibble abstraction: 8 × 4
# The 'node', 'label' and 'isTip' are from the phylo tree.
   node label isTip values
  <int> <chr> <lgl>  <dbl>
1     1 t2    TRUE      20
2     2 t3    TRUE      NA
3     3 t1    TRUE      10
4     4 t4    TRUE      NA
5     5 NA    FALSE     NA
6     6 NA    FALSE     NA
7     7 NA    FALSE     NA
8     8 t8    FALSE     NA
> tr <- rtree(4)
> da <- data.frame(label=c('t1', 't2', 't3', 't3'), values=c(10, 20, 80, 90))
> tr %>% full_join(da, by='label')
'treedata' S4 object'.

...@ phylo:

Phylogenetic tree with 5 tips and 3 internal nodes.

Tip labels:
  t2, t1, t4, t3, t3

Rooted; includes branch lengths.

with the following features available:
  'values'.

# The associated data tibble abstraction: 11 × 4
# The 'node', 'label' and 'isTip' are from the phylo tree.
    node label isTip values
   <int> <chr> <lgl>  <dbl>
 1     1 t2    TRUE      20
 2     2 t1    TRUE      10
 3     3 t4    TRUE      NA
 4     4 t3    TRUE      80
 5     4 t3    TRUE      90
 6     4 t3    TRUE      80
 7     4 t3    TRUE      90
 8     5 t3    FALSE     NA
 9     6 NA    FALSE     NA
10     7 NA    FALSE     NA
# … with 1 more row
# ℹ Use `print(n = ...)` to see more rows

In the first example, t8 comes from da but does not exist in the phylo tree. I think it is better to remove it when da is joined, so that full_join behaves like left_join on a treedata or phylo object, because it is difficult to add a new node or tip to a phylo tree without other information such as edge.length.
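The difference between the two join semantics can be sketched with plain data frames. This is only an illustration of dplyr's behaviour on a hypothetical node table (`nodes` is an assumed name), not the treeio implementation.

```r
library(dplyr)

# Hypothetical node table standing in for the tree (tips only, for brevity).
nodes <- data.frame(node = 1:4, label = c("t2", "t1", "t4", "t3"))
da <- data.frame(label = c("t1", "t2", "t8"), values = c(10, 20, 80))

# A true full join appends a row for 't8', which has no node in the tree.
full <- full_join(nodes, da, by = "label")

# A left join keeps exactly the tree's nodes and silently drops 't8'.
left <- left_join(nodes, da, by = "label")

print(full)
print(left)
```

In the full join, the extra `t8` row has `NA` for `node`, which is the kind of row that cannot be mapped back onto the phylo structure.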

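So this update makes full_join drop labels that are not in the tree and keep one row per node when da contains duplicated rows.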

Example

> tr <- rtree(4)
> da <- data.frame(label=c('t1', 't2', 't8'), values=c(10, 20, 80))
> tr %>% full_join(da, by='label')
'treedata' S4 object'.

...@ phylo:

Phylogenetic tree with 4 tips and 3 internal nodes.

Tip labels:
  t2, t1, t4, t3

Rooted; includes branch lengths.

with the following features available:
  '', 'values'.

# The associated data tibble abstraction: 7 × 4
# The 'node', 'label' and 'isTip' are from the phylo tree.
   node label isTip values
  <int> <chr> <lgl>  <dbl>
1     1 t2    TRUE      20
2     2 t1    TRUE      10
3     3 t4    TRUE      NA
4     4 t3    TRUE      NA
5     5 NA    FALSE     NA
6     6 NA    FALSE     NA
7     7 NA    FALSE     NA
> da <- data.frame(label=c('t1', 't2', 't3', 't3'), values=c(10, 20, 80, 90))
> tr %>% full_join(da, by='label')
'treedata' S4 object'.

...@ phylo:

Phylogenetic tree with 4 tips and 3 internal nodes.

Tip labels:
  t2, t1, t4, t3

Rooted; includes branch lengths.

with the following features available:
  '', 'values'.

# The associated data tibble abstraction: 7 × 4
# The 'node', 'label' and 'isTip' are from the phylo tree.
   node label isTip values
  <int> <chr> <lgl> <list>
1     1 t2    TRUE  <tibble [1 × 1]>
2     2 t1    TRUE  <tibble [1 × 1]>
3     3 t4    TRUE  <tibble [1 × 1]>
4     4 t3    TRUE  <tibble [2 × 1]>
5     5 NA    FALSE <tibble [1 × 1]>
6     6 NA    FALSE <tibble [1 × 1]>
7     7 NA    FALSE <tibble [1 × 1]>
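The list column in the output above can be approximated with plain dplyr and tidyr. The sketch below is an assumption about the general technique (join, then nest the duplicated values per node), not a copy of the code in this pull request; `nodes` is a hypothetical stand-in for the tree's tip table.

```r
library(dplyr)
library(tidyr)

# Hypothetical node table standing in for the tree (tips only, for brevity).
nodes <- data.frame(node = 1:4, label = c("t2", "t1", "t4", "t3"))
da <- data.frame(label = c("t1", "t2", "t3", "t3"), values = c(10, 20, 80, 90))

# The left join duplicates node 4, because 't3' appears twice in da.
joined <- left_join(nodes, da, by = "label")

# Nesting 'values' collapses the duplicates back to one row per node,
# storing each node's values as a small tibble in a list column.
nested <- nest(joined, values = values)
print(nested)
```

Keeping one row per node is what lets the number of tips and internal nodes stay unchanged, at the cost of `values` becoming a list column that has to be unnested before plotting.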