TysonStanley / tidyfast

Fast and efficient alternatives to tidyr functions built on data.table #rdatatable #rstats
https://tysonbarrett.com/tidyfast/

unnest_dt doesn't preserve list columns #31

Closed kendonB closed 4 years ago

kendonB commented 4 years ago
library(tidyverse)
make_df <- function(x){
  tibble(numbers = c(1, 2))
}
iris %>% group_by(Species) %>% 
  nest() %>% 
  mutate(df_col = data %>% map(make_df)) %>% 
  # df_col is now a list column holding one tibble per group, alongside data.
  # unnest() keeps df_col and repeats it for each unnested row, like a numeric column.
  unnest(data)
#> # A tibble: 150 x 6
#> # Groups:   Species [3]
#>    Species Sepal.Length Sepal.Width Petal.Length Petal.Width df_col          
#>    <fct>          <dbl>       <dbl>        <dbl>       <dbl> <list>          
#>  1 setosa           5.1         3.5          1.4         0.2 <tibble [2 x 1]>
#>  2 setosa           4.9         3            1.4         0.2 <tibble [2 x 1]>
#>  3 setosa           4.7         3.2          1.3         0.2 <tibble [2 x 1]>
#>  4 setosa           4.6         3.1          1.5         0.2 <tibble [2 x 1]>
#>  5 setosa           5           3.6          1.4         0.2 <tibble [2 x 1]>
#>  6 setosa           5.4         3.9          1.7         0.4 <tibble [2 x 1]>
#>  7 setosa           4.6         3.4          1.4         0.3 <tibble [2 x 1]>
#>  8 setosa           5           3.4          1.5         0.2 <tibble [2 x 1]>
#>  9 setosa           4.4         2.9          1.4         0.2 <tibble [2 x 1]>
#> 10 setosa           4.9         3.1          1.5         0.1 <tibble [2 x 1]>
#> # ... with 140 more rows

iris %>% group_by(Species) %>% 
  nest() %>% 
  mutate(df_col = data %>% map(make_df)) %>% 
  # Same nested input: df_col is a list column holding one tibble per group.
  # I'd like df_col to be kept and repeated like a numeric column, but dt_unnest() drops it.
  tidyfast::dt_unnest(data)
#>        Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#>   1:    setosa          5.1         3.5          1.4         0.2
#>   2:    setosa          4.9         3.0          1.4         0.2
#>   3:    setosa          4.7         3.2          1.3         0.2
#>   4:    setosa          4.6         3.1          1.5         0.2
#>   5:    setosa          5.0         3.6          1.4         0.2
#>  ---                                                            
#> 146: virginica          6.7         3.0          5.2         2.3
#> 147: virginica          6.3         2.5          5.0         1.9
#> 148: virginica          6.5         3.0          5.2         2.0
#> 149: virginica          6.2         3.4          5.4         2.3
#> 150: virginica          5.9         3.0          5.1         1.8

Created on 2020-08-07 by the reprex package (v0.3.0)

vincentarelbundock commented 4 years ago

One approach would be to unnest first and then join the original data back on by the grouping column, which restores the list columns. The unnesting step itself might also be sped up somewhat with a plain data.table grouped call. For instance,

library(microbenchmark)
library(data.table)
library(tidyfast)
library(broom)
dat = data.table(group=sample(1:500, 1e4, replace=TRUE),
                 y=rnorm(1e4), 
                 x=rnorm(1e4))
fit = function(k) lm(y ~ x, k)
dat = dat[, .(data=.(.SD)), by=group][
          , model := lapply(data, fit)][
          , result := lapply(model, tidy)]

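# tf: tidyfast unnest, which drops the other list columns
tf = function() dt_unnest(dat, result)
# keep: grouped unnest, then join dat back on to restore the list columns
keep = function() dat[, result[[1L]], by=group][dat, on='group']
# discard: grouped unnest only
discard = function() dat[, result[[1L]], by=group]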

all(tf() == discard())
#> [1] TRUE

microbenchmark(tf(), keep(), discard(),
               times=100)
#> Unit: milliseconds
#>       expr      min        lq      mean    median        uq       max neval
#>       tf() 9.661796 10.762439 14.220490 11.998833 16.118330 75.664467   100
#>     keep() 1.987287  2.142959  2.453322  2.238526  2.380653  6.861493   100
#>  discard() 1.040819  1.152247  1.350544  1.191882  1.265033  6.042629   100

Created on 2020-08-17 by the reprex package (v0.3.0)
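
For reference, the same unnest-then-join idea can be applied to the iris example from the original report. This is only a minimal sketch using data.table alone, not how tidyfast implements `dt_unnest`; the object names `nested`, `unnested`, and `kept` are made up for illustration, and `df_col` holds a small data.table rather than a tibble to avoid the tidyverse dependency.

```r
library(data.table)

# Nest iris by Species, then add a second list column (df_col) holding
# one small table per group, mirroring the reprex in the report.
nested <- as.data.table(iris)[, .(data = list(.SD)), by = Species]
nested[, df_col := lapply(data, function(x) data.table(numbers = c(1, 2)))]

# Unnest `data` with a grouped call. On its own this discards df_col.
unnested <- nested[, data[[1L]], by = Species]

# Join the remaining list column back on by the grouping key, so that
# df_col is repeated for every unnested row of its group.
kept <- unnested[nested[, .(Species, df_col)], on = "Species"]
```

The join only carries `Species` and `df_col` on the right-hand side, so the already-unnested `data` column is not dragged along. If several list columns need to be preserved, they could all be selected in that inner `.()` call.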