TysonStanley / tidyfast

Fast and efficient alternatives to tidyr functions built on data.table #rdatatable #rstats

unnest_dt doesn't preserve list columns #31

kendonB closed this issue 4 years ago

kendonB commented 4 years ago
library(tidyverse)

# Returns the same two-row tibble for every group, whatever x is
make_df <- function(x){
  tibble(numbers = c(1, 2))
}
iris %>% group_by(Species) %>%
  nest() %>%
  mutate(df_col = data %>% map(make_df)) %>%
  # Now there is a column of single data.frames as well as data
  # I'd like this to unnest like a numeric column.
  unnest(data)
#> # A tibble: 150 x 6
#> # Groups:   Species [3]
#>    Species Sepal.Length Sepal.Width Petal.Length Petal.Width df_col          
#>    <fct>          <dbl>       <dbl>        <dbl>       <dbl> <list>          
#>  1 setosa           5.1         3.5          1.4         0.2 <tibble [2 x 1]>
#>  2 setosa           4.9         3            1.4         0.2 <tibble [2 x 1]>
#>  3 setosa           4.7         3.2          1.3         0.2 <tibble [2 x 1]>
#>  4 setosa           4.6         3.1          1.5         0.2 <tibble [2 x 1]>
#>  5 setosa           5           3.6          1.4         0.2 <tibble [2 x 1]>
#>  6 setosa           5.4         3.9          1.7         0.4 <tibble [2 x 1]>
#>  7 setosa           4.6         3.4          1.4         0.3 <tibble [2 x 1]>
#>  8 setosa           5           3.4          1.5         0.2 <tibble [2 x 1]>
#>  9 setosa           4.4         2.9          1.4         0.2 <tibble [2 x 1]>
#> 10 setosa           4.9         3.1          1.5         0.1 <tibble [2 x 1]>
#> # ... with 140 more rows

iris %>% group_by(Species) %>%
  nest() %>%
  mutate(df_col = data %>% map(make_df)) %>%
  # Now there is a column of single data.frames as well as data
  # I'd like this to unnest like a numeric column.
  tidyfast::dt_unnest(data)
#>        Species Sepal.Length Sepal.Width Petal.Length Petal.Width
#>   1:    setosa          5.1         3.5          1.4         0.2
#>   2:    setosa          4.9         3.0          1.4         0.2
#>   3:    setosa          4.7         3.2          1.3         0.2
#>   4:    setosa          4.6         3.1          1.5         0.2
#>   5:    setosa          5.0         3.6          1.4         0.2
#>  ---                                                            
#> 146: virginica          6.7         3.0          5.2         2.3
#> 147: virginica          6.3         2.5          5.0         1.9
#> 148: virginica          6.5         3.0          5.2         2.0
#> 149: virginica          6.2         3.4          5.4         2.3
#> 150: virginica          5.9         3.0          5.1         1.8

Created on 2020-08-07 by the reprex package (v0.3.0)

vincentarelbundock commented 4 years ago

One approach would be to merge the original data back in after unnesting. There might also be ways to improve performance somewhat before doing this. For instance,

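library(data.table)
library(tidyfast)
library(broom)
library(microbenchmark)

# The x and y columns were cut off in the original post; standard-normal draws are assumed here
dat = data.table(group=sample(1:500, 1e4, replace=TRUE),
                 x=rnorm(1e4),
                 y=rnorm(1e4))
fit = function(k) lm(y ~ x, k)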
dat = dat[, .(data=.(.SD)), by=group][
          , model := lapply(data, fit)][
          , result := lapply(model, tidy)]

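# tidyfast unnest of the result column
tf = function() dt_unnest(dat, result)
# data.table unnest, then join the nested table back on to recover its list columns
keep = function() dat[, result[[1L]], by=group][dat, on='group']
# data.table unnest that discards the other list columns
discard = function() dat[, result[[1L]], by=group]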

all(tf() == discard())
#> [1] TRUE

microbenchmark(tf(), keep(), discard())
#> Unit: milliseconds
#>       expr      min        lq      mean    median        uq       max neval
#>       tf() 9.661796 10.762439 14.220490 11.998833 16.118330 75.664467   100
#>     keep() 1.987287  2.142959  2.453322  2.238526  2.380653  6.861493   100
#>  discard() 1.040819  1.152247  1.350544  1.191882  1.265033  6.042629   100

Created on 2020-08-17 by the reprex package (v0.3.0)
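
For reference, the same merge-back idea can be applied to the iris example from the top of the issue. This is only a sketch in plain data.table, not something tidyfast does internally; the names nested, flat and restored are made up for illustration.

```r
library(data.table)

# Nest iris by Species and add a list column holding one small table per group
nested <- as.data.table(iris)[, .(data = list(.SD)), by = Species]
nested[, df_col := lapply(data, function(d) data.table(numbers = c(1, 2)))]

# Unnest `data` on its own: every other list column, including df_col, is dropped
flat <- nested[, data[[1L]], by = Species]

# Join the untouched list column back on by the grouping key
restored <- nested[, .(Species, df_col)][flat, on = "Species"]

dim(restored)            # one row per original observation, df_col included
class(restored$df_col)   # still a list column
```

The join costs one extra keyed lookup per unnested row, which is in line with the gap between keep() and discard() in the benchmark above.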