TysonStanley / tidyfast

Fast and efficient alternatives to tidyr functions built on data.table #rdatatable #rstats
https://tysonbarrett.com/tidyfast/

Propose fill args in dt_unnest() #25

Open leungi opened 4 years ago

leungi commented 4 years ago

Reprex and proposal below.

library(tidyfast)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# |- data ----
dat <- structure(
  list(
    id = c("11", "22"),
    phase = c("a", "b"),
    values = list(
      structure(
        list(
          a = 0.0584563566053344,
          b = 192,
          c = "50%",
          d = 1,
          e = 0,
          f = 0,
          g = 0
        ),
        row.names = c(NA, -1L),
        class = c("tbl_df",
                  "tbl", "data.frame")
      ),
      structure(
        list(
          c = "50%",
          d = 465L,
          e = 0,
          g = 290514.430137519,
          b = 10961.9288476965,
          a = 0.359973896295374,
          h = 1.46588348984196,
          f = 119.108387941727
        ),
        row.names = c(NA,
                      -1L),
        class = c("tbl_df", "tbl", "data.frame")
      )
    )
  ),
  row.names = c(NA,
                -2L),
  class = c("tbl_df", "tbl", "data.frame")
)

# |- current ----
dat %>% 
  tidyfast::dt_unnest(values)
#> Error in rbindlist(eval(col)): Item 2 has 8 columns, inconsistent with item 1 which has 7 columns. To fill missing columns use fill=TRUE.

# |- proposed ----
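dt_unnest.default_edit <- function(dt_, col, fill = FALSE, ...){
  if (isFALSE(data.table::is.data.table(dt_)))
    dt_ <- data.table::as.data.table(dt_)

  # capture the unquoted list column and any extra columns passed via ...
  col    <- substitute(col)
  keep   <- substitute(alist(...))
  print(keep)  # debug print; shows up as `alist()` in the output below
  names  <- colnames(dt_)
  others <- names[-match(paste(col), names)]  # every column except the list column
  rows   <- sapply(dt_[[paste(col)]], NROW)   # row count of each nested table

  # if columns were named in ..., keep only those
  if (length(keep) > 1)
    others <- others[others %in% paste(keep)[-1]]

  # drop any other list columns, then repeat the rest once per nested row
  others_dt <- dt_[, ..others]
  classes   <- sapply(others_dt, typeof)
  keep      <- names(classes)[classes != "list"]
  others_dt <- others_dt[, ..keep]
  others_dt <- lapply(others_dt, rep, times = rows)

  # bind the nested tables; with fill = TRUE, missing columns become NA
  dt_[, list(data.table::as.data.table(others_dt),
             data.table::rbindlist(eval(col),
                                   fill = fill))]
}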

dat %>% 
  dt_unnest.default_edit(values, fill = TRUE)
#> alist()
#>    id phase          a        b   c   d e        f        g        h
#> 1: 11     a 0.05845636   192.00 50%   1 0   0.0000      0.0       NA
#> 2: 22     b 0.35997390 10961.93 50% 465 0 119.1084 290514.4 1.465883

Created on 2020-04-05 by the reprex package (v0.3.0)

TysonStanley commented 4 years ago

I like this idea! It's definitely a natural extension. If you want, feel free to open a pull request with this and I'll merge it and add you to the contributor list.

markfairbanks commented 4 years ago

@TysonStanley The new version of dt_unnest() causes this feature to no longer work. Should this be reopened?

pacman::p_load(tidyfast, data.table, magrittr)

df1 <- data.table(a = "a", b = 1)
df2 <- data.table(a = rep("a", 3), b = 1:3, c = 1:3)

nested_df <- data.table(id = 1:2,
                        list_col = list(df1, df2))

nested_df %>%
  dt_unnest(list_col)
#> Error in `[.data.table`(dt_, , eval(col)[[1L]], by = others): j doesn't evaluate to the same number of columns for each group

TysonStanley commented 4 years ago

That is interesting... That was one advantage of using rbindlist(), but if possible, I really want to use the [[ approach. Any ideas?

markfairbanks commented 4 years ago

Maybe extract the list column and check if the nested data.tables have a consistent number of columns?

df1 <- data.table(a = "a", b = 1)
df2 <- data.table(a = rep("a", 3), b = 1:3, c = 1:3)

test_list <- list(df1, df2)

if (length(unique(lengths(test_list))) > 1) {
  "rbindlist code"
} else {
  "[[1]] code"
}
#> [1] "rbindlist code"

TysonStanley commented 4 years ago

Yeah, I was thinking something similar. I can't find anything in data.table's [[ approach that we could change. The issue with checking first is the additional cost of getting the lengths, especially with really large data... I wonder how often this comes up. @leungi is this something you encounter a lot?
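
For reference, the check-then-dispatch idea discussed above could look roughly like the sketch below. `unnest_checked()` is a hypothetical helper, not part of tidyfast, and it always binds with rbindlist(), switching on `fill` only when the nested tables disagree on column count. `lengths()` only reads each element's length, so the check is probably cheap next to the binding itself, but that is not benchmarked here. It would also miss nested tables that have the same number of columns under different names.

```r
library(data.table)

# Hypothetical helper (not tidyfast's implementation): unnest a list column
# of data.tables, using fill only when the column counts differ.
unnest_checked <- function(dt, col_name) {
  nested <- dt[[col_name]]
  others <- setdiff(names(dt), col_name)
  rows   <- sapply(nested, NROW)

  # lengths() on a list of data.tables gives each table's column count
  ragged <- length(unique(lengths(nested))) > 1L

  # repeat the non-list columns once per row of the matching nested table
  keep <- dt[rep(seq_len(nrow(dt)), times = rows), others, with = FALSE]
  cbind(keep, rbindlist(nested, fill = ragged))
}

df1 <- data.table(a = "a", b = 1)
df2 <- data.table(a = rep("a", 3), b = 1:3, c = 1:3)
nested_df <- data.table(id = 1:2, list_col = list(df1, df2))

res <- unnest_checked(nested_df, "list_col")
print(res)
```

With the example tables from this thread, `c` should come back as NA for the row that came from `df1`.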

leungi commented 4 years ago

@TysonStanley @markfairbanks: thanks for bringing this up again.

I do encounter this quite often as a result of a map_*() workflow for parsing large volumes of messy semi-tabular data, where the column names and ncol vary. Being able to bind everything and then remove non-informative columns based on the amount of parsed data (post-binding) has been very effective.
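
A minimal sketch of that bind-then-prune workflow, assuming "non-informative" means a column that is mostly NA after binding. The `parsed` list holds made-up values standing in for map_*() output, and the `min_coverage` threshold is a hypothetical choice, not something from the original workflow.

```r
library(data.table)

# Stand-ins for parsed results whose columns differ (made-up values)
parsed <- list(
  data.table(a = 1, b = 2),
  data.table(a = 3, b = 4, h = 5),
  data.table(a = 6, b = 7)
)

# Bind everything, filling columns missing from any piece with NA
bound <- rbindlist(parsed, fill = TRUE)

# Share of non-missing values in each column
coverage <- vapply(bound, function(x) mean(!is.na(x)), numeric(1))

# Hypothetical threshold: keep columns filled in at least half the rows
min_coverage <- 0.5
cleaned <- bound[, names(coverage)[coverage >= min_coverage], with = FALSE]
print(cleaned)
```

Here `h` is present in only one of three rows, so it falls under the threshold and is dropped, while `a` and `b` are kept.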