Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.59k stars 978 forks source link

Create an efficient unnest function #2146

Open statquant opened 7 years ago

statquant commented 7 years ago

Similarly to tidyr:::unnest data.table would benefits from a fast unnest function. I found

is there something canonical ? if not that a FR !

MichaelChirico commented 7 years ago

Also:

https://stackoverflow.com/questions/44336733/ https://stackoverflow.com/q/48831637/3576984


Suggestion to make such a function more flexible than unnest as well; building on the example in the linked SO Q, consider:

dt1 <- data.table(
  colA=   c('A1','A2','A3'), 
  colB=list('B1',c('B2a','B2b'),'B3'),
  colC=list(c('C1a', 'C1b'),'C2','C3'), 
  colD=   c('D1','D2','D3')
)

tidyr::unnest(dt1) is an error, but a "cross-join" at rows with mis-matched lengths is probably appropriate.

arunsrinivasan commented 7 years ago

Yes, this would be useful, agree. As I wrote under https://github.com/Rdatatable/data.table/issues/2159, I remember implementing unwrap() sometime ago, and I prefer it to unnest, if that's all fine. Marking #2159 as duplicate.

franknarf1 commented 5 years ago

Some other possible extensions / convenience features from this SO q: https://stackoverflow.com/q/56981960

# example with a list of nested DFs
library(data.table)
DT <- data.table(colA=   c('A1','A2','A3'), 
    colB=list(data.frame(), 
        data.frame(colsubB1=c('B2a','B2b'),colsubB2=c('B2c', 'B2d')), 
        data.frame(colsubB1=c('A3a','A3b'),colsubB2=c('A3c', 'A3d'))),
    colC=   c('C1','C2','C3'), 
    colD=   c('D1','D2','D3')

DT[, lens := sapply(colB, nrow)]

#    colA         colB colC colD lens
# 1:   A1 <data.frame>   C1   D1    0
# 2:   A2 <data.frame>   C2   D2    2
# 3:   A3 <data.frame>   C3   D3    2

# desired output...
# preserves length == 0 elements filled with NA (instead of dropping)
# drops the columns in place (instead of at the end)

#    colA colsubB1 colsubB2 colC colD lens
# 1:   A1     <NA>     <NA>   C1   D1    0
# 2:   A2      B2a      B2c   C2   D2    2
# 3:   A2      B2b      B2d   C2   D2    2
# 4:   A3      A3a      A3c   C3   D3    2
# 5:   A3      A3b      A3d   C3   D3    2

# versus tidyr::unnest
tidyr::unnest(DT)
#    colA colC colD lens colsubB1 colsubB2
# 1:   A2   C2   D2    2      B2a      B2c
# 2:   A2   C2   D2    2      B2b      B2d
# 3:   A3   C3   D3    2      A3a      A3c
# 4:   A3   C3   D3    2      A3b      A3d
r2evans commented 4 years ago

As a mod to @franknarf1 's code, the empty row can be preserved

tidyr::unnest(DT, colB, keep_empty = TRUE)
# # A tibble: 5 x 5
#   colA  colsubB1 colsubB2 colC  colD 
#   <chr> <chr>    <chr>    <chr> <chr>
# 1 A1    <NA>     <NA>     C1    D1   
# 2 A2    B2a      B2c      C2    D2   
# 3 A2    B2b      B2d      C2    D2   
# 4 A3    A3a      A3c      C3    D3   
# 5 A3    A3b      A3d      C3    D3   
aourednik commented 2 years ago

This seems to work quite efficiently (my test with 1 million rows of similar structure took about a minute)

dt <- data.table(A=c(1,2,3),B=list(c("A","B","C"),"D",c("E","F"))
dt.flat <- dt[,unlist(B),by=A]
m-muecke commented 2 months ago

For reference the unnest implementation in mlr3misc: https://mlr3misc.mlr-org.com/reference/unnest.html