TysonStanley / tidyfast

Fast and efficient alternatives to tidyr functions built on data.table #rdatatable #rstats
https://tysonbarrett.com/tidyfast/

dt_separate #48

Open SebKrantz opened 3 years ago

SebKrantz commented 3 years ago

Hi Tyson,

I have been playing around with your package a bit. I like that it works well for the most part and that it really only depends on data.table and Rcpp. However, I have found that dt_separate with tidyfast 0.2.1 from CRAN does not work for my use case. I have not tried the development version yet.

library(tidyfast)
library(data.table)
data <- readRDS(url("https://shiny.rstudio.com/tutorial/written-tutorial/lesson5/census-app/data/counties.rds"))
head(data)
#>              name total.pop white black hispanic asian
#> 1 alabama,autauga     54571  77.2  19.3      2.4   0.9
#> 2 alabama,baldwin    182265  83.5  10.9      4.4   0.7
#> 3 alabama,barbour     27457  46.8  47.8      5.1   0.4
#> 4    alabama,bibb     22915  75.0  22.9      1.8   0.1
#> 5  alabama,blount     57322  88.9   2.5      8.1   0.2
#> 6 alabama,bullock     10914  21.9  71.0      7.1   0.2
setDT(data)
# Apply separate
dt_separate(data, name, c("state", "county"))
# Nothing happens: data is unchanged
head(data)
#>               name total.pop white black hispanic asian
#> 1: alabama,autauga     54571  77.2  19.3      2.4   0.9
#> 2: alabama,baldwin    182265  83.5  10.9      4.4   0.7
#> 3: alabama,barbour     27457  46.8  47.8      5.1   0.4
#> 4:    alabama,bibb     22915  75.0  22.9      1.8   0.1
#> 5:  alabama,blount     57322  88.9   2.5      8.1   0.2
#> 6: alabama,bullock     10914  21.9  71.0      7.1   0.2

# tidyr works fine
tidyr::separate(data, name, c("state", "county"))
#> Warning: Expected 2 pieces. Additional pieces discarded in 645 rows [25, 58, 79,
#> 111, 122, 143, 152, 163, 164, 165, 175, 191, 192, 193, 194, 195, 196, 197, 198,
#> 199, ...].
#>         state   county total.pop white black hispanic asian
#>    1: alabama  autauga     54571  77.2  19.3      2.4   0.9
#>    2: alabama  baldwin    182265  83.5  10.9      4.4   0.7
#>    3: alabama  barbour     27457  46.8  47.8      5.1   0.4
#>    4: alabama     bibb     22915  75.0  22.9      1.8   0.1
#>    5: alabama   blount     57322  88.9   2.5      8.1   0.2
#>   ---                                                      
#> 3078: wyoming    teton     21294  82.2   1.9     15.0   1.1
#> 3079: wyoming    uinta     21118  88.5   2.3      8.8   0.3
#> 3080: wyoming washakie      8533  83.9   2.6     13.6   0.6
#> 3081: wyoming   weston      7208  93.8   2.0      3.0   0.3
#> 3082:     new   mexico     76569  36.2   5.4     58.3   0.5

# Strangely, this appears to duplicate the information instead of splitting it
dt_separate(data, name, c("state", "county"), immutable = FALSE)
head(data)
#>    total.pop white black hispanic asian           state          county
#> 1:     54571  77.2  19.3      2.4   0.9 alabama,autauga alabama,autauga
#> 2:    182265  83.5  10.9      4.4   0.7 alabama,baldwin alabama,baldwin
#> 3:     27457  46.8  47.8      5.1   0.4 alabama,barbour alabama,barbour
#> 4:     22915  75.0  22.9      1.8   0.1    alabama,bibb    alabama,bibb
#> 5:     57322  88.9   2.5      8.1   0.2  alabama,blount  alabama,blount
#> 6:     10914  21.9  71.0      7.1   0.2 alabama,bullock alabama,bullock

Created on 2021-08-02 by the reprex package (v0.3.0)
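
As an aside, the split itself is straightforward in plain data.table via tstrsplit, and is what I would expect dt_separate to produce. A minimal workaround sketch (run against a fresh copy of the original data, so the objects in the reprex above are unaffected):

dt <- copy(data)  # the original data, before the dt_separate calls above
dt[, c("state", "county") := tstrsplit(name, ",", fixed = TRUE)]
dt[, name := NULL]
head(dt)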

Then, I see you use data.table::copy by default (unless immutable = FALSE). This seems highly inefficient to me, as copy() deep-copies the whole table, whereas relying on base R's copy-on-modify semantics would be much cheaper. More generally, I think there is somewhat of an issue with data.table having propagated the notion that copies in base R are inefficient. They have not been since R 3.5.0, where shallow copies were introduced in base R. To prove the point:

library(data.table)
library(collapse)
library(microbenchmark)

dat <- qDT(list(x = rnorm(1e8)))  
# ftransform returns a shallow copy; the function is written entirely in base R
microbenchmark(ftransform(dat, y = x + 1), times = 5)
#> Unit: milliseconds
#>                        expr      min       lq     mean   median      uq
#>  ftransform(dat, y = x + 1) 540.7881 563.6548 630.8815 579.5532 615.984
#>       max neval
#>  854.4274     5

# Modify by reference
microbenchmark(dat[, y := x + 1], times = 5)
#> Unit: milliseconds
#>                   expr      min       lq     mean   median       uq      max
#>  dat[, `:=`(y, x + 1)] 464.3758 629.7212 664.0154 664.0158 724.2668 837.6971
#>  neval
#>      5

# The cost of a shallow copy in base R:
tracemem(dat)  # Tracing memory
#> [1] "<0000000007A66F20>"
# This makes a shallow copy
oldClass(dat) <- c("data.table", "data.frame")
#> tracemem[0x0000000007a66f20 -> 0x00000000101ae658]
dat[, y := x + 1] # data.table also detects the shallow copy (this gives a warning, though I don't know why it is not shown here)
untracemem(dat)
# Let's benchmark this
v <- c("data.table", "data.frame")
microbenchmark(oldClass(dat) <- v)
#> Unit: microseconds
#>                expr   min    lq    mean median    uq    max neval
#>  oldClass(dat) <- v 1.338 1.339 1.85214  1.785 1.785 22.758   100

# This creates two shallow copies + overallocation of 100 columns (to trick data.table into thinking 
# the table was not copied, and to be able to add columns by reference into empty column pointers using := afterwards)
tracemem(dat)
#> [1] "<000000001059BB98>"
dat <- qDT(dat)
#> tracemem[0x000000001059bb98 -> 0x00000000101ae118]: qDT_raw alc qDT 
#> tracemem[0x00000000101ae118 -> 0x00000000101ae218]: qDT_raw alc qDT 
dat[, y := x + 1] # Allows me to do this without a warning. 
untracemem(dat)
# Cost: 
microbenchmark(dat <- qDT(dat))
#> Unit: microseconds
#>             expr   min    lq     mean median    uq     max neval
#>  dat <- qDT(dat) 4.462 4.909 13.88302 5.8015 9.371 655.538   100

# Or better: 
alc <- collapse:::alc
microbenchmark(dat <- alc(dat))
#> Unit: microseconds
#>             expr   min    lq    mean median    uq    max neval
#>  dat <- alc(dat) 2.231 2.677 3.01686  2.678 2.678 30.791   100

Created on 2021-08-02 by the reprex package (v0.3.0)

In fact, Matt will probably disagree with me for one reason or another, but if I were to redesign data.table with R 3.5.0 in place, I'd get rid of the whole mechanism for avoiding shallow copies (overallocated data.tables, ".internal.selfref" attributes, etc.) and just focus on avoiding deep copies. I believe the only place where there are significant gains from avoiding shallow copies in R is inside tight loops, such as the example given here of looping over data.frame subsets (and I think even in that case [[.data.frame itself probably costs much more time than the shallow copies it creates). So in summary: I think doing this without data.table::copy will be much faster at any data size, as sketched below.
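
To make this concrete, here is a minimal sketch of what an immutable dt_separate could look like without copy(). This is my own illustration rather than tidyfast's internals: dt_separate_sketch is a hypothetical name, and I am simply assuming the qDT shallow copy and tstrsplit used above.

library(data.table)
library(collapse)

# hypothetical sketch, not tidyfast's actual implementation
dt_separate_sketch <- function(d, col, into, sep = ",") {
  out <- qDT(d)  # shallow copy: microseconds, no column data is duplicated
  out[, (into) := tstrsplit(get(col), sep, fixed = TRUE)]
  out[, (col) := NULL]  # removes only out's column pointer; d keeps its column
  out[]
}
# usage: dt_separate_sketch(data, "name", c("state", "county"))

The up-front cost of this version is a few microseconds regardless of the number of rows, while data.table::copy scales with the size of the table.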

Finally, talking about loops, I just had a brief glance at the C++ code. In lines 52-62 of fill.cpp: if you have no particular reason to call STRING_ELT on every iteration, I'd create string pointers once, SEXP* xin = STRING_PTR(x); SEXP* xout = STRING_PTR(out);, and then index those pointers as in the other loops. STRING_ELT obtains the pointer each time and uses it to subset, so with STRING_PTR you can simply hoist that step out of the loop.