Hi Tyson,

I have been playing around a bit with your package. I like that, for the most part, it works fine and that it really only depends on data.table and Rcpp. However, I have found that `dt_separate` with tidyfast 0.21 from CRAN does not work for my use case. I have not tried the development version yet.
Then, I see you use `data.table::copy` by default if `immutable = FALSE`. This appears highly inefficient to me, as `copy` copies the whole table, whereas base R's copy-on-modify semantics would be much more efficient. In general, I think there is somewhat of an issue around data.table having propagated the notion that copies in base R are inefficient. They are not anymore since R 3.5.0, where shallow copies were introduced in base R. To prove the point:
```r
library(data.table)
library(collapse)
library(microbenchmark)

dat <- qDT(list(x = rnorm(1e8)))

# Returns a shallow copy; the function is written entirely in base R
microbenchmark(ftransform(dat, y = x + 1), times = 5)
#> Unit: milliseconds
#>                        expr      min       lq     mean   median      uq
#>  ftransform(dat, y = x + 1) 540.7881 563.6548 630.8815 579.5532 615.984
#>       max neval
#>  854.4274     5

# Modify by reference
microbenchmark(dat[, y := x + 1], times = 5)
#> Unit: milliseconds
#>                   expr      min       lq     mean   median       uq      max
#>  dat[, `:=`(y, x + 1)] 464.3758 629.7212 664.0154 664.0158 724.2668 837.6971
#>  neval
#>      5

# The cost of a shallow copy in base R:
tracemem(dat) # Tracing memory
#> [1] "<0000000007A66F20>"

# This makes a shallow copy
oldClass(dat) <- c("data.table", "data.frame")
#> tracemem[0x0000000007a66f20 -> 0x00000000101ae658]
dat[, y := x + 1] # data.table also found it (gives a warning, I don't know why it is not shown here)
untracemem(dat)

# Let's benchmark this
v <- c("data.table", "data.frame")
microbenchmark(oldClass(dat) <- v)
#> Unit: microseconds
#>                expr   min    lq    mean median    uq    max neval
#>  oldClass(dat) <- v 1.338 1.339 1.85214  1.785 1.785 22.758   100

# This creates two shallow copies + overallocation of 100 columns (to be able
# to trick data.table into thinking the table was not copied, and to be able to
# add columns by reference into empty column pointers using := afterwards)
tracemem(dat)
#> [1] "<000000001059BB98>"
dat <- qDT(dat)
#> tracemem[0x000000001059bb98 -> 0x00000000101ae118]: qDT_raw alc qDT
#> tracemem[0x00000000101ae118 -> 0x00000000101ae218]: qDT_raw alc qDT
dat[, y := x + 1] # Allows me to do this without a warning.
untracemem(dat)

# Cost:
microbenchmark(dat <- qDT(dat))
#> Unit: microseconds
#>             expr   min    lq     mean median    uq     max neval
#>  dat <- qDT(dat) 4.462 4.909 13.88302 5.8015 9.371 655.538   100

# Or better:
alc <- collapse:::alc
microbenchmark(dat <- alc(dat))
#> Unit: microseconds
#>             expr   min    lq    mean median    uq    max neval
#>  dat <- alc(dat) 2.231 2.677 3.01686  2.678 2.678 30.791   100
```
In fact, Matt will probably disagree with me for one reason or another, but if I were to redesign data.table with R 3.5.0 in place, I'd get rid of the whole mechanism for avoiding shallow copies (the overallocated data.tables, the `.internal.selfref` attribute, etc.) and just focus on avoiding deep copies. I believe the only place where there are significant gains from avoiding shallow copies in R is inside tight loops, such as the example given here of looping over data.frame subsets (and I think even in that case `[[.data.frame` itself probably costs much more time than the shallow copies it creates). So in summary: I think doing this without `data.table::copy` will be much faster at any data size.
Finally, talking about loops: I just had a brief glance at the C code. In lines 52-62 of fill.cpp, if you have no particular reason to call `STRING_ELT` every time, I'd also create string pointers `SEXP* xin = STRING_PTR(x); SEXP* xout = STRING_PTR(out);` and then index the pointers as in the other loops. `STRING_ELT` computes the element pointer on every call and uses it to subset, so you can simply hoist that step out of the loop with `STRING_PTR`.