Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.62k stars 985 forks source link

Shift by reference #6303

Open NicChr opened 4 months ago

NicChr commented 4 months ago

Hello, firstly I'm a huge fan of data.table and its set functions which is what inspired me to write a lag function that can lag/lead vectors and data frames by reference (i.e no copies).

My question is: would this functionality be desirable and better suited for data.table? I'm sure there's a case to be made that this can drastically improve performance, especially for shifting large data.tables.

A small example to showcase that it works as intended.

library(cheapr)
library(data.table)

dt <- data.table(x = seq(1, 10^6, 1),
                 y = rep_len(letters, 10^6))

dt_lag1 <- setDT(shift(dt, 3))
setnames(dt_lag1, c("x", "y"))
dt # Original dt unchanged (as expected)
#>                x      y
#>            <num> <char>
#>       1:       1      a
#>       2:       2      b
#>       3:       3      c
#>       4:       4      d
#>       5:       5      e
#>      ---               
#>  999996:  999996      j
#>  999997:  999997      k
#>  999998:  999998      l
#>  999999:  999999      m
#> 1000000: 1000000      n

dt_lag2 <- lag_(dt, 3, set = TRUE)
dt # Shifted by reference
#>               x      y
#>           <num> <char>
#>       1:     NA   <NA>
#>       2:     NA   <NA>
#>       3:     NA   <NA>
#>       4:      1      a
#>       5:      2      b
#>      ---              
#>  999996: 999993      g
#>  999997: 999994      h
#>  999998: 999995      i
#>  999999: 999996      j
#> 1000000: 999997      k

identical(dt_lag1, dt_lag2)
#> [1] TRUE

In the above I compare the output of data.table::shift() with cheapr::lag_() to verify they both produce the same thing.

The C code for lagging by reference can be found here: https://github.com/NicChr/cheapr/blob/main/src/lag.cpp Basically the way I do it is by creating a temporary vector equal to the size of the lag you want and keep updating that as we move through the vector so that the necessary data is kept to lag by reference. Since most lags are typically small, this turns out to be quite memory efficient.

A small caveat is that this currently works only for non-cyclic lags and leads.

Let me know if this is of interest to the data.table package and I'd be happy to help with implementation if so, thanks!

ben-schwen commented 4 months ago

Hi Nick, thanks for your report.

Could you please elaborate how your approach is different to dt[, names(.SD) := lapply(.SD, shift, 3, type="lag")]. Could you also provide benchmarks whether your approach is faster/more memory efficient? We already use memmove for shift so it will be hard to beat that.

Note that shift also has its own per group optimized operation so any change should also be faster for many groups

NicChr commented 4 months ago

Hi Ben, happy to provide some more detail. I may be misunderstanding the way shift works but it's my understanding that lag_() is similar to setnafill() but instead of filling NA values in place, the vector values are lagged in place. shift() on the other hand seems to allocate new vectors.

Please see the below benchmark for a comparison. I took copies of dt so the outputs can be repeatedly compared in the comparison function.

library(data.table)
library(cheapr)
library(bench)

packageVersion("data.table")
#> [1] '1.15.99'
packageVersion("cheapr")
#> [1] '0.9.2'

dt <- data.table(x = seq(1, 10^8, 1),
                 y = rep_len(letters, 10^8))

format(object.size(dt), "auto") # Size of copy
#> [1] "1.5 Gb"
mark(
  alternative = lag_(copy(dt), n = 3, set = TRUE),
  DT = copy(dt)[, names(.SD) := lapply(.SD, shift, 3, type="lag"), .SDcols = names(dt)],
  iterations = 5
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 alternative    1.56s    1.75s     0.570    1.49GB    0.456
#> 2 DT             2.84s    3.06s     0.329    4.47GB    0.461

Subtracting the 1.5GB for the copy, the DT method uses ~ 3GB and cheapr ~ 0GB. The alternative is slightly faster here but not by much, what is significant is that cheapr::lag_(set = T) doesn't allocate much additional memory for the operation.

In terms of a grouped approach, I have a similar function in another package timeplyr, timeplyr::roll_lag() which uses a group-optimised approach through the use of cheapr::lag2_() . It calculates the order of groups and the group sizes (using collapse) and feeds that to lag2_() though I'm sure the order vector and group sizes vector can be efficiently calculated using data.table as well.

The big caveat with lag2_() is that it can't update by reference, though it has the advantage that it can accept a vector of lags, an optional ordering vector to run through the lags, and a vector of run lengths such that it 'resets' after every run length.

dt[, group := sample(c("a", "b", "c"), 10^8, TRUE)]
dt[, group2 := sample.int(10^6, 10^8, TRUE)]

library(timeplyr)

mark(
  DT = copy(dt)[, names(.SD) := lapply(.SD, shift, 3, type="lag"), .SDcols = names(dt), by = group],
  alternative = roll_lag(copy(dt), 3, g = dt$group),
  iterations = 5
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 DT             7.32s    7.34s     0.104   10.31GB    0.229
#> 2 alternative    3.95s    4.22s     0.233    5.59GB    0.233

mark(
  DT = copy(dt)[, names(.SD) := lapply(.SD, shift, 3, type="lag"), .SDcols = names(dt), by = group2],
  alternative = roll_lag(copy(dt), 3, g = dt$group2),
  iterations = 5
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 DT             22.2s    23.6s    0.0409    8.97GB   0.0900
#> 2 alternative      16s    17.2s    0.0581    5.59GB   0.0697

# Checking that GForce is on
options(datatable.verbose = TRUE)
copy(dt)[, names(.SD) := lapply(.SD, shift, 3, type="lag"), .SDcols = names(dt), by = group]
#> Finding groups using forderv ... forder.c received 100000000 rows and 1 columns
#> 0.280s elapsed (1.020s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> Getting back original order ... forder.c received a vector type 'integer' length 3
#> 0.000s elapsed (0.000s cpu) 
#> lapply optimization changed j from 'lapply(.SD, shift, 3, type = "lag")' to 'list(shift(x, 3, type = "lag"), shift(y, 3, type = "lag"), shift(group, 3, type = "lag"), shift(group2, 3, type = "lag"))'
#> GForce optimized j to 'list(gshift(x, 3, type = "lag"), gshift(y, 3, type = "lag"), gshift(group, 3, type = "lag"), gshift(group2, 3, type = "lag"))' (see ?GForce)
#> Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.057
#> gforce assign high and low took 0.992
#> gforce eval took 2.352
#> 3.920s elapsed (7.360s cpu) 
#> Assigning to 100000000 row subset of 100000000 rows
#> RHS_list_of_columns == true
copy(dt)[, names(.SD) := lapply(.SD, shift, 3, type="lag"), .SDcols = names(dt), by = group2]
#> Finding groups using forderv ... forder.c received 100000000 rows and 1 columns
#> 1.280s elapsed (1.510s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.020s elapsed (0.000s cpu) 
#> Getting back original order ... forder.c received a vector type 'integer' length 1000000
#> 0.030s elapsed (0.040s cpu) 
#> lapply optimization changed j from 'lapply(.SD, shift, 3, type = "lag")' to 'list(shift(x, 3, type = "lag"), shift(y, 3, type = "lag"), shift(group, 3, type = "lag"), shift(group2, 3, type = "lag"))'
#> GForce optimized j to 'list(gshift(x, 3, type = "lag"), gshift(y, 3, type = "lag"), gshift(group, 3, type = "lag"), gshift(group2, 3, type = "lag"))' (see ?GForce)
#> Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.084
#> gforce assign high and low took 1.847
#> gforce eval took 8.334
#> 10.4s elapsed (11.6s cpu) 
#> Assigning to 100000000 row subset of 100000000 rows
#> RHS_list_of_columns == true

Created on 2024-07-21 with reprex v2.1.0

The alternative by-group here is a bit faster and uses less memory. Could be I'm misunderstanding something significant here but look forward to your comments.

ben-schwen commented 4 months ago

Hey Nick,

thanks for the extensive post. I agree that its not easy to provide fair benchmarks but we should at least try do so. In the case of the first example this would correspond to either use lag_ within j of DT[i,j,by] or to use shift outside e.g.

library(data.table)
library(cheapr)
library(bench)

N = 1e8
dt <- data.table(x = seq(1, N, 1),
                 y = rep_len(letters, N))

mark(
  alternative = copy(dt)[, lag_(.SD, n=3, set=TRUE)],
  DT = copy(dt)[, names(.SD) := lapply(.SD, shift, 3, type="lag")],
  iterations = 5
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression       min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 alternative     1.5s    1.56s     0.642    3.03GB    0.642
#> 2 DT             1.35s    1.46s     0.700    4.47GB    1.12

mark(
  alternative = lag_(copy(dt), n=3),
  setNames(setDT(shift(copy(dt), n=3)), c("x", "y")),
  iterations = 5 
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression                           min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                         <bch> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 "alternative"                      766ms    1.05s      1.00    2.98GB     1.40
#> 2 "setNames(setDT(shift(copy(dt), n… 859ms 957.11ms      1.04    2.98GB     1.25

otherwise we merely benchmark the overhead of evaluing something inside a data.table. Same holds for the 2nd example.

jangorecki commented 4 months ago

shift by reference avoid data copy and can remedy for OOM error when working with data that barely fit into RAM. +1

For the moment you can just use set in a loop over columns to shift, and shift column by column.

MichaelChirico commented 4 months ago

For the moment you can just use set in a loop over columns to shift, and shift column by column.

Yes, otherwise the full RHS of := must be put in memory first: #4225

avoid data copy and can remedy for OOM error when working with data that barely fit into RAM.

This is true of most operations, though, so I think we should reserve the effort for high-frequency examples like setnafill(), very common operation.

I only see 40 cases of lapply(.SD, shift on GitHub: https://github.com/search?q=lang%3AR+%2Flapply%5C%28%5Cs*%5B.%5DSD%2C%5Cs*shift%5Cb%2F&type=code

Doing pure in-memory shift, i.e. moving the pointer for the column by 1 (or n), will be very efficient but comes at the cost of potentially confusing behavior if the vector is kept anywhere else in memory. E.g.

DT = fread(...)

shiftDT = pure_reference_shift(DT)

now any updates on shiftDT will also affect DT. This has all sorts of pitfalls that we experience on this issue tracker regularly.