Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.57k stars 974 forks source link

implement frev() - base::rev() that allocates less #5885

Open jangorecki opened 8 months ago

jangorecki commented 8 months ago
          Since we know now how costly negation is, what about reverting in-place when possible? Reverting with copy costs 1/2  of `rev`. Reverting in-place costs 1/10 of `rev` (no parallelization).
x = rnorm(1e8)
system.time(t1 <- rev(x))
#   user  system elapsed 
#  0.481   0.212   0.693 
system.time(t2 <- reverse(x, copy=TRUE))
#   user  system elapsed 
#  0.117   0.216   0.334 
system.time(t3 <- reverse(x, copy=FALSE))
#   user  system elapsed 
#  0.060   0.000   0.061 
identical(t1, t2)
identical(t1, t3)

x = rnorm(1e8)
n = rep(1e2, 1e8)
system.time(a1 <- frollmax(x, n, align="left", adaptive=TRUE))
#    user  system elapsed 
#  14.330   3.147   7.669
system.time(a2 <- reverse(frollmax(reverse(x, copy=TRUE), reverse(n, copy=TRUE), align="right", adaptive=TRUE), copy=FALSE))
#    user  system elapsed 
#   9.092   0.807   2.687 
identical(a1, a2)

_Originally posted by @ben-schwen in https://github.com/Rdatatable/data.table/pull/5441#discussion_r1443843574_

jangorecki commented 8 months ago

And possibly many functions as well. whenever rev() is used we could cut its time to 1/2 or even 1/10 when it's safe to rev in place.

jangorecki commented 8 months ago

I turned the issue into more generic one. We could always save at least one allocation, of indices, and sometimes even of the results.

MichaelChirico commented 8 months ago

Noting a related r-devel bug to support ALTREP for rev():

https://bugs.r-project.org/show_bug.cgi?id=18406

MichaelChirico commented 8 months ago

We can use {lintr} to quickly identify usages inside the package (with more confidence over grep since {lintr} is AST-aware):

lintr::lint_package(linters = lintr::undesirable_function_linter(c(rev = NA)))
R/as.data.table.R:37:9: style: [undesirable_function_linter] Function "rev" is undesirable.
  val = rev(dimnames(provideDimnames(x)))
        ^~~
R/as.data.table.R:39:39: style: [undesirable_function_linter] Function "rev" is undesirable.
    setattr(val, 'names', paste0("V", rev(seq_along(val))))
                                      ^~~
R/as.data.table.R:41:22: style: [undesirable_function_linter] Function "rev" is undesirable.
  setcolorder(ans, c(rev(head(names(ans), -1L)), "N"))
                     ^~~
R/as.data.table.R:102:9: style: [undesirable_function_linter] Function "rev" is undesirable.
  val = rev(val)
        ^~~
R/as.data.table.R:104:39: style: [undesirable_function_linter] Function "rev" is undesirable.
    setattr(val, 'names', paste0("V", rev(seq_along(val))))
                                      ^~~
R/as.data.table.R:106:93: style: [undesirable_function_linter] Function "rev" is undesirable.
    stop("Argument 'value.name' should not overlap with column names in result: ", brackify(rev(names(val))))
                                                                                            ^~~
R/as.data.table.R:112:10: style: [undesirable_function_linter] Function "rev" is undesirable.
  dims = rev(head(names(ans), -1L))
         ^~~
R/bmerge.R:92:100: style: [undesirable_function_linter] Function "rev" is undesirable.
      if (xclass=="integer64") { w=i; wc=ic; wclass=iclass; } else { w=x; wc=xc; wclass=xclass; nm=rev(nm) }  # w is which to coerce
                                                                                                   ^~~
R/utils.R:43:27: style: [undesirable_function_linter] Function "rev" is undesirable.
  length(x) - match(TRUE, rev(x)) + 1L
                          ^~~
tdhock commented 8 months ago

@anirban166 @DorisAmoakohene please investigate the memory savings using atime