HenrikBengtsson / Wishlist-for-R

Features and tweaks to R that I and others would love to see - feel free to add yours!
https://github.com/HenrikBengtsson/Wishlist-for-R/issues
GNU Lesser General Public License v3.0
133 stars, 4 forks

WISH: Control over rownames creation in data.frame subsetting (massive speed-up) #164

Open mayer79 opened 3 months ago

mayer79 commented 3 months ago

In contrast to matrices, replicating rows in data.frames is very slow. The bottleneck is the check for, and creation of, unique row names. In many situations one does not care about row names, so it would be convenient to pass an ignore.row.names = TRUE argument to the subsetting method [.data.frame.

Example:

library(bench)

df = iris[1:4]
M = data.matrix(df)

row_id = rep(1:150, each = 1000)

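# Subset each column directly and attach compact integer row names,
# skipping the row-name uniqueness check done by `[.data.frame`.
# Assumes i is a vector of positive integer row positions.
fast_row_subset_df <- function(x, i) {
  # Matrix-like columns are subset by row, ordinary vectors by element
  out <- lapply(x, function(z) if (length(dim(z)) != 2L) z[i] else z[i, , drop = FALSE])
  attr(out, "row.names") <- .set_row_names(length(i))
  class(out) <- "data.frame"
  out
}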

bench::mark(
  df[row_id, ],
  M[row_id, ],
  fast_row_subset_df(df, row_id),
  check = FALSE  # results differ in class and row names, so skip the equality check
)

[Image: benchmark results from the bench::mark() call above]
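To make the row-name work visible, here is a small sketch of what the default method does when a row index is repeated. It uses only base R and the built-in iris data; the variable names are mine.

```r
df <- iris[1:4]
i <- c(1L, 1L, 2L)   # row 1 is requested twice

sub <- df[i, ]

# The default method makes the repeated row name unique
rownames(sub)
# "1" "1.1" "2"

# The column values themselves are just plain vector subsets
identical(sub$Sepal.Length, df$Sepal.Length[i])
# TRUE
```

With 150,000 replicated rows as in the benchmark, that de-duplication step is what I would like to be able to skip.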

The API of [ could be:

`[.data.frame` <- function (x, i, j, drop = if (missing(i)) TRUE else length(cols) == 1, ignore.row.names = FALSE) {
   ...
}
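Until something like this exists in base R, the intended behavior can be roughly emulated with a wrapper. This is only a sketch: subset_rows() is a hypothetical helper name, not an existing function, and it assumes i contains positive integer row positions (no zeros, negatives, logicals or NAs).

```r
# Hypothetical helper (not part of base R) that emulates the proposed
# ignore.row.names argument for row subsetting.
subset_rows <- function(x, i, ignore.row.names = FALSE) {
  stopifnot(is.data.frame(x))
  if (!ignore.row.names) {
    # Fall back to the default method, which keeps row names unique
    return(x[i, , drop = FALSE])
  }
  # Assumption: i holds valid positive row positions
  stopifnot(is.numeric(i), !anyNA(i), all(i >= 1), all(i <= nrow(x)))
  out <- lapply(x, function(z) {
    if (length(dim(z)) == 2L) z[i, , drop = FALSE] else z[i]
  })
  # Compact integer row names 1..length(i); no uniqueness check needed
  attr(out, "row.names") <- .set_row_names(length(i))
  class(out) <- "data.frame"
  out
}

df <- iris[1:4]
i <- rep(1:3, each = 2)

a <- subset_rows(df, i)
b <- subset_rows(df, i, ignore.row.names = TRUE)

rownames(a)  # "1" "1.1" "2" "2.1" "3" "3.1"
rownames(b)  # "1" "2" "3" "4" "5" "6"
identical(a$Sepal.Length, b$Sepal.Length)  # TRUE
```

Note that the fast path always returns a plain data.frame, so any additional classes on x are dropped. A real implementation inside [.data.frame would also have to handle column selection, drop, and the other index types.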