Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

Speed up logical indexing with multiple conditions #4105

Open · ben519 opened this issue 4 years ago

ben519 commented 4 years ago

Have a look at this example

library(data.table)  # 1.12.6

foo <- data.table(
  x = as.character(runif(n = 10^7)),
  y = as.character(runif(n = 10^7)),
  z = as.character(runif(n = 10^7))
)

system.time(foo[like(x, "123") & like(y, "123") & like(z, "123")])
   user  system elapsed 
 11.141   0.057  11.231 

system.time(foo[like(x, "123")][like(y, "123")][like(z, "123")])
   user  system elapsed 
  3.773   0.021   3.799

As shown, I get a big speedup when I use successive logical conditions rather than and'ing them together to make a single index. This implies that data.table is checking each condition for each row rather than checking conditions lazily. Is this by design? This seems like an opportunity for a nice speedup.

jangorecki commented 4 years ago

You are not really building a single index but rather a single map (TRUE/FALSE); an index is what you get when you call foo[i, which=TRUE] to obtain integer row locations. In the first example you create 3 maps of size 10^7, then AND them sequentially into a single map of 10^7. As a result you materialize 5 maps of size 10^7 (because & takes only two maps at a time). In the second example only the first map has size 10^7; all the remaining ones are smaller (in most cases). Such an optimization is not an obvious win, because if like matches all rows (10^7 TRUE values) we might add overhead: we would need to copy all the data and then again make a map of 10^7 for the second like call. The safe thresholds for such an optimization would have to be researched carefully. A further speed-up would be possible by using an index rather than a map (integer row ids rather than a logical vector); there is a related issue, https://github.com/Rdatatable/data.table/issues/3663, which might be useful for that (eventually adding which_like). ANDing indexes (with the intersect function) is not a cheap computation either, but it could be optimized internally knowing that indexes are sorted.
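
To make the allocation count concrete, here is a rough sketch of what the single-map form materializes (the m1..m5 names are just illustrative, not data.table internals):

m1 <- like(foo$x, "123")  # map 1: 10^7 logicals
m2 <- like(foo$y, "123")  # map 2
m3 <- like(foo$z, "123")  # map 3
m4 <- m1 & m2             # map 4: '&' is binary, so an intermediate map is materialized
m5 <- m4 & m3             # map 5: the final map passed to [
foo[m5]                   # whereas foo[like(x,"123")][...] shrinks the data after the first filter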

st-pasha commented 4 years ago

In Python we implemented short-circuiting semantics for & and | when the arguments are of logical type. Thus, if like(x, "123") evaluates to false, the second (and the third) are not evaluated at all, reducing the computation time.
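
A rough R analogy of that short-circuit (a sketch only, not the Python datatable implementation): each later condition is evaluated only on the rows that are still candidates.

keep <- which(like(foo$x, "123"))        # integer positions surviving the first condition
keep <- keep[like(foo$y[keep], "123")]   # second pattern matched only against the survivors
keep <- keep[like(foo$z[keep], "123")]   # likewise for the third
foo[keep]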

MichaelChirico commented 4 years ago

I think this is #2472 redux

ben519 commented 4 years ago

@st-pasha's idea of short-circuiting was my original thought.

On further reflection, NAs could cause some confusion, since lazy evaluation of (TRUE | NA) would evaluate to TRUE but non-lazy evaluation gives NA (which would exclude a record when logical indexing). So using something like && and || for explicit lazy evaluation would be ideal, I think.

jangorecki commented 4 years ago

But && and || have to operate on scalars; overriding that rule inside [ would cause too much confusion.
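
For reference, a quick base-R check of that rule (the exact behaviour depends on the R version):

c(TRUE, FALSE) && c(TRUE, TRUE)
# error in current R releases ('&&' wants length-1 logical operands);
# historical versions silently used only the first elements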

st-pasha commented 4 years ago

@ben519 The "NA" means "not available", aka "missing" value. In other words, it is some kind of value, we just don't know which. Thus, it's a random variable, a Schrodinger cat; not Galactus-devourer-of-the-worlds.

As it happens, if the hidden value of NA is TRUE, then (TRUE | TRUE) == TRUE; and if the hidden value of NA is FALSE, then (TRUE | FALSE) == TRUE. This means (TRUE | NA) == TRUE, because the result does not depend on the value of NA.
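
A quick check of those cases in plain R (no data.table involved):

TRUE  | NA   # TRUE  -- the result is the same whichever value NA hides
FALSE & NA   # FALSE -- likewise
FALSE | NA   # NA    -- here the hidden value would change the answer
TRUE  & NA   # NA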

ColeMiller1 commented 4 years ago

A related StackOverflow post: https://stackoverflow.com/questions/58557831/performance-benefits-of-chaining-over-anding-when-filtering-a-data-table/58559134#58559134

MichaelChirico commented 4 years ago

To me it makes sense for base R to have an n-ary operator &(...); that would allow the short-circuit to happen at C level (currently the short-circuit happens, but only pairwise).

I don't think the current language structure would allow a & b & c to be re-routed to this (evaluation is not "lazy enough", in the sense that it will always be parsed into two nested & calls when it could be optimized into one), but if such an option existed we could potentially use NSE to do so ourselves.
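
A hedged sketch of that NSE idea (nothing that exists in data.table; flatten_and is a hypothetical helper that walks the parsed call):

flatten_and <- function(e) {
  # recursively split a parsed `a & b & c` call into its leaf conditions
  if (is.call(e) && identical(e[[1L]], as.name("&")))
    c(flatten_and(e[[2L]]), flatten_and(e[[3L]]))
  else
    list(e)
}
flatten_and(quote(like(x, "123") & like(y, "123") & like(z, "123")))
# a list of the three like(...) calls, which an n-ary AND could evaluate
# one at a time, restricting each to the rows still surviving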

jangorecki commented 4 years ago

I would add that short-circuit support to #3663; then we also wouldn't have to intersect results from multiple filter queries.

jangorecki commented 4 years ago

I pushed #3663 forward and added short-circuit and like function support. like still goes via the R C API eval(), thus the particular example provided by @ben519 will not shine as nicely as it does for the ==, !=, %in% and !%in% functions. It is in the fwhich branch.

library(data.table)
fwhich = data.table:::fwhich
set.seed(108)
N = 1e8L # 35 GB mem required, 20 threads used
foo = data.table(
  x = as.character(runif(n = N)),
  y = as.character(runif(n = N)),
  z = as.character(runif(n = N))
)
invisible({foo[c(1L,N)]; foo[c(1L,N)]; foo[c(1L,N)]}) # warmup

## easy
system.time(foo[like(x, "123")][like(y, "123")][like(z, "123")])
#   user  system elapsed 
# 34.491   2.200  36.693 
system.time(foo[like(x, "123") & like(y, "123") & like(z, "123")])
#   user  system elapsed 
#102.829   6.408 109.241 
system.time(foo[fwhich(like(x, "123") & like(y, "123") & like(z, "123"))])
#   user  system elapsed 
# 33.188   1.156  34.346 

## hard
system.time(foo[like(x, "*")][like(y, "*")][like(z, "*")])
#   user  system elapsed 
# 82.554   9.927  92.484 
system.time(foo[like(x, "*") & like(y, "*") & like(z, "*")])
#   user  system elapsed 
# 41.066   4.920  45.988 
system.time(foo[fwhich(like(x, "*") & like(y, "*") & like(z, "*"))])
#   user  system elapsed 
# 74.307   8.320  82.324

## !easy
system.time(foo[!like(x, "123")][!like(y, "123")][!like(z, "123")])
#   user  system elapsed 
#151.403  15.151 166.561 
system.time(foo[!like(x, "123") & !like(y, "123") & !like(z, "123")])
#   user  system elapsed 
#109.646   9.911 119.562 
system.time(foo[fwhich(!like(x, "123") & !like(y, "123") & !like(z, "123"))])
#   user  system elapsed 
#142.395  13.652 155.758 

## !hard
system.time(foo[!like(x, "*")][!like(y, "*")][!like(z, "*")])
#   user  system elapsed 
# 10.400   0.812  11.213 
system.time(foo[!like(x, "*") & !like(y, "*") & !like(z, "*")])
#   user  system elapsed 
# 34.604   2.864  37.470 
system.time(foo[fwhich(!like(x, "*") & !like(y, "*") & !like(z, "*"))])
#   user  system elapsed 
# 10.689   0.264  10.953

ben519 commented 4 years ago

Interesting results. Any idea why fwhich() is actually slower than the plain & version in 2 of those 4 tests? I expected fwhich() to be as fast or faster in all cases.

jangorecki commented 4 years ago

Very good question. These are the cases where most (or all) rows are returned by each filter. The ordinary way just materializes 3 logical vectors of length nrow (reducing with '&' results in 2 extra logical vectors). fwhich, on the other hand, does not benefit much from short-circuiting because the number of rows returned is close (or equal) to nrow; your chained approach suffers exactly the same way in this regard. Because fwhich(like) uses the R C API eval(grep) - grep is what the 'like' function calls - it has to return 1-based indices (rather than the 0-based indices a pure C call not involving 'eval' would use). The 0-based indices are needed so it can mix with operators that are well optimized and don't use R C eval, allowing fwhich(x==1 & y%like%2) to be optimized as well. Because of that, decoding from 1-based to 0-based happens multiple times (1-2 times for each 'like' call). Whether your approach suffers from that depends on R's grep internals. The difference is that your approach calls [ three times, whereas fwhich calls it only once. Further improvements are possible by calling the C grep directly, not via the R C API eval; that is how it should be done, actually. I will provide timings for more cases so it will be clearer.

ben519 commented 4 years ago

Can't say I fully understand, but I decided to test this for myself using Rcpp and sure enough, lazy evaluation is (well, can be) slower than normal evaluation.

For anyone interested,

library(Rcpp)

cppFunction('LogicalVector exhaustive_and(LogicalVector x, LogicalVector y) {
  LogicalVector result(x.size());

  for( int i = 0; i < x.size(); i++ ) {
      // '&' is evaluated element-wise: both x[i] and y[i] are always read
      result[i] = x[i] & y[i];
  }

  return result;
}')

cppFunction('LogicalVector lazy_and(LogicalVector x, LogicalVector y) {
  LogicalVector result(x.size());

  for( int i = 0; i < x.size(); i++ ) {
      // '&&' short-circuits: y[i] is read only when x[i] is truthy, adding a branch per element
      result[i] = x[i] && y[i];
  }

  return result;
}')

a <- sample(c(F, T), size = 10^8, replace = TRUE)
b <- sample(c(F, T), size = 10^8, replace = TRUE)

system.time(result_exhaustive <- exhaustive_and(a, b))
# user  system elapsed 
# 0.300   0.002   0.303 

system.time(result_lazy <- lazy_and(a, b))
# user  system elapsed 
# 0.786   0.004   0.793 

jangorecki commented 4 years ago

This issue actually asks for a general optimization of compound AND filters, and uses the like function just as an example. Other operators are definitely in scope of this issue, then. Below is a benchmark for ==, !=, %in% and !%in%. It is important to note that the new WIP fwhich is unacceptably slow in one of the cases.

library(data.table)
options(datatable.auto.index=FALSE)
fwhich = data.table:::fwhich

sample_all = function(unq_n, size) {
  stopifnot(unq_n <= size)
  unq_sub = seq_len(unq_n)
  sample(c(unq_sub, sample(unq_sub, size=max(size-unq_n, 0), replace=TRUE)))
}
set.seed(108)
N = 1e8L # timings for 1e8L, 20 threads, 6 GB mem required
foo = data.table(
  x = sample_all(N/10L, N),
  y = sample_all(N/10L, N),
  z = sample_all(N/10L, N)
)
invisible({foo[c(1L,N)]; foo[c(1L,N)]; foo[c(1L,N)]}) # warmup
if (N==1e6L) { #foo[.N/2L]
  s1 = 80332L; s2 = 8563L; s3 = 63039L
} else if (N==1e8L) {
  s1 = 7065182L; s2 = 7013931L; s3 = 8875689L
}
mid = as.integer(seq(as.integer(N*0.05), as.integer(N*0.95))) # mid ~0.9 obs

## easy
system.time(foo[x==s1][y==s2][z==s3])
#   user  system elapsed 
#  0.429   0.157   0.463 
system.time(foo[x==s1 & y==s2 & z==s3])
#   user  system elapsed 
#  1.051   0.765   1.734 
system.time(foo[fwhich(x==s1 & y==s2 & z==s3)])
#   user  system elapsed 
#  0.071   0.000   0.071 

## hard
system.time(foo[x%in%mid][y%in%mid][z%in%mid])
#   user  system elapsed 
# 36.101   7.376  38.265 
system.time(foo[x%in%mid & y%in%mid & z%in%mid])
#   user  system elapsed 
# 25.043   4.376  28.430 
#system.time(foo[fwhich(x%in%mid & y%in%mid & z%in%mid)])
# TOO SLOW!

## !easy
system.time(foo[x!=s1][y!=s2][z!=s3])
#   user  system elapsed 
#  4.023   5.463   3.308 
system.time(foo[x!=s1 & y!=s2 & z!=s3])
#   user  system elapsed 
#  2.054   2.591   2.551 
system.time(foo[fwhich(x!=s1 & y!=s2 & z!=s3)])
#   user  system elapsed 
#  1.736   1.674   1.144 

## !hard
system.time(foo[!x%in%mid][!y%in%mid][!z%in%mid])
#   user  system elapsed 
# 35.832   7.684  39.046 
system.time(foo[!x%in%mid & !y%in%mid & !z%in%mid])
#   user  system elapsed 
# 25.540   5.131  29.200
system.time(foo[fwhich(!x%in%mid & !y%in%mid & !z%in%mid)])
#   user  system elapsed 
#  1.728   1.651   1.192

Code is in the https://github.com/Rdatatable/data.table/tree/fwhich branch.

ColeMiller1 commented 4 years ago

We could include a counter of the number of times the condition is TRUE. If the counter is above a certain threshold or relative amount, we could warn users and let them know that they can use which, or try to rearrange the conditions so that if they know condition_10 is the most selective, they list it as the first argument, e.g.:


system.time(foo[fwhich(x == 1L & x%in%mid & y%in%mid & z%in%mid)])
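
A hedged sketch of the counter idea (check_selectivity is a hypothetical helper, not part of data.table):

check_selectivity = function(lgl, threshold = 0.5) {
  # count how often the first condition is TRUE and warn when it is not selective
  frac = sum(lgl, na.rm = TRUE) / length(lgl)
  if (frac > threshold)
    warning(sprintf("first condition keeps %.0f%% of rows; consider listing the most selective condition first, or using which()", 100 * frac))
  invisible(lgl)
}
check_selectivity(foo$x %in% mid)  # warns when the first filter would not pay off
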
jangorecki commented 4 years ago

Better to just address this single slow case and eventually have it plugged in so users don't need to know to call fwhich.

ColeMiller1 commented 4 years ago

I like not requiring users to know to call fwhich. Out of curiosity, how would you address this single slow case? It seems like the optimization comes from minimizing the evaluations: in order to know which subsequent operations to perform, you need to store which indices still need to be evaluated.

FWIW, this is what I think fwhich translates into in R for the basic dt[a == 5L & b == 5L & c == 5L]:

library(data.table)
dt = data.table(a = sample(50, 1e6, TRUE),
                b = sample(50, 1e6, TRUE),
                c = sample(50, 1e6, TRUE))

options(datatable.auto.index = FALSE)

microbenchmark::microbenchmark(
  short_circuit = {
    ind <- which(dt[['a']] == 5L)
    for(col in c('b','c')){
      ind <- ind[dt[[col]][ind] == 5L]
      }
    dt[ind]
  }
  ,
  dt[a == 5L & b == 5L & c == 5L]
)

Unit: milliseconds
                            expr       min       lq      mean    median        uq     max neval
                   short_circuit  7.491401  7.64405  9.255392  7.800102  8.157352 35.8457   100
 dt[a == 5L & b == 5L & c == 5L] 14.899000 15.16840 20.127040 15.454951 18.597202 63.2137   100

jangorecki commented 4 years ago

I haven't yet got a precise idea for that, but there are some potential ways I would explore.

Other ideas are very welcome.

ColeMiller1 commented 4 years ago

I like the idea of setting a threshold. There may be some benefit to making this interface visible to users, with something like options(datatable.fwhich_optimize = c(NA, TRUE, FALSE)) where

  1. NA would be auto-optimization with the thresholds, backing out when the threshold is determined to be crossed. I assume this would mean we would want to realize a logical ans in case it turns out the threshold has been crossed; meaning, if we probe and make an integer field first, we'd be back to square one and have to make a logical field again.
  2. TRUE would be auto-optimization with no thresholds. For users who may have a lot of criteria with ~25% TRUE.
  3. FALSE would not optimize. For users who use a lot of UDFs that might depend on the original vector instead of a subset (e.g., something like rleid(), but an actual UDF).

Regarding point 1, it may look something like this in R:

library(data.table)
dt = data.table(a = sample(50, 1e6, TRUE),
                b = sample(50, 1e6, TRUE),
                c = sample(50, 1e6, TRUE))

lgl_ind = dt[['a']] == 5L
if (sum(lgl_ind) <= 0.1 * nrow(dt)) {
  # below the threshold: switch to integer indices and short-circuit the rest
  ind <- which(lgl_ind)
  for (col in c('b', 'c')) {
    ind <- ind[dt[[col]][ind] == 5L]
  }
  res <- dt[ind]
} else {
  # above the threshold: back out to the plain logical AND
  lgl_ind <- lgl_ind & dt[['b']] == 5L & dt[['c']] == 5L
  res <- dt[lgl_ind]
}
res

jangorecki commented 4 years ago

fwhich is getting a rewrite. It won't be a drop-in replacement for which for sure, mostly because which just wants a logical vector and we want to skip that step. So, for example, whenever == involves recycling inside which, it will behave differently in fwhich. We could assume it will be a drop-in replacement for best-practice usage of which, but that is not enough to optimize which into fwhich internally. Still, the speed-ups look so nice that having an fwhich function exported makes sense. Yet for handling %in% cases well we need an easier interface to the C bmerge function.
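
For reference, a small base-R illustration of the recycling case mentioned above (not fwhich code):

x <- c(1L, 2L, 3L, 2L, 1L, 2L)
which(x == c(1L, 2L))  # RHS is recycled to 1,2,1,2,1,2 -> returns 1 2 4 5 6
# an fwhich that skips materializing the logical vector would not necessarily
# reproduce this recycling, hence not a drop-in replacement for which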