MichaelChirico opened this issue 6 years ago
Good idea, but it should be implemented without making a new class; just checking attributes should be enough to detect and apply the optimization.
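For instance, the attribute check could look something like this inside unique.data.table (a minimal sketch; can_use_sorted_path is a hypothetical helper, not existing data.table code):
library(data.table)

# hypothetical helper: the fast path applies when the table's key starts with
# the requested columns, because the data is then already sorted by them
can_use_sorted_path = function(x, by) {
  haskey(x) && identical(head(key(x), length(by)), by)
}

DT = data.table(V1 = sample(1e3L, 1e5L, TRUE))
can_use_sorted_path(DT, "V1")  # FALSE: no key set yet
setkey(DT, V1)
can_use_sorted_path(DT, "V1")  # TRUE: key attribute detected, fast path possible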
I get a faster response again using unique.data.table:
Unit: milliseconds
                             expr  min   lq mean median   uq  max neval cld
                    unique(DT$V1) 1285 1285 1345   1345 1405 1405     2   c
                 DT[, unique(V1)] 1266 1266 1280   1280 1294 1294     2   c
 DT[, TRUE, keyby = "V1"][["V1"]] 1012 1012 1022   1022 1031 1031     2   b
    unique(DT, by = "V1")[["V1"]]  300  300  346    346  393  393     2   a
@HughParsonage do you have a script for those numbers?
Just microbenchmark on the expressions.
That said, I can't reproduce the numbers I got back in 2018. Further, here is a direct way in C++ that I wrote last year that seems to produce faster results (for integer vectors at least):
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector do_unique_sorted(IntegerVector x) {
  // assumes x is already sorted: a value is unique exactly when it differs from its predecessor
  int n = x.length();
  std::vector<int> o;
  if (n == 0) return wrap(o);   // guard against empty input
  o.reserve(n);
  int i = 0;
  o.push_back(x[i]);
  while (++i < n) {
    int xi = x[i];
    if (xi != x[i - 1]) {
      o.push_back(xi);
    }
  }
  return wrap(o);
}
bench::mark(DT[, TRUE, keyby = "V1"][["V1"]],
unique(DT$V1),
unique(DT, by = "V1")[["V1"]],
do_unique_sorted(DT$V1))
# A tibble: 4 x 14
expression min mean median max `itr/sec` mem_alloc n_gc n_itr
<chr> <bch:tm> <bch:tm> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
1 "DT[, TRU~ 132.675ms 135.174ms 133.808ms 140.405ms 7.40 1600.242KB 0 4
2 "unique(D~ 1238.871ms 1238.871ms 1238.871ms 1238.871ms 0.807 1439591.766KB 1 1
3 "unique(D~ 1138.289ms 1138.289ms 1138.289ms 1138.289ms 0.879 391813.172KB 1 1
4 "do_uniqu~ 79.655ms 80.354ms 79.918ms 82.283ms 12.4 393.164KB 0 7
# ... with 5 more variables: total_time <bch:tm>, result <list>, memory <list>, time <list>,
# gc <list>
As proposed in #4346, if we also collected group information as an attribute of the index, i.e. instead of
o = forderv(DT, by="V1", sort=TRUE, retGrp=FALSE)
we used
o = forderv(DT, by="V1", sort=TRUE, retGrp=TRUE)
then we could quite easily optimize cases like this one using "lazy forder" #4386.
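For example, something like this (a sketch using the internal forderv, whose interface and attribute names are an implementation detail and may change):
o = data.table:::forderv(DT, by = "V1", sort = TRUE, retGrp = TRUE)
starts = attr(o, "starts")  # first position of each group in sorted order
# sorted unique values fall out of the group starts without any extra pass:
uV1 = if (length(o)) DT$V1[o[starts]] else DT$V1[starts]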
DT[ , unique(V1)] is tricky to optimize because the unique being called is base R's unique, not ours, and it doesn't have access to attributes like the key or indices. It is different for DT[, keyby=] or unique(DT, by=), where we can access DT and its key or indices.
I think we all agree that optimizing with index rather than key is much more useful.
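For readers less familiar with the distinction: a key physically reorders the table, while an index only stores the ordering vector as an attribute, so index-based optimization can help without touching row order. For example:
DT = data.table(V1 = sample(10L, 100L, TRUE))
setindex(DT, V1)   # stores the ordering as an attribute; rows stay where they are
indices(DT)        # "V1"
setkey(DT, V1)     # this one physically sorts the table by reference
key(DT)            # "V1"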
Here are the timings I am getting now, using a merge of #4370 and #4386. There could be some further speed-up after #4467 as well.
The function fdistinct is from #4370 but merged into #4386 to re-use an existing index.
Verbose output is included to show whenever an index optimization was attempted. If there are two verbose lines, it means two calls to forder were made; the second, AFAIK, reverts to the original order, which is very cheap because it operates on an integer vector whose length is the number of unique values.
library(data.table)
fdistinct = data.table:::fdistinct
NN = 1e8
set.seed(13013)
DT = data.table(sample(1e5, NN, TRUE)) # 400MB
setindexv(DT, "V1")
system.time(a1<-unique(DT$V1))
# user system elapsed
# 2.952 0.592 3.544
system.time(a2<-DT[ , unique(V1)])
# user system elapsed
# 3.062 0.984 4.047
system.time(a3<-DT[ , TRUE, by = V1][["V1"]])
# user system elapsed
# 10.642 2.513 1.053
##Finding groups using forderv ... forderLazy: opt not possible: is.data.table(DT)=0, sortGroups=0, all1(ascArg)=1
system.time(a4<-DT[ , TRUE, keyby = V1][["V1"]]) ## keyby sorts, so the result order differs, but let's have it for comparison
# user system elapsed
# 1.903 0.371 2.275
##Finding groups using uniqlist on index 'V1' ... 1.914s elapsed (1.586s cpu)
system.time(a5<-unique(DT, by="V1")[["V1"]])
# user system elapsed
# 10.153 1.184 0.769
##forderLazy: opt not possible: is.data.table(DT)=1, sortGroups=0, all1(ascArg)=1
##forderLazy: opt not possible: is.data.table(DT)=0, sortGroups=1, all1(ascArg)=1
system.time(a6<-fdistinct(DT, "V1")[["V1"]])
# user system elapsed
# 0.040 0.044 0.075
##forderLazy: using existing index: __V1
##forderLazy: opt not possible: is.data.table(DT)=0, sortGroups=1, all1(ascArg)=1
all.equal(a1, a2) && all.equal(a1, a3) && all.equal(sort(a1), a4) && all.equal(a1, a5) && all.equal(a1, a6)
#[1] TRUE
The first two are slowest because they dispatch to base::unique.
It is nice to see how fdistinct can outperform the other operations here. This is largely because of
https://github.com/Rdatatable/data.table/blob/2884b29519804c75081c4011daf7d7f7f1f96546/R/mergelist.R#L53-L56
Normally, when computing unique we would call forderv(sort=FALSE), which cannot use an index because indices are sort=TRUE. For unique we don't need the order itself, only the starts attribute. Those few lines of code check whether an index is available and, if so, use it. They then have to re-order at the end, which is handled by the following line:
if (sort && length(o <- forderv(f))) f = f[o] ## this rolls back to original order
The same optimization could be applied to unique(DT, by="V1").
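A self-contained sketch of that trick outside of fdistinct (using the internal forderv, so the names and details are illustrative rather than the actual implementation):
distinct_via_index = function(DT, col) {
  # sort=TRUE means an existing index (indices are always sort=TRUE) can be reused
  o = data.table:::forderv(DT, by = col, sort = TRUE, retGrp = TRUE)
  starts = attr(o, "starts")
  f = if (length(o)) o[starts] else starts  # first row of each group, in sorted order
  # roll back to original (first-appearance) order; cheap since length(f) == number of groups
  if (length(ro <- data.table:::forderv(f))) f = f[ro]
  DT[f]
}
identical(distinct_via_index(DT, "V1")[["V1"]], unique(DT$V1))  # should be TRUE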
To actually optimize calls like DT[, unique(V1)], it would probably be best to follow the same strategy as optimizing DT[order(...)] into DT[forderv(DT, ...)] in #4458.
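Roughly, that substitution amounts to the following hand-rolled equivalence (illustrative only; the real optimization happens inside [.data.table):
# what the i-optimization effectively does for DT[order(V1)]:
o = data.table:::forderv(DT, by = "V1", sort = TRUE)
res = if (length(o)) DT[o] else copy(DT)  # length 0 means the table is already sorted
all.equal(res, DT[order(V1)], check.attributes = FALSE)  # should be TRUE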
For one-key tables, users will soon benefit from fast unique/duplicated on ALTREP objects:
https://twitter.com/groundwalkergmb/status/1389685109909975040
This changes the calculus for this FR a bit. Related: #4697
For what it is worth, I think fdistinct would be amongst the functions we would use the most. From what I understand there are 2 orders of magnitude to be gained, which would be amazing.
fdistinct is not exported in the mergelist PR. I agree it makes sense to export it as well. The 2 orders of magnitude assume an index (retGrp=TRUE) has been set beforehand.
Once this is resolved, this question could be nicely addressed: https://stackoverflow.com/questions/74286104/what-is-causing-this-difference-in-performance-in-indexing-between-data-table-an
keyed tables are already known to be sorted, so finding unique values is much easier than it is in the general case. Compare:

It seems to me we should be able to match (or exceed) the final time in the second call to unique (i.e. within []).

If we were willing to do something like add a dt_primary_key class to the primary key, we could also achieve this speed in the first approach by writing a unique.dt_primary_key method, but I'm not sure how extensible this is to multiple keys (S4?)
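A sketch of what such a method could look like (dt_primary_key and the method below are hypothetical, not existing data.table API; NA handling is ignored for brevity):
# hypothetical: the key column would carry class "dt_primary_key"
unique.dt_primary_key = function(x, incomparables = FALSE, ...) {
  x = unclass(x)      # drop the marker class, keep the already-sorted data
  if (!length(x)) return(x)
  # a keyed column is sorted, so a value is new exactly when it differs from its predecessor
  x[c(TRUE, x[-1L] != x[-length(x)])]
}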