MichaelChirico opened this issue 6 years ago
Good idea, but it should be implemented without making a new class; just checking attributes should be enough to detect and apply the optimization.
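For instance, the attribute check could look something like this inside unique.data.table (a minimal sketch; can_use_sorted_path is a hypothetical helper, not existing data.table code):
library(data.table)

# hypothetical helper: the fast path applies when the table's key starts with
# the requested columns, because the data is then already sorted by them
can_use_sorted_path = function(x, by) {
  haskey(x) && identical(head(key(x), length(by)), by)
}

DT = data.table(V1 = sample(1e3L, 1e5L, TRUE))
can_use_sorted_path(DT, "V1")  # FALSE: no key set yet
setkey(DT, V1)
can_use_sorted_path(DT, "V1")  # TRUE: key attribute detected, fast path possible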
I get a faster response again using unique.data.table:
Unit: milliseconds
                             expr  min   lq mean median   uq  max neval cld
                    unique(DT$V1) 1285 1285 1345   1345 1405 1405     2   c
                 DT[, unique(V1)] 1266 1266 1280   1280 1294 1294     2   c
 DT[, TRUE, keyby = "V1"][["V1"]] 1012 1012 1022   1022 1031 1031     2   b
    unique(DT, by = "V1")[["V1"]]  300  300  346    346  393  393     2   a
@HughParsonage do you have a script for those numbers?
Just microbenchmark on the expressions.
That said, I can't reproduce the numbers I got back in 2018. Further, here is a direct way in C++ that I wrote last year that seems to produce faster results (for integer vectors at least):
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
IntegerVector do_unique_sorted(IntegerVector x) {
  // assumes x is already sorted: a value is unique exactly when it differs from its predecessor
  int n = x.length();
  std::vector<int> o;
  if (n == 0) return wrap(o);   // guard against empty input
  o.reserve(n);
  int i = 0;
  o.push_back(x[i]);
  while (++i < n) {
    int xi = x[i];
    if (xi != x[i - 1]) {
      o.push_back(xi);
    }
  }
  return wrap(o);
}
bench::mark(DT[, TRUE, keyby = "V1"][["V1"]],
unique(DT$V1),
unique(DT, by = "V1")[["V1"]],
do_unique_sorted(DT$V1))
# A tibble: 4 x 14
expression min mean median max `itr/sec` mem_alloc n_gc n_itr
<chr> <bch:tm> <bch:tm> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int>
1 "DT[, TRU~ 132.675ms 135.174ms 133.808ms 140.405ms 7.40 1600.242KB 0 4
2 "unique(D~ 1238.871ms 1238.871ms 1238.871ms 1238.871ms 0.807 1439591.766KB 1 1
3 "unique(D~ 1138.289ms 1138.289ms 1138.289ms 1138.289ms 0.879 391813.172KB 1 1
4 "do_uniqu~ 79.655ms 80.354ms 79.918ms 82.283ms 12.4 393.164KB 0 7
# ... with 5 more variables: total_time <bch:tm>, result <list>, memory <list>, time <list>,
# gc <list>
As proposed in #4346, if we also collected group information as an attribute of the index, i.e. instead of
o = forderv(DT, by="V1", sort=TRUE, retGrp=FALSE)
we used
o = forderv(DT, by="V1", sort=TRUE, retGrp=TRUE)
then we could quite easily optimize cases like this one using "lazy forder" #4386.
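For example, something like this (a sketch using the internal forderv, whose interface and attribute names are an implementation detail and may change):
o = data.table:::forderv(DT, by = "V1", sort = TRUE, retGrp = TRUE)
starts = attr(o, "starts")  # first position of each group in sorted order
# sorted unique values fall out of the group starts without any extra pass:
uV1 = if (length(o)) DT$V1[o[starts]] else DT$V1[starts]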
DT[ , unique(V1)] is tricky to optimize because the unique being called is base R's unique, not ours, and it doesn't have access to attributes like the key or indices. It is different for DT[, keyby=] or unique(DT, by=), where we can access DT and its key or indices.
I think we all agree that optimizing with index rather than key is much more useful.
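For readers less familiar with the distinction: a key physically reorders the table, while an index only stores the ordering vector as an attribute, so index-based optimization can help without touching row order. For example:
DT = data.table(V1 = sample(10L, 100L, TRUE))
setindex(DT, V1)   # stores the ordering as an attribute; rows stay where they are
indices(DT)        # "V1"
setkey(DT, V1)     # this one physically sorts the table by reference
key(DT)            # "V1"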
Here are the timings I am getting now, using a merge of #4370 and #4386. There could be some further speed-up after #4467 as well.
The function fdistinct is from #4370 but merged into #4386 to re-use an existing index.
Verbose output is included to show whenever an index optimization was attempted. If there are two verbose lines, it means two calls to forder were made; the second, AFAIK, reverts to the original order, which is very cheap because it operates on an integer vector whose length is the number of unique values.
library(data.table)
fdistinct = data.table:::fdistinct
NN = 1e8
set.seed(13013)
DT = data.table(sample(1e5, NN, TRUE)) # 400MB
setindexv(DT, "V1")
system.time(a1<-unique(DT$V1))
# user system elapsed
# 2.952 0.592 3.544
system.time(a2<-DT[ , unique(V1)])
# user system elapsed
# 3.062 0.984 4.047
system.time(a3<-DT[ , TRUE, by = V1][["V1"]])
# user system elapsed
# 10.642 2.513 1.053
##Finding groups using forderv ... forderLazy: opt not possible: is.data.table(DT)=0, sortGroups=0, all1(ascArg)=1
system.time(a4<-DT[ , TRUE, keyby = V1][["V1"]]) ## keyby sorts, so the result order differs, but let's have it for comparison
# user system elapsed
# 1.903 0.371 2.275
##Finding groups using uniqlist on index 'V1' ... 1.914s elapsed (1.586s cpu)
system.time(a5<-unique(DT, by="V1")[["V1"]])
# user system elapsed
# 10.153 1.184 0.769
##forderLazy: opt not possible: is.data.table(DT)=1, sortGroups=0, all1(ascArg)=1
##forderLazy: opt not possible: is.data.table(DT)=0, sortGroups=1, all1(ascArg)=1
system.time(a6<-fdistinct(DT, "V1")[["V1"]])
# user system elapsed
# 0.040 0.044 0.075
##forderLazy: using existing index: __V1
##forderLazy: opt not possible: is.data.table(DT)=0, sortGroups=1, all1(ascArg)=1
all.equal(a1, a2) && all.equal(a1, a3) && all.equal(sort(a1), a4) && all.equal(a1, a5) && all.equal(a1, a6)
#[1] TRUE
The first two are slowest because they dispatch to base::unique.
It is nice to see how fdistinct can outperform the other operations here. This is largely because of
https://github.com/Rdatatable/data.table/blob/2884b29519804c75081c4011daf7d7f7f1f96546/R/mergelist.R#L53-L56
Normally, when computing unique we would call forderv(sort=FALSE), which cannot use an index because indices are sort=TRUE. For unique we don't need the order itself, only the starts attribute. Those few lines of code check whether an index is available and, if so, use it. They then have to re-order at the end, which is handled by the following line:
if (sort && length(o <- forderv(f))) f = f[o] ## this rolls back to original order
The same optimization could be applied to unique(DT, by="V1").
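A self-contained sketch of that trick outside of fdistinct (using the internal forderv, so the names and details are illustrative rather than the actual implementation):
distinct_via_index = function(DT, col) {
  # sort=TRUE means an existing index (indices are always sort=TRUE) can be reused
  o = data.table:::forderv(DT, by = col, sort = TRUE, retGrp = TRUE)
  starts = attr(o, "starts")
  f = if (length(o)) o[starts] else starts  # first row of each group, in sorted order
  # roll back to original (first-appearance) order; cheap since length(f) == number of groups
  if (length(ro <- data.table:::forderv(f))) f = f[ro]
  DT[f]
}
identical(distinct_via_index(DT, "V1")[["V1"]], unique(DT$V1))  # should be TRUE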
To actually optimize calls like DT[, unique(V1)], it would probably be best to follow the same strategy as optimizing DT[order(...)] into DT[forderv(DT, ...)] in #4458.
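Roughly, that substitution amounts to the following hand-rolled equivalence (illustrative only; the real optimization happens inside [.data.table):
# what the i-optimization effectively does for DT[order(V1)]:
o = data.table:::forderv(DT, by = "V1", sort = TRUE)
res = if (length(o)) DT[o] else copy(DT)  # length 0 means the table is already sorted
all.equal(res, DT[order(V1)], check.attributes = FALSE)  # should be TRUE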
For one-key tables, users will soon benefit from fast unique/duplicated on ALTREP objects:
https://twitter.com/groundwalkergmb/status/1389685109909975040
This changes the calculus for this FR a bit. Related: #4697
For what it is worth, I think fdistinct would be amongst the functions we would use the most. From what I understand there are 2 orders of magnitude to be gained, which would be amazing.
fdistinct is not exported in the mergelist PR. I agree it makes sense to export it as well. The 2 orders of magnitude assume an index (retGrp=TRUE) has been set beforehand.
Once this is resolved, this question could be nicely addressed: https://stackoverflow.com/questions/74286104/what-is-causing-this-difference-in-performance-in-indexing-between-data-table-an
keyed tables are already known to be sorted, so finding unique values is much easier than it is in the general case. Compare:

It seems to me we should be able to match (or exceed) the final time in the second call to unique (i.e. within []).

If we were willing to do something like add a dt_primary_key class to the primary key, we could also achieve this speed in the first approach by writing a unique.dt_primary_key method, but I'm not sure how extensible this is to multiple keys (S4?)
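A sketch of what such a method could look like (dt_primary_key and the method below are hypothetical, not existing data.table API; NA handling is ignored for brevity):
# hypothetical: the key column would carry class "dt_primary_key"
unique.dt_primary_key = function(x, incomparables = FALSE, ...) {
  x = unclass(x)      # drop the marker class, keep the already-sorted data
  if (!length(x)) return(x)
  # a keyed column is sorted, so a value is new exactly when it differs from its predecessor
  x[c(TRUE, x[-1L] != x[-length(x)])]
}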