arunsrinivasan opened this issue 10 years ago
Yes please. Maybe some non-standard evaluation hacking in `[.data.table` is enough.
I guess you should decide how far you want to go with optimizing `i` (I am thinking of `|` and `&`). If not very far very soon, I would really want a way for `x[y, on=c("a"="b")]` to retain the key of `x` (see #1268), so that a concerned user can achieve high performance easily, in an imperative way.

Best, Jan
See, for example, this >50x speedup:
```r
library(data.table)
options(datatable.verbose = FALSE)
dt = data.table(x = sample(1E5, 1E7, replace = TRUE),
                y = sample(1E5, 1E7, replace = TRUE),
                key = c("x", "y"))
goodX = sample(1E5, 1E4, replace = FALSE)
goodY = sample(1E5, 1E4, replace = FALSE)
fCJ <- function() dt[CJ(goodX, goodY), nomatch = 0]
fDJ <- function() setkey(setattr(setattr(dt[data.table(x = goodX, key = "x"), nomatch = 0],
                                         "sorted", c("x", "y")
                                 )[data.table(y = goodY, key = "y"), nomatch = 0, on = "y"],
                                 "sorted", c("y", "x")),
                         x, y)[]

system.time(rCJ <- fCJ())
##    user  system elapsed
##    2.33    0.56    2.89
system.time(rDJ <- fDJ())
##    user  system elapsed
##    0.02    0.03    0.05
identical(rCJ, rDJ)
## [1] TRUE
```
EDIT: Ah, sorry; before posting I did some further "optimization" and didn't notice that it produced wrong results. Hopefully my point still holds :)
@jan-glx I'm confused - what exactly are you trying to demonstrate? I didn't run your code, but you show that the result of your two functions is not identical, and then you talk about a speedup...

Fwiw, I would want to see a different function if this idea gets implemented, not a drastic change to what `CJ` does, as I use it the other way around all the time: `CJ(...)[...]`.
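For illustration, a minimal made-up example of that pattern (the column names and values here are not from the thread): `CJ` as the left-hand table, with the data joined onto the full grid.

```r
library(data.table)
dt <- data.table(x = c(1L, 1L, 2L), y = c(1L, 2L, 2L),
                 v = c(10, 20, 30), key = c("x", "y"))
# The direction discussed above: subset/expand dt by the grid of combinations.
dt[CJ(x = 1:2, y = 1:2)]
# "The other way around": the grid is the left table; fill v by update-join,
# leaving NA for combinations absent from dt (here, x = 2 & y = 1).
CJ(x = 1:2, y = 1:2)[dt, v := i.v][]
```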
Here is the example with the "optimization" on again, plus another direct implementation that might make it clearer when this is useful.
```r
library(data.table)
options(datatable.verbose = FALSE)
dt = data.table(x = sample(1E5, 1E7, replace = TRUE),
                y = sample(1E5, 1E7, replace = TRUE),
                key = c("x", "y"))
goodX = sample(1E5, 1E4, replace = TRUE)
goodY = sample(1E5, 1E4, replace = TRUE)
fNJ <- function() dt[x %in% goodX & y %in% goodY, nomatch = 0]
fCJ <- function() dt[CJ(goodX, goodY, unique = TRUE), nomatch = 0]
fDJ <- function() setkey(setattr(setattr(dt[unique(data.table(x = goodX, key = "x")), nomatch = 0],
                                         "sorted", c("x", "y")
                                 )[unique(data.table(y = goodY, key = "y")), nomatch = 0, on = "y"],
                                 "sorted", c("y", "x")),
                         x, y)[]

system.time(rNJ <- fNJ())
##    user  system elapsed
##    1.17    0.03    1.21
system.time(rCJ <- fCJ())
##    user  system elapsed
##    2.06    0.59    2.66
system.time(rDJ <- fDJ())
##    user  system elapsed
##    0.05    0.00    0.05
identical(rNJ, rCJ)
## [1] TRUE
identical(rCJ, rDJ)
## [1] TRUE
```
Hm, are you just asking that autoindexing be extended? Seems like the clearest demonstration is just

```r
r <- dt[x %in% sort(unique(goodX))][y %in% sort(unique(goodY))]  # just as fast as rDJ
identical(r, rDJ)  # TRUE
```

(All of your `setattr` calls sort of cloud what's going on to me, not to mention the `nomatch=0`, which should have no effect in `fNJ`.) Anyway, I agree that further optimization of `fNJ` would be nice, as would some convenient shortcut, like `DT[list(x=goodX, y=goodY, z=goodZ), on=.ALLIN]`
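No such `on=.ALLIN` exists in data.table; purely as a sketch of what the shortcut could expand to, a hypothetical helper (the name `allin` is made up) could chain one sorted `%in%` filter per named column:

```r
library(data.table)
# Hypothetical helper: keep rows whose value in each named column is contained
# in the corresponding list element, one %in% filter per column.
allin <- function(dt, cols) {
  for (nm in names(cols)) {
    dt <- dt[dt[[nm]] %in% sort(unique(cols[[nm]]))]
  }
  dt
}

dt <- data.table(x = 1:6, y = rep(1:3, 2), key = c("x", "y"))
allin(dt, list(x = c(2L, 4L), y = c(1L, 2L)))  # rows with x in {2,4} AND y in {1,2}
```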
Kinda, yes. I would prefer to describe what I want instead of having to say explicitly what I want R to do. However, the current optimization only works in some (though the most important) cases. And I am afraid it supports using data.table in a non-data.table way, leading to problems as soon as the optimization does not kick in. E.g. see:
```r
> library(data.table)
> dt = data.table(x = sample(1E5, 1E8, replace=T))
> setkey(dt, x, verbose=T)
forder took 2.47 sec
reorder took 2.16 sec
> system.time(dt[.(42),])
   user  system elapsed
      0       0       0
> system.time(dt[x == 42,])
   user  system elapsed
      0       0       0
> system.time(dt[42 == x,])
   user  system elapsed
   0.72    0.38    1.09
> system.time(dt[x %in% 42,])
   user  system elapsed
      0       0       0
> system.time(dt[x %in% 42 & TRUE,])
   user  system elapsed
   4.38    0.54    4.92
```
But I am not sure what I want or what the best solution would be... and how much of it concerns this issue... Expressing the `WHERE ... AND ...` as `CJ` felt fine to me (apart from the fact that it expands, and is terribly slow for a data.table command)... A real query language for `i` would be cool (one that warns you if it cannot be optimized into joins). Alternatively, joins that preserve the keys, or data.table automatically creating and keeping indices where easily possible, like what @jangorecki seems to be implementing, would be awesome as well.
PS: In your `%in%` query: why is the key preserved (only) if both good arrays are sorted?
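One way to probe that question empirically (a sketch; the behavior may differ across data.table versions, so no expected output is shown):

```r
library(data.table)
dt <- data.table(x = sample(100, 1000, replace = TRUE), key = "x")
goodX <- sample(100, 10)
key(dt[x %in% sort(unique(goodX))])  # is the key retained after the sorted %in% subset?
key(dt[x %in% goodX])                # ...and after the unsorted one?
```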
Regarding the multiple indexes you've linked: it is quite a heavy implementation, as it stores not just the integer order sequence for each index but all the column values included in each index, see idxv.R#L10. I'm not sure whether what you discuss here fits the first post, which is clearly focused on a single change to `CJ`.
Where does this stand now that R has ALTREP?
> where does this stand now that R has ALTREP?
If the version below already includes ALTREP, then there is still room for improvement:
```r
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] data.table_1.11.4    RevoUtils_11.0.0     RevoUtilsMath_11.0.0

loaded via a namespace (and not attached):
[1] compiler_3.5.0 tools_3.5.0
```
I wouldn't go that way unless there is an officially documented (at least on the r-devel mailing list) ALTREP API to represent cross products of data.frames; otherwise such an API is likely to change without any notice.
> such API is likely to change without any notice.
I think it could be our own ALTREP class, though I still can't find good documentation on how to use ALTREP, exactly :)
Submitted by: Matt Dowle; Assigned to: Nobody; R-Forge link
`CJ()` could return a data.table with irregular column lengths (its inputs unchanged), marked somehow so that the binary search knows to `%%`-index through the CJ columns, rather than allocating long vectors and populating them as `CJ` currently does. Purely for speed, with no changes for users, unless perhaps `CJ()` has been used outside `[.data.table`.

Or, more generally, data.table columns could be marked so that indexing into them knows they should be recycled, without actually allocating and recycling the columns. Discussion on datatable-help needed, as under-the-hood code, e.g. `DT$foo[903]`, might not work if the column was internally a short vector. In reality it's probably only useful for `CJ()`, as regular data.tables only hold recycled data perhaps at the start, when created with, say, an NA column, but they are then quickly populated. If a large table has a column with the same value (say NA or 0) for every row, perhaps the column is redundant. Can't imagine that recycling >1 items is often useful (other than in `CJ()`).
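The `%%`-indexing idea can be written out in plain R (a sketch of the arithmetic only, not data.table internals): row `i` of the virtual cross product is recovered by integer division and modulo over the input lengths, without ever allocating the expanded vectors.

```r
a <- c(10L, 20L, 30L)   # first CJ input, length n1 = 3
b <- c(1L, 2L)          # second CJ input, length n2 = 2 (varies fastest, as in CJ)
n2 <- length(b)

# Virtual row i (1-based) of CJ(a, b) without allocating length n1*n2 vectors:
cj_row <- function(i) c(a[(i - 1L) %/% n2 + 1L], b[(i - 1L) %% n2 + 1L])

cj_row(4L)  # c(20L, 2L): the fourth row of the fully expanded CJ(a, b)
```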