Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0
3.59k stars 978 forks source link

[R-Forge #1705] Change CJ internally not to expand, for efficiency #652

Open arunsrinivasan opened 10 years ago

arunsrinivasan commented 10 years ago

Submitted by: Matt Dowle; Assigned to: Nobody; R-Forge link

CJ() could return an irregular column length data.table (its inputs unchanged), marked somehow so that binarysearch knows to %% index through the CJ columns, rather than allocating long vectors and populating them as CJ currently does. Purely for speed, with no changes for users unless perhaps CJ() has been used outside [.data.table.

Or, more generally, data.table columns could be marked so that indexing to them knows they should be recycled, without actually allocating and recycling the columns. Discuss on datatable-help needed as under-the-hood code e.g. DT$foo[903] might not work if it was internally a short vector. In reality it's probably only useful for CJ() as regular data.table's only have recycled data perhaps to start with creating them with say an NA column but then are quickly populated. If a large table has a column with the same value (say NA or 0) for every row, perhaps the column is redundant. Can't imagine that recycling >1 items is often useful (other than in CJ()).

jan-glx commented 9 years ago

Yes please. Maybe some non standard evaluation hacking in [.data.table is enough. I guess you should decide how far you want to go with optimizing i (I am thinking of | and &)..If not very far very soon, I would really want to have a way to x[y, on=c("a"=b)] retaining the key of x( see #1268), such that the concerned user can achieve high performance easily in an imperative way. best,Jan

see for example this >50x speedup:

library(data.table)
options(datatable.verbose = F)
dt = data.table(x = sample(1E5, 1E7, replace=T), 
                y = sample(1E5, 1E7, replace=T),
                key = c("x", "y"))
goodX = sample(1E5, 1E4, replace=F)
goodY = sample(1E5, 1E4, replace=F)

fCJ <- function() dt[CJ(goodX, goodY), nomatch=0]
fDJ <- function() setkey(setattr(setattr(dt[data.table("x"=goodX, key="x"), nomatch=0],
                                         "sorted",c("x","y")
                                         )[data.table("y"=goodY, key="y"), nomatch=0, on="y"],
                                 "sorted",c("y","x"))
                         ,x, y)[]
system.time(rCJ <- fCJ())
##    user  system elapsed 
##    2.33    0.56    2.89
system.time(rDJ <- fDJ())
##    user  system elapsed 
##    0.02    0.03    0.05
identical(rCJ, rDJ)
## [1] TRUE

EDIT: Ah sry, before posting I did some further "optimization" and didn't notice it produced wrong results with it. My point now holds hopefully:)

eantonya commented 9 years ago

@jan-glx I'm confused - what exactly are you trying to demonstrate? I didn't run your code, but you show that the result of your two functions is not identical, then you talk about a speedup...

Fwiw I would want to see a different function if this idea gets implemented, and not a drastic change to what CJ does - as I use it the other way around all the time - CJ(...)[...].

jan-glx commented 9 years ago

Here is the example with the "optimization" on again. And another direct implementation that might make clearer when this is useful.

library(data.table)
options(datatable.verbose = F)
dt = data.table(x = sample(1E5, 1E7, replace=T), 
                y = sample(1E5, 1E7, replace=T),
                key = c("x", "y"))
goodX = sample(1E5, 1E4, replace=T)
goodY = sample(1E5, 1E4, replace=T)

fNJ <- function() dt[x %in% goodX & y %in% goodY, nomatch=0]
fCJ <- function() dt[CJ(goodX, goodY, unique = TRUE), nomatch=0]
fDJ <- function() setkey(setattr(setattr(dt[unique(data.table("x"=goodX, key="x")), nomatch=0],
                                         "sorted",c("x","y")
                                         )[unique(data.table("y"=goodY, key="y")), nomatch=0, on="y"],
                                 "sorted",c("y","x"))
                         ,x, y)[]

system.time(rNJ <- fNJ())
##    user  system elapsed 
##    1.17    0.03    1.21
system.time(rCJ <- fCJ())
##    user  system elapsed 
##    2.06    0.59    2.66
system.time(rDJ <- fDJ())
##    user  system elapsed 
##    0.05    0.00    0.05
identical(rNJ, rCJ)
## [1] TRUE
identical(rCJ, rDJ)
## [1] TRUE
franknarf1 commented 9 years ago

Hm, are you just asking that autoindexing be extended? Seems like the clearest demonstration is just

r <- dt[x %in% sort(unique(goodX))][y %in% sort(unique(goodY))] # just as fast as rDJ
identical(r,rDJ) # TRUE

(All of your setattr sort of cloud what's going on to me, not to mention the nomatch=0 which should have no effect in fNJ.) Anyway, I agree that further optimization for fNJ would be nice, as would some convenient shortcut, like DT[list(x=goodX, y=goodY, z=goodZ), on=.ALLIN]

jan-glx commented 9 years ago

Kinda, yes. I would prefer to to describe what I want instead of heaving to explicitly say want R to do. However current optimization only works for some (through the most important) cases. And I am afraid it is supporting the use of data.table in a not data.table way leading to problems as soon as optimization does not kick in. E.g. see:

> library(data.table)
> dt = data.table(x = sample(1E5, 1E8, replace=T))
> setkey(dt,x,verbose=T)
forder took 2.47 sec
reorder took 2.16 sec
> system.time(dt[.(42),])
   user  system elapsed 
      0       0       0 
> system.time(dt[x==42,])
   user  system elapsed 
      0       0       0 
> system.time(dt[42==x,])
   user  system elapsed 
   0.72    0.38    1.09 
> system.time(dt[x %in% 42,])
   user  system elapsed 
      0       0       0 
> system.time(dt[x %in% 42 & TRUE,])
   user  system elapsed 
   4.38    0.54    4.92 

But I am not sure what I want or what the best solution would be...an how much of it concerns this issue... Expressing the WHERE ... AND ... as CJ felt fine for me (apart from the fact that it expands and is terrible slow (for a dt command))... A real query language for i would be cool (one that warns you if it cannot be optimized to joins) Alternatively if one could do joins that preserve the keys or if data.table would automatically create and keep if easily possible indices like in what @jangorecki seems to be implementing That would be awesome as well.

PS: In your %in% query: Why is the key preserved (only) if both good arrays are sorted?

jangorecki commented 9 years ago

Regarding multiple indexes you've linked, it quite heavy implementation as it stores not just integer order sequence for each index but all the column values included in the index for each index, see idxv.R#L10. I'm not sure if this what you discuss here fits to the first post which is clearly focused on single change for CJ.

MichaelChirico commented 5 years ago

where does this stand now that R has ALTREP?

jan-glx commented 5 years ago

where does this stand now that R has ALTREP?

If the following version includes ALTREP then the room for improvement is still present.

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.11.4    RevoUtils_11.0.0     RevoUtilsMath_11.0.0

loaded via a namespace (and not attached):
[1] compiler_3.5.0 tools_3.5.0   
jangorecki commented 5 years ago

I wouldn't go that way unless there will be officially documented (at least on r-devel mailing) ALTREP api to represent cross product of data.frames, otherwise such API is likely to change without any notice.

MichaelChirico commented 5 months ago

such API is likely to change without any notice.

I think it could be our own ALTREP class, though I still don't find good documentation on how to use ALTREP, exactly :)