beanumber / openWAR

An R package enabling the computation of openWAR using MLBAM data

make makeWAR faster and more memory efficient #43

Closed beanumber closed 8 years ago

davidbmitchell commented 9 years ago

Using the data.table package might speed things up, but I believe it would entail some refactoring.

beanumber commented 9 years ago

Definitely something to look into for v2.0.

beanumber commented 9 years ago

IIRC, one of the things that makes this slow is getModelFieldingCollective(), where we use the 2D kernel smoother KernSmooth::bkde2D(). For some reason the developers do not supply a predict.bkde2D() method, and so we had to implement a hack. The hack is slow because it involves Hmisc::whichClosest().

The fitting of the bkde2D() model is fine -- it's the looking up of the predicted values that is slow. If someone can think of a better way to do this, that would improve the performance of makeWAR().
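
For reference, here is a minimal sketch (not the package's actual code; nearest_index() and predict_bkde2D() are hypothetical helpers) of how a faster lookup could work: snap each query coordinate to its nearest grid point with findInterval(), then index the fhat matrix directly instead of searching the whole grid for the closest point.

library(KernSmooth)

nearest_index <- function(v, grid) {
  i <- findInterval(v, grid, all.inside = TRUE)   # grid point at or below v
  j <- i + 1                                      # its neighbor above
  ifelse(abs(v - grid[i]) <= abs(grid[j] - v), i, j)
}

predict_bkde2D <- function(fit, newdata) {
  # fit:     list returned by bkde2D(), with components x1, x2, fhat
  # newdata: two-column matrix of (x, y) query locations
  ix <- nearest_index(newdata[, 1], fit$x1)
  iy <- nearest_index(newdata[, 2], fit$x2)
  fit$fhat[cbind(ix, iy)]                         # vectorized matrix lookup
}

# usage on simulated coordinates
xy  <- cbind(rnorm(1000), rnorm(1000))
fit <- bkde2D(xy, bandwidth = c(0.5, 0.5))
head(predict_bkde2D(fit, xy))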

beanumber commented 9 years ago

Assume that data is a data.frame with hundreds of thousands of rows and 50 or so columns. Which is faster? Which is more memory efficient?

res <- myfun(data)

vs.

res <- myfun(select(data, onlyTheColumnsINeedToComputeMyFun))

?? In general, how does R make copies of data frames for passing through functions?
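
One way to answer this empirically (myfun and the column names below are stand-ins, since the real function isn't shown here) is bench::mark(), which reports both timing and total memory allocated for each expression:

library(dplyr)
library(bench)

# a few hundred thousand rows, 50 columns
data <- as.data.frame(matrix(rnorm(2e5 * 50), ncol = 50))
names(data) <- paste0("col", 1:50)

# stand-in for the real function; only uses two of the columns
myfun <- function(d) sum(d$col1 * d$col2)

bench::mark(
  full   = myfun(data),
  subset = myfun(select(data, col1, col2)),
  iterations = 20
)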

znmeb commented 9 years ago

My impression is that once you rewrite everything from "native R" to dplyr, all kinds of efficiencies magically appear thanks to Rcpp and the relentless optimizations of the dplyr engineers. So I'd say

res <- data %>% dplyr::select(onlyTheColumnsINeedToComputeMyFun) %>% myfun()

beanumber commented 9 years ago

Yeah, but what I'm interested in is the copying of objects. That is, does the object data get copied in the first case, but only a smaller object get copied in the second case? If so, does the latter have a smaller memory footprint?

znmeb commented 9 years ago

That should be easy to benchmark / profile, but you also have to monitor the garbage collector. :-( IMHO, once you've converted to dplyr, everything else is premature optimization. I haven't tested any of it, though. I'm tied up this week, but I could do some performance testing next week.

By the way - I discovered that the MASS select() can mask the dplyr select() - you have to explicitly qualify the namespace in the call. You may have already run into that and fixed it, but it bit me last week.
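
For example, a minimal illustration of the masking:

library(dplyr)
library(MASS)            # select() now resolves to MASS::select()

df <- data.frame(a = 1:3, b = 4:6)
# select(df, a)          # fails: the call goes to MASS::select()
dplyr::select(df, a)     # explicit namespace always gets the dplyr verb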

rpruim commented 9 years ago

The high-level answer is that R is lazy about copying. If you don't change an object, no copying is done. When two names refer to the same object, each knows it is not the only reference, and the copy only happens when one of them modifies the object.

So

x[1] <- 1   # probably fast if no other names refer to the same object as x, but depends on the type of object x is
x <- y      # takes basically no time, even when y is large
x[1] <- 2   # might take a very long time if y is large.

Figuring out exactly when copying happens and how that affects our code isn't entirely trivial -- it depends on how the functions you are using are written, whether they use primitives, etc. See http://adv-r.had.co.nz/memory.html#modification, which includes this:

Generally, provided that the object is not referred to elsewhere, any primitive replacement function will modify in place. This includes [[<-, [<-, @<-, $<-, attr<-, attributes<-, class<-, dim<-, dimnames<-, names<-, and levels<-. To be precise, all non-primitive functions increment refs, but a primitive function may be written in such a way that it doesn’t. The rules are sufficiently complicated that there’s little point in trying to memorise them. Instead, you should approach the problem practically by using refs() and address() to figure out when objects are being copied.

While determining that copies are being made is not hard, preventing such behaviour is. If you find yourself resorting to exotic tricks to avoid copies, it may be time to rewrite your function in C++, as described in Rcpp.

I haven't looked at your code, but if you are just passing the data frame along for information extraction and don't modify it, this should be roughly like passing a pointer to the data. But there is no answer to your question without knowing what happens inside the function.
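
One practical way to see this empirically is base R's tracemem() (assuming R was built with memory profiling, as the CRAN binaries are), which prints a message every time the traced object gets duplicated:

df <- data.frame(x = rnorm(1e5), y = rnorm(1e5))
tracemem(df)

f_read  <- function(d) sum(d$x)         # only reads: no copy
f_write <- function(d) { d$x <- 0; d }  # modifies its argument: forces a copy

f_read(df)     # silent
f_write(df)    # prints a tracemem[...] duplication message
untracemem(df)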

dplyr functions are indeed optimized by (a) using C code and (b) avoiding copying that isn't necessary. I think that for most things dplyr is only a bit slower than data.table, so if you need a dramatic speed-up, you won't get it by switching from dplyr to data.table -- but you might get it by using one or the other of these as opposed to using neither.

beanumber commented 9 years ago

The revision to getModelFieldingCollective() in e2794d4a made this part of the process MUCH faster!

I was doing a dumb lookup, but there was an easier way.

The speed up to makeWAR() is about 2x!

Running:

getWAR(May)

used to take 25.2 seconds, but now takes 12.7!

Running:

getWAR(MLBAM2012)

used to take 156.2 seconds, but now takes 70.4!

beanumber commented 9 years ago

FYI, I wrote the predict.bkde2D(). And it wasn't Hmisc::whichClosest() that was slow; it was the line of code right after it.

beanumber commented 8 years ago

I'm closing this -- we can revisit it for version 2.0.