I meant to write "and adequate RAM" (because if there is not enough RAM, the
system starts swapping and CPU usage goes way down).
Original comment by ahz...@gmail.com
on 20 Jan 2013 at 4:15
I just started looking at the documentation for the 'parallel' package which
comes with a base installation of R. So far as I can tell, it ought to be
straightforward to use parLapply to parallelize the loop that does the
cross-validation. Does anyone know differently?
In principle, it might be better to parallelize the tree-building in the C++
code (except for runs in which stumps are used), but I'm a lot more familiar
with R than C++, so I would choose the R route (a sketch of the parLapply idea
is below).
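A minimal user-level sketch of the parLapply idea, with a placeholder data
frame dat, a 0/1 response y, and arbitrary gbm settings (none of these names
come from the package itself):

library(parallel)
library(gbm)

# Sketch only: fit one gbm per CV fold, each fold on its own worker.
n.folds <- 5
cl <- makeCluster(min(n.folds, detectCores()))
clusterEvalQ(cl, library(gbm))

fold <- sample(rep(1:n.folds, length.out = nrow(dat)))
clusterExport(cl, c("dat", "fold"))

cv.fits <- parLapply(cl, 1:n.folds, function(k) {
  gbm(y ~ ., data = dat[fold != k, ], distribution = "bernoulli",
      n.trees = 4000, interaction.depth = 3, shrinkage = 0.01)
})

stopCluster(cl)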
Original comment by harry.southworth
on 23 Jan 2013 at 1:51
The benefit of doing the parallelization with %dopar% instead of in C++ is that
%dopar% abstracts the backend, so it works in different environments (such as
Linux, Windows, and SNOW clusters).
This vignette gives an example of building a random forest in parallel, so
doing CV should be similar.
http://cran.r-project.org/web/packages/foreach/vignettes/foreach.pdf
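Adapting that pattern to the CV loop might look roughly like the sketch below;
doParallel is just one possible backend, and dat, y, and the gbm settings are
placeholders rather than anything from the package:

library(doParallel)
library(gbm)

registerDoParallel(cores = 4)

n.folds <- 5
fold <- sample(rep(1:n.folds, length.out = nrow(dat)))

# One gbm fit per fold; foreach runs the iterations on the registered backend
# and collects the fitted models into a list.
cv.fits <- foreach(k = 1:n.folds, .packages = "gbm") %dopar% {
  gbm(y ~ ., data = dat[fold != k, ], distribution = "bernoulli",
      n.trees = 4000, interaction.depth = 3, shrinkage = 0.01)
}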
Original comment by ahz...@gmail.com
on 23 Jan 2013 at 9:06
I have been parallel processing with gbm, but in a slightly different context.
From what I read in the manual, part of what makes the CV calculations slower
is that one needs to use gbm() instead of gbm.fit(); the former relies on
model.frame, which slows things down. I have used gbm.fit() for repeated
bootstrap-type evaluations. This works fine in parallel - I've used
Revolution R, but any of the parallel routines that your OS supports should work.

I have been stymied trying to write my own CV routines with gbm.fit(). Again,
I have noticed in the manual that the routines internally shuffle the records
prior to training (just in case targets are grouped together). What this means
is that one cannot take advantage of gbm$valid.error for the calculations. The
CV holdout needs to be scored separately from the training runs. Thus one has
to do 10 gbm.fit() and 10 predict.gbm() runs in order to find the appropriate
number of trees, and then train on the entire data set for the final model (a
rough sketch of this workflow is below). At least for just a dozen or so
predictors it does not end up any faster. I have yet to try this with a few
hundred predictors, where it might be an advantage. One could send each fold
to a separate core, which theoretically would cut the run time by a factor of
8-10.

I recall reading that the caret package has some functionality for parallel
processing with gbm; I believe it was in the setting of exploring parameter
optimization. I have not really explored that avenue. I tend to tune parameters
with gbm.fit() and then train the model with gbm() using 10-fold CV. Probably
not the most correct approach, but it is working out okay so far.
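A rough, sequential sketch of that 10 x gbm.fit()/predict.gbm() workflow, with
a placeholder data frame of predictors x, a 0/1 response y, and a hand-rolled
Bernoulli deviance (all illustrative, not package code):

library(gbm)

n.folds <- 10
max.trees <- 4000
tree.grid <- seq(50, max.trees, by = 50)
fold <- sample(rep(1:n.folds, length.out = nrow(x)))

# Hold-out deviance for every fold at each point on the tree grid
cv.dev <- sapply(1:n.folds, function(k) {
  fit <- gbm.fit(x[fold != k, ], y[fold != k], distribution = "bernoulli",
                 n.trees = max.trees, interaction.depth = 3,
                 shrinkage = 0.01, verbose = FALSE)
  f <- predict(fit, x[fold == k, ], n.trees = tree.grid)  # one column per grid point
  apply(f, 2, function(fk) -2 * mean(y[fold == k] * fk - log(1 + exp(fk))))
})

best.iter <- tree.grid[which.min(rowMeans(cv.dev))]

# Refit on the full data with the chosen number of trees
final <- gbm.fit(x, y, distribution = "bernoulli", n.trees = best.iter,
                 interaction.depth = 3, shrinkage = 0.01, verbose = FALSE)

Each iteration of that sapply() is independent, which is exactly what makes the
fold loop a good candidate for the parallel approaches discussed above.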
Original comment by bobaron...@gmail.com
on 24 Jan 2013 at 3:59
Yes, I am suggesting sending "each fold to a separate core", or more
specifically, splitting the folds up using %dopar% (the number of workers may
not equal the number of folds). That is how I understand caret does it, and it
should help a lot.
Actually, I would use caret, except that caret seems to have a worse way of
calculating the optimum number of trees: it requires each evaluation point to
be given separately, so evaluating a grid of c(1:4000) could be slow(?) compared
to gbm.perf().
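For comparison, the gbm.perf() route scans the whole tree sequence in one call
(a sketch with placeholder data and settings):

fit <- gbm(y ~ ., data = dat, distribution = "bernoulli",
           n.trees = 4000, shrinkage = 0.01, cv.folds = 5)
best.iter <- gbm.perf(fit, method = "cv")   # best iteration over all 4000 trees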
Right now I am doing 5-fold CV with 4000 trees on a data set that is about
200K x 200, and it takes about 12 hours while my 8-core machine is running at
12% CPU (i.e., there are unused resources). If I had enough RAM to use all 8
CPUs, I should be done in ~2 hours.
Original comment by ahz...@gmail.com
on 24 Jan 2013 at 6:35
I hear you. The caret approach is probably the only way to write your own CV
routine. Since gbm.fit() shuffles the records internally, one cannot use the
nTrain parameter and be certain of which rows are used for training and
validation. If it didn't shuffle, then things would be much easier/faster: you
could combine the $valid.error values from the 10 runs without having to
iterate over the 4000 trees 5 times - it would already be done. On my wish list
for gbm.fit() would be a boolean parameter that allows one to 'turn off' this
shuffling. I am not the most advanced programmer, and perhaps others may have a
better approach. Good luck.
Original comment by bobaron...@gmail.com
on 24 Jan 2013 at 4:18
[deleted comment]
[deleted comment]
[deleted comment]
[deleted comment]
[deleted comment]
Attached is an R script that computes a cross-validation model using gbm.fit()
and parallel processing via doParallel and the foreach/%dopar% construct.
I keep finding little things to improve, so I've reposted this code several
times; this looks like a fairly solid version.
Feel free to check for errors. It is working on my machine.
Original comment by bobaron...@gmail.com
on 25 Jan 2013 at 1:21
I've started work on this. I want to use the parallel package because it comes
with the base install. Take a look at the 'parallel' branch on the source tree
if you're interested.
Original comment by harry.southworth
on 27 Jan 2013 at 9:39
[deleted comment]
Please see version 2.0-9, downloadable from the project's home page.
This passes R CMD check on my Linux system and on Windows. I still need to edit
some of the examples in the help files before considering submitting to CRAN.
I'll also attempt to address some of the other issues raised before doing that.
Please test this out and let me know if you encounter any issues.
Harry
Original comment by harry.southworth
on 28 Jan 2013 at 12:07
I tried v2.0-9 on my home laptop (Win7, 64-bit, 2 cores, RStudio), running the
robustReg example. It seems to recruit cores just fine, and the plots appear in
the example, but I do get an error:
Error in eval(expr, envir, enclos) : object 'rb8' not found
In addition: Warning messages:
1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
2: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
3: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
4: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
I don't think this is from the parallel portion; I defer to Harry.
On Wednesday I will be back in the office, where I can test on a more robust
machine and also in the Revolution R environment.
Thanks for the improvements!!!!!
Bob
Original comment by bobaron...@gmail.com
on 29 Jan 2013 at 2:08
Hmm, that script is out of date and expects residuals to be available when
they're not. I'll have a think about what to do with it.
Original comment by harry.southworth
on 29 Jan 2013 at 11:07
I used the code above for 0-1 classification with the bernoulli distribution and
it worked very well, but now I need to do a k-class classification
(factors 0 to 6), thus using the multinomial distribution.
I am not sure how to change the error function.
Original comment by plante.b...@gmail.com
on 15 Apr 2013 at 9:18
If your response is a factor, gbm should automatically decide it is multinomial
and tell you. Do myresp <- factor(myresp) to be sure.
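For example, something along these lines (dat, myresp, and the settings are
placeholders):

dat$myresp <- factor(dat$myresp)       # make sure the response is a factor
fit <- gbm(myresp ~ ., data = dat,     # distribution is then guessed as multinomial
           n.trees = 2000, interaction.depth = 3,
           shrinkage = 0.01, cv.folds = 5)
best.iter <- gbm.perf(fit, method = "cv")
# class probabilities, one per level of myresp
probs <- predict(fit, newdata = dat, n.trees = best.iter, type = "response")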
Original comment by harry.southworth
on 16 Apr 2013 at 7:24
[deleted comment]
Original comment by harry.southworth
on 26 Nov 2013 at 2:33
Original issue reported on code.google.com by
ahz...@gmail.com
on 20 Jan 2013 at 4:13