carolssnz / gradientboostedmodels

Automatically exported from code.google.com/p/gradientboostedmodels

support parallel processing #3

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Train a large data set with 5-fold CV on a machine with 2+ CPUs and adequate

What is the expected output? What do you see instead?
I expect all the CPUs to be used for the CV, but instead it is slow. Some of my
models take many hours, and after trying different interaction depths and
maximum numbers of trees, it can take days.

What version of the product are you using? On what operating system?
gbm 1.6 (and probably 2.0)

Please provide any additional information below.
Please support %dopar%, as the caret package does.

Original issue reported on code.google.com by ahz...@gmail.com on 20 Jan 2013 at 4:13

GoogleCodeExporter commented 9 years ago
I meant to write "and adequate RAM" (because if there is not enough RAM, the 
system starts swapping and CPU usage goes way down).

Original comment by ahz...@gmail.com on 20 Jan 2013 at 4:15

GoogleCodeExporter commented 9 years ago
I just started looking at the documentation for the 'parallel' package which 
comes with a base installation of R. So far as I can tell, it ought to be 
straightforward to use parLapply to parallelize the loop that does the 
cross-validation. Does anyone know differently?
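
Roughly what I have in mind, as a sketch only (X and y here are placeholder predictor/response data, not package internals):

```r
library(parallel)
library(gbm)

# Sketch: run each CV fold on its own worker. X (predictors) and
# y (0/1 response) are placeholders.
k <- 5
folds <- sample(rep(1:k, length.out = nrow(X)))

cl <- makeCluster(min(k, detectCores()))
clusterExport(cl, c("X", "y", "folds"))

fits <- parLapply(cl, 1:k, function(i) {
  library(gbm)  # load the package on each worker
  gbm.fit(X[folds != i, ], y[folds != i],
          distribution = "bernoulli", n.trees = 4000,
          interaction.depth = 3, shrinkage = 0.01, verbose = FALSE)
})
stopCluster(cl)
```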

In principle, it might be better to parallelize the tree-building in the C++
code (except for runs that use stumps), but I'm an awful lot more familiar
with R than C++, so I would choose the R route.

Original comment by harry.southworth on 23 Jan 2013 at 1:51

GoogleCodeExporter commented 9 years ago
The benefit of doing the parallelization with %dopar% instead of in C++ is that
%dopar% abstracts the backend, so it works across different environments (such
as Linux, Windows, and SNOW clusters).

This vignette gives an example of building a random forest in parallel, so 
doing CV should be similar.

http://cran.r-project.org/web/packages/foreach/vignettes/foreach.pdf
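
The random-forest example in that vignette follows roughly this pattern (paraphrased from the vignette, not copied verbatim):

```r
library(doParallel)      # also attaches foreach and parallel
library(randomForest)

# Grow chunks of the forest on separate workers, then combine them.
registerDoParallel(cores = 4)
x <- matrix(runif(500), 100)
y <- gl(2, 50)

rf <- foreach(ntree = rep(250, 4), .combine = combine,
              .packages = "randomForest") %dopar%
  randomForest(x, y, ntree = ntree)
```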

Original comment by ahz...@gmail.com on 23 Jan 2013 at 9:06

GoogleCodeExporter commented 9 years ago
I have been parallel processing with gbm, but in a slightly different context.
From what I read in the manual, part of what makes the CV calculations slower
is that one needs to use gbm() instead of gbm.fit(); the former relies on
model.frame, which slows things down. I have used gbm.fit() for repeated
bootstrap-type evaluations, and it works fine in parallel. I've used
Revolution R, but any of the parallel routines that your OS supports should work.

I have been stymied trying to write my own CV routines with gbm.fit(). The
manual notes that the routines internally shuffle the records prior to training
(just in case targets are grouped together), which means one cannot take
advantage of gbm$valid.error for the calculations: the CV holdout needs to be
scored separately from the training runs. Thus one has to do 10 gbm.fit()
calls and 10 predict.gbm() calls to find the appropriate number of trees, and
then train on the entire data set for the final model. At least with just a
dozen or so predictors it does not end up any faster; I have yet to try it with
a few hundred predictors, where it might be an advantage. One could send each
fold to a separate core, which theoretically would cut the time by a factor of
8-10; a sketch of the fold-by-fold pattern follows below.

I recall reading that the caret package has some functionality for parallel
processing with gbm, I believe in the setting of exploring parameter
optimization, but I have not really explored that avenue. I tend to tune
parameters with gbm.fit() and then train the model with gbm() using 10-fold CV.
Probably not the most correct approach, but it is working out okay so far.
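
A serial sketch of that pattern (X and y are placeholders; replacing sapply with a parallel dispatch is the obvious next step):

```r
library(gbm)

k <- 10
n.trees <- 4000
folds <- sample(rep(1:k, length.out = nrow(X)))

# One gbm.fit() and one predict() per fold; average the deviance
# curves to pick the number of trees, then refit on all the data.
err <- sapply(1:k, function(i) {
  train <- folds != i
  fit <- gbm.fit(X[train, ], y[train], distribution = "bernoulli",
                 n.trees = n.trees, shrinkage = 0.01, verbose = FALSE)
  f <- predict(fit, X[!train, ], n.trees = 1:n.trees)  # link scale
  apply(f, 2, function(p)            # bernoulli deviance per tree count
    -2 * mean(y[!train] * p - log(1 + exp(p))))
})
best.n.trees <- which.min(rowMeans(err))
```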

Original comment by bobaron...@gmail.com on 24 Jan 2013 at 3:59

GoogleCodeExporter commented 9 years ago
Yes, I am suggesting sending "each fold to a separate core", or more
specifically, splitting the folds up using %dopar% (the number of workers need
not equal the number of folds). That is how I understand caret does it, and it
should help a lot.

Actually, I would use caret, except that caret seems to have a worse way of
calculating the optimum number of trees: it requires each evaluation point to
be given separately, so evaluating a grid of c(1:4000) could be slow(?)
compared to gbm.perf().
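
For reference, gbm's built-in CV scores the whole 1:n.trees grid in a single fit ('df' is a placeholder data frame with response y):

```r
library(gbm)

fit <- gbm(y ~ ., data = df, distribution = "bernoulli",
           n.trees = 4000, interaction.depth = 3,
           shrinkage = 0.01, cv.folds = 5)
best.n.trees <- gbm.perf(fit, method = "cv")  # minimizes CV deviance
```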

Right now I am doing 5-fold CV with 4000 trees on a data set that is about
200K x 200, and it takes about 12 hours while my 8-core machine runs at 12%
CPU (i.e., there are unused resources). If I had enough RAM to use all 8 CPUs,
I should be done in ~2 hours.

Original comment by ahz...@gmail.com on 24 Jan 2013 at 6:35

GoogleCodeExporter commented 9 years ago
I hear you. The caret approach is probably the only way to write your own CV
routine. Since gbm.fit() shuffles the records internally, one cannot use the
nTrain parameter and be certain of which rows are used for training and which
for validation. If it didn't shuffle, things would be much easier and faster:
you could combine the $valid.error vectors from the 10 runs without having to
iterate over the 4000 trees 5 times, as it would already be done. On my wish
list for gbm.fit() is a boolean parameter that would allow one to turn off
this shuffling; a sketch of the idea is below. I am not the most advanced
programmer, and perhaps others have a better approach. Good luck.
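
To be clear, the wish-list idea looks something like this; the shuffle argument does not exist, it is exactly what I am asking for:

```r
# 'shuffle = FALSE' is the hypothetical wish-list flag, NOT a real
# gbm.fit() argument. With rows pre-arranged so the holdout fold sits
# last, nTrain would mark the split, and fit$valid.error would already
# hold the holdout deviance at every tree count.
fit <- gbm.fit(X, y, distribution = "bernoulli", n.trees = 4000,
               nTrain = floor(0.9 * nrow(X)),
               shuffle = FALSE)
err.curve <- fit$valid.error
```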

Original comment by bobaron...@gmail.com on 24 Jan 2013 at 4:18

GoogleCodeExporter commented 9 years ago
[five deleted comments]
GoogleCodeExporter commented 9 years ago
Attached is an R script that computes a cross-validation model using gbm.fit()
and parallel processing via doParallel and the foreach/%dopar% construct.

I keep finding little things to improve, so I've reposted this code several
times; this looks like a fairly solid version. Feel free to check for errors.
It is working on my machine.
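
In outline (this is not the attached script itself, which was not preserved in the export; X and y are placeholders):

```r
library(doParallel)
library(gbm)

registerDoParallel(cores = detectCores())
k <- 10
folds <- sample(rep(1:k, length.out = nrow(X)))

# Each fold runs on its own worker; the columns of cv.err are the
# holdout deviance curves, one per fold.
cv.err <- foreach(i = 1:k, .combine = cbind, .packages = "gbm") %dopar% {
  train <- folds != i
  fit <- gbm.fit(X[train, ], y[train], distribution = "bernoulli",
                 n.trees = 4000, shrinkage = 0.01, verbose = FALSE)
  f <- predict(fit, X[!train, ], n.trees = 1:4000)
  apply(f, 2, function(p) -2 * mean(y[!train] * p - log(1 + exp(p))))
}
best.n.trees <- which.min(rowMeans(cv.err))
```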

Original comment by bobaron...@gmail.com on 25 Jan 2013 at 1:21

Attachments:

GoogleCodeExporter commented 9 years ago
I've started work on this. I want to use the parallel package because it comes 
with the base install. Take a look at the 'parallel' branch on the source tree 
if you're interested.

Original comment by harry.southworth on 27 Jan 2013 at 9:39

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
Please see version 2.0-9, downloadable from the project's home page.

This passes R CMD check on my Linux system and on Windows. I still need to edit 
some of the examples in the help files before considering submitting to CRAN. 
I'll also attempt to address some of the other issues raised before doing that.

Please test this out and let me know if you encounter any issues.

Harry

Original comment by harry.southworth on 28 Jan 2013 at 12:07

GoogleCodeExporter commented 9 years ago
I tried v2.0-9 on my home laptop (Win7 64-bit, 2 cores, RStudio) with the
robustReg example. It seems to recruit the cores just fine, and the plots
appear, but I do get an error:
Error in eval(expr, envir, enclos) : object 'rb8' not found
In addition: Warning messages:
1: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
2: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
3: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
4: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'

I don't think this comes from the parallel portion; I defer to Harry. On
Wednesday I will be back in the office, where I can test on a more robust
machine and also in the Revolution R environment.

Thanks for the improvements!

Bob

Original comment by bobaron...@gmail.com on 29 Jan 2013 at 2:08

GoogleCodeExporter commented 9 years ago
Hmm, that script is out of date and expects residuals to be available when 
they're not. I'll have a think about what to do with it.

Original comment by harry.southworth on 29 Jan 2013 at 11:07

GoogleCodeExporter commented 9 years ago
I used the code above for 0-1 classification with the bernoulli distribution
and it worked very well, but now I need to do a k-class classification
(factors 0 to 6), thus using the multinomial distribution.
I am not sure how to change the error function.

Original comment by plante.b...@gmail.com on 15 Apr 2013 at 9:18

GoogleCodeExporter commented 9 years ago
If your response is a factor, gbm should automatically decide that the
distribution is multinomial and tell you so. Do myresp <- factor(myresp) to be
sure. As for the error function, see the sketch below.
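
A sketch only (fit, X.hold, and y.hold are placeholders for the fitted multinomial model and your holdout data):

```r
# Multinomial holdout deviance. 'fit' is a gbm.fit object trained with
# distribution = "multinomial"; best.n.trees is the chosen tree count.
p <- predict(fit, X.hold, n.trees = best.n.trees, type = "response")
p <- p[, , 1]                  # n x n.classes matrix of class probabilities
idx <- cbind(seq_len(nrow(p)), match(as.character(y.hold), colnames(p)))
dev <- -2 * mean(log(p[idx]))  # -2 * mean log-likelihood of the true class
```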

Original comment by harry.southworth on 16 Apr 2013 at 7:24

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago

Original comment by harry.southworth on 26 Nov 2013 at 2:33