aloysius-lim / bigrf

Random forests for R for large data sets, optimized with parallel tree-growing and disk-based memory
91 stars 26 forks

sendMaster Error #15

Closed ss5211 closed 9 years ago

ss5211 commented 9 years ago

Hi Aloysius,

I have a big data set: more than 200k rows and about 50 columns. So I followed the instructions in the package manual (if I understand them correctly) and ran bigrfc for multinomial classification as follows:

```r
library(doParallel)
registerDoParallel(cores = 8)  # on the server, register 8 cores to grow trees
system.time(forest1 <- bigrfc(trainx.orgn.01, quantcut.trainy3.01, ntree = 100,
                              cachepath = "~/Documents/cache", trace = 1))
system.time(predict1 <- predict(forest1, testx.orgn.01, y = quantcut.testy3.01,
                                printerrfreq = 10, printclserr = TRUE,
                                cachepath = "~/Documents/cache", trace = 1))
```

By the way, 15 of the 50 predictors are categorical variables, each with fewer than 15 levels. trainx.orgn.01 holds 80% of the data and testx.orgn.01 the remaining 20%. I ran it on a server (which has more than 8 cores) and watched the modeling trace. After the 'running tree 100 on test examples' message, I got the following errors:

```
Error in sendMaster(try(lapply(X = S, FUN = FUN, ...), silent = TRUE)) :
  long vectors not supported yet: memory.c:3324
error calling combine function:
simpleError in treepredict.result$y : $ operator is invalid for atomic vectors
```

This message was repeated 8 times, since I ran it on 8 cores. Then:

```
Error in table(y, pred, dnn = c("Actual", "Predicted")) :
  all arguments must have the same length
Calls: system.time -> predict -> predict -> .local -> table
In addition: Warning message:
In mclapply(argsList, FUN, mc.preschedule = preschedule, mc.set.seed = set.seed,  :
  all scheduled cores encountered errors in user code
```

Do you have a clue why this happened? Does it mean that mclapply is used internally and cannot handle long vectors? I don't even know what a long vector is; does it mean my dependent variable is too large?

Thank you!

aloysius-lim commented 9 years ago

Hi, what is the output of R.version and Sys.info()?

ss5211 commented 9 years ago

Hi Aloysius,

Thanks for the quick reply. Here is the R and system info:

```
> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-unknown-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.6 (Santiago)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

> R.version
               _
platform       x86_64-unknown-linux-gnu
arch           x86_64
os             linux-gnu
system         x86_64, linux-gnu
status
major          3
minor          2.1
year           2015
month          06
day            18
svn rev        68531
language       R
version.string R version 3.2.1 (2015-06-18)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

> Sys.info()
                              sysname                              release
                              "Linux"          "2.6.32-504.1.3.el6.x86_64"
                              version                             nodename
"#1 SMP Fri Oct 31 11:37:10 EDT 2014"                                "***"
                              machine                                login
                             "x86_64"                                "***"
                                 user                       effective_user
                                "***"                                "***"
```
By the way, if I subsample 10% of the dataset, the error doesn't come up and it finishes very quickly. The server has 12 cores and 132 GB of RAM in total.

aloysius-lim commented 9 years ago

The error seems to be from sendMaster, which is used by R's multicore machinery to send results from child processes back to the master process. From this StackOverflow thread, it seems that the results of the child processes are too large for mclapply to send back to the master.
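For context, a "long vector" in R is one with more than 2^31 - 1 elements, and (as I understand it) a similar 2^31 - 1 byte limit applies to each serialized result that mclapply pipes back to the master, which is what the memory.c error refers to. A quick check of the threshold:

```r
# The long-vector threshold in R: vectors with more than this many
# elements are "long vectors", which some code paths (including the
# child-to-master result pipe in R 3.2.x) could not handle.
.Machine$integer.max  # 2^31 - 1
```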

Bigrf uses foreach, which, if you use the parallel backend (via doParallel), calls mclapply. Try using doMC instead, and see if switching to the multicore backend solves the problem.
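A minimal sketch of that switch (core count taken from your snippet; everything else stays the same):

```r
library(doMC)
registerDoMC(cores = 8)  # replaces registerDoParallel(cores = 8)
# The bigrfc() and predict() calls are unchanged; foreach simply
# dispatches to the newly registered multicore backend.
```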

ss5211 commented 9 years ago

Hey Aloysius,

As you said, bigrf uses foreach and so is hindered by this limitation of mclapply when running on a big dataset. doMC is also a parallel backend for foreach/%dopar% that uses the multicore functionality, and indeed re-running the code with registerDoMC() returns the same error message. The iterators package mentioned in the SO thread didn't help either. Do you know of any other packages that can run random forests in parallel without using foreach?

Thank you! Jackson

aloysius-lim commented 9 years ago

I'm afraid I do not know of any other random forest package that runs in parallel and does not use foreach.

ss5211 commented 9 years ago

Thank you for taking the time, Aloysius. I think for now my workaround is to train on a small subset and predict on small samples multiple times, while waiting for an update to mclapply that supports long vectors.
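That batching workaround could be sketched like this (a hypothetical helper, not part of bigrf; the default chunk_size is an assumed value meant to keep each returned result well below the long-vector limit):

```r
# Hypothetical helper: predict in chunks so that each result a worker
# returns stays small. Works with any model whose predict() returns a
# vector of predictions, one per row of newdata.
chunked_predict <- function(model, newdata, chunk_size = 20000, ...) {
  n <- nrow(newdata)
  groups <- split(seq_len(n), ceiling(seq_len(n) / chunk_size))
  unlist(lapply(groups, function(rows) {
    predict(model, newdata[rows, , drop = FALSE], ...)
  }), use.names = FALSE)
}
```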

Inventitech commented 7 years ago

Hi,

Sorry for being a bit late to this party. A proper fix (although it has performance implications) might be to set mc.preschedule=FALSE for the mclapply call. The problem is that there is no explicit call to mclapply in your own code, since it is invoked implicitly by foreach. You can, however, specify the options that the doMC package uses for its wrapped call to mclapply like so:

```r
mcoptions <- list(preschedule = FALSE)
foreach(..., .options.multicore = mcoptions)
```

This should make your code a bit slower, but it should also make it work without further modifications.
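For illustration, here is a self-contained toy run (task and core counts are arbitrary) showing where those options plug in. With preschedule = FALSE, mclapply forks one child per task and sends each result back individually, instead of one large batch per core:

```r
library(doMC)
library(foreach)
registerDoMC(cores = 2)

mcoptions <- list(preschedule = FALSE)
# Each of the 8 tasks runs in its own forked child; the results are
# combined with c() as they come back to the master.
res <- foreach(i = 1:8, .combine = c,
               .options.multicore = mcoptions) %dopar% i^2
```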