aloysius-lim / bigrf

Random forests for R for large data sets, optimized with parallel tree-growing and disk-based memory

Memory Leak when calling predict inside of RStudio #16

Open abelsonlive opened 9 years ago

abelsonlive commented 9 years ago

I've had a persistent bug when using predict on a bigrf model inside of RStudio. There seems to be a memory leak that leads RStudio to consume all of my machine's RAM, forcing me to shut down my computer. Curiously, this does not happen when I run my script from the command line using Rscript.

abelsonlive commented 9 years ago

[Screenshot: 2015-09-17 14:26:02]

aloysius-lim commented 9 years ago

Are you running predict in parallel? Can you share a code snippet?

abelsonlive commented 9 years ago

No, I'm not running it in parallel. It's a pretty straightforward implementation. It's hard to share the exact code snippet because it's been abstracted out into separate functions, but it's basically this:

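library(bigrf)

# Split iris 60/40 into training and test sets.
samp <- sample(1:nrow(iris), nrow(iris) * .6)
train <- iris[samp, ]
test <- iris[-samp, ]

# Grow a 10-tree classification forest on the four predictor columns.
m <- bigrfc(train,
            train$Species,
            ntree=10,
            varselect=1:4,
            trace=1)
# Predict on the held-out rows; this is the call that exhausts memory in RStudio.
p <- predict(m, test)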

The test set is ~2 GB and I'm running it on a machine with 16 GB of RAM.
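One way to narrow this down might be to compare how much the R heap grows during the predict call under RStudio versus Rscript. The following is only a rough sketch using base R's gc(); the helper name mem_delta_mb is hypothetical and not part of bigrf, and the workload shown is a stand-in for predict(m, test). Note that gc() only reports memory managed by R itself, so growth in memory allocated outside the R heap would not appear here and would have to be checked with the operating system's process monitor.

```r
# Hypothetical helper (not part of bigrf): report how much the R heap grows
# while evaluating an expression, using only base R's gc().
mem_delta_mb <- function(expr) {
  before <- sum(gc()[, 2])   # column 2 is memory in use (Mb) for Ncells and Vcells
  value <- force(expr)       # evaluate the lazily passed expression
  after <- sum(gc()[, 2])    # measure again while the result is still referenced
  list(value = value, delta_mb = after - before)
}

# Stand-in workload; in the thread this would be predict(m, test).
res <- mem_delta_mb(numeric(5e6))
cat(sprintf("R heap grew by about %.1f Mb\n", res$delta_mb))
```

Running the same script in both environments and comparing the reported growth with the process size shown by the operating system could indicate whether the extra memory is held by R objects or allocated elsewhere.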

abelsonlive commented 9 years ago

You can see the repository here: https://github.com/enigma-io/smoke-alarm-risk. The functions in question are here: https://github.com/enigma-io/smoke-alarm-risk/blob/master/rscripts/model.R

abiyug commented 8 years ago

How many CPU cores does your computer have? How long does it take to train on the 2 GB data?

ajnisbet commented 8 years ago

I have the same issue, with basically the same code as @abelsonlive.

The dataset is 300 MB, on a 4-core machine with 6 GB of RAM, using 50 trees. It occurs with and without parallel processing.

Using R 3.2.3 on Fedora 23.