JuliaAI / DecisionTree.jl

Julia implementation of Decision Tree (CART) and Random Forest algorithms

Parallel Random Forest #2

Open celestrist opened 10 years ago

celestrist commented 10 years ago

I noticed that the random forest classifier is intended to build trees in parallel. However, we must manually add Julia processes, either by invoking the -p option or by calling the addprocs function. I am wondering if the classifier could add processes automatically, perhaps via an option indicating how many processes the user wants to use, with a default value of 1.

Caleb

bensadeghi commented 10 years ago

I'm not sure if that is currently possible. Upon adding new processes, the package functions (i.e., build_tree) need to be sent over to them. So something like this raises an exception on worker 2 (ERROR: build_tree not defined):

function build_forest(labels, features, nsubfeatures, ntrees, ncpu=1)
    if ncpu > nprocs()
        addprocs(ncpu - nprocs())
    end
    Nlabels = length(labels)
    Nsamples = int(0.7 * Nlabels)
    forest = @parallel (vcat) for i in [1:ntrees]
        inds = rand(1:Nlabels, Nsamples)
        build_tree(labels[inds], features[inds,:], nsubfeatures)
    end
    return [forest]
end

Also, running addprocs(3) before using DecisionTree works, but not if the order of the two commands is reversed. Let me do a bit of digging for a solution or work-around.
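For reference, the usual workaround for that ordering problem (a sketch, assuming the package is already installed) is to add the workers first and then load the package on all of them with @everywhere, which ships the definitions to every process:

```julia
using Distributed   # needed on Julia 1.0+; on older Julia, addprocs is in Base

addprocs(3)                      # add workers *before* loading the package
@everywhere using DecisionTree   # load DecisionTree on every process

# build_forest's internal parallel loop can now find build_tree on all workers
```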

colbec commented 7 years ago

@bensadeghi I attempted to parallelize Random Forest on a moderately large dataset (2 million observations and 6 features) and found that the parallelized version, launched with "julia -p 3 test.jl", in fact ran slower than with no -p parameter. Without parallelization the RF ran in 18 minutes; with -p 3 it ran in 25 minutes. Since testing takes so long, it is hard to know whether this was just an abnormal case. This was on Julia 0.6; I will try to test on Julia 0.4 and see if I get the same.

Edit: Julia 0.4 is fine; both RF and CV run to completion. Times were a good improvement over no parallelization for RF (11 minutes compared to 18), but not so dramatic for CV (22 minutes vs. 25).

I was using the format from the docs: an RF run followed by a 3-fold cross validation. With no parallelization the CV ran fine in 32 minutes, but in parallel, again with -p 3, memory was quickly exhausted, the machine went into swap, and I attempted to force quit with Ctrl-C. This returned me to the command prompt, but multiple Julia processes kept running and had to be force-killed. No other major programmes demanding resources were running at the same time.

In https://github.com/JuliaLang/julia/issues/6631 there was a discussion of processes that keep running, which resulted in a fix. So I guess one question would be how to cleanly quit from the bash CLI, as opposed to sending an interrupt() from the Julia REPL, which is not available here.
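In the meantime, a blunt workaround from the shell (a sketch; the pattern to match depends on how the job was launched) is to find and kill the stray worker processes by name:

```shell
# list surviving Julia processes to confirm what is still running
pgrep -af julia

# terminate the job's processes; add -9 only if a plain TERM is ignored
pkill -f "julia -p 3 test.jl"
```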

Using 4 cores and 16 GB ram.

julia> versioninfo()
Julia Version 0.6.0-dev.252
Commit f5418ac* (2016-08-17 04:16 UTC)
Platform Info:
  System: Linux (x86_64-suse-linux)
  CPU: Intel(R) Core(TM) i5-4460  CPU @ 3.20GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)

ChicagoStats commented 7 years ago

I've been looking for an example of distributing a Random Forest run over a reasonably large dataset (120 million observations). Can you share your code -- and save me a few long nights? Thanks!

nsenno commented 6 years ago

Is there a simple way to turn off the @parallel when building a random forest model (line 243 of classification.jl)? I am attempting to run a parameter grid search in parallel. Because my dataset is relatively small, the parallelization of the forest creation is adding overhead.
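One possible workaround (a sketch; build_forest_serial is a hypothetical helper, and the bootstrap sampling mirrors the snippet earlier in this thread rather than the package's exact internals) is to bypass build_forest and assemble the trees with an ordinary serial loop, so the only parallelism left is the outer grid search:

```julia
using DecisionTree

# Serial forest construction: bootstrap-sample and call build_tree directly,
# avoiding the package's @parallel loop over trees.
function build_forest_serial(labels, features, nsubfeatures, ntrees)
    nlabels  = length(labels)
    nsamples = round(Int, 0.7 * nlabels)
    trees = map(1:ntrees) do _
        inds = rand(1:nlabels, nsamples)          # bootstrap sample with replacement
        build_tree(labels[inds], features[inds, :], nsubfeatures)
    end
    return trees
end

# The outer grid search can then be distributed with pmap, each worker
# building its forest serially, e.g.:
# results = pmap(p -> build_forest_serial(y, X, p.nsubfeatures, p.ntrees), grid)
```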