Open celestrist opened 10 years ago
I'm not sure if that is currently possible. Upon adding new processors, the package functions (i.e. build_tree) need to be sent over to them. So something like this fails with "exception on 2: ERROR: build_tree not defined":
function build_forest(labels, features, nsubfeatures, ntrees, ncpu=1)
    if ncpu > nprocs()
        addprocs(ncpu - nprocs())
    end
    Nlabels = length(labels)
    Nsamples = int(0.7 * Nlabels)
    forest = @parallel (vcat) for i in [1:ntrees]
        inds = rand(1:Nlabels, Nsamples)
        build_tree(labels[inds], features[inds,:], nsubfeatures)
    end
    return [forest]
end
Also, running addprocs(3) before using DecisionTree works, but not if the order of the two commands is reversed. Let me do a bit of digging for a solution or work-around.
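The ordering issue described above can be sketched without the package itself. This is a minimal illustration in current Julia syntax (where `addprocs` lives in the `Distributed` standard library; in the Julia versions used in this thread it was part of Base). `toy_build_tree` is a hypothetical stand-in for a package function such as build_tree, not part of DecisionTree. The point is only that workers must exist before the definition is broadcast to them:

```julia
using Distributed

# Start the workers first, so that the @everywhere below reaches them.
if nprocs() == 1
    addprocs(2)
end

# Hypothetical stand-in for a package function such as build_tree.
# For a real package, the equivalent step would be `@everywhere using DecisionTree`.
@everywhere function toy_build_tree(labels)
    return (n = length(labels), worker = myid())
end

# Each call runs on whichever worker is free; toy_build_tree is defined there.
results = pmap(_ -> toy_build_tree(rand(1:2, 10)), 1:4)
```

If the `@everywhere` definition ran before `addprocs`, the newly added workers would not have it, which matches the "not defined" error reported on worker 2.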
@bensadeghi I attempted to parallelize Random Forest on a moderately large dataset (2 million observations and 6 features) and found that the parallelized version launched with "julia -p 3 test.jl" actually ran slower than with no -p parameter. Without parallelization the RF ran in 18 minutes, and with -p 3 it ran in 25 minutes. Since testing takes so long, it is hard to know whether this was just an abnormal case. Using Julia 0.6; I will try to test on Julia 0.4 and see if I get the same.
Edit: Julia 0.4 is fine, both RF and CV run completely. Times were a good improvement over no parallelization for RF (11 minutes compared to 18), but not so dramatic for CV (22 minutes vs. 25).
I was using the doc format of running a RF followed by a 3-fold cross validation. With no parallelization the CV ran fine in 32 minutes, but in parallel, again with -p 3, memory was quickly exhausted and the machine went into swap, so I attempted to force quit with Ctrl-C. This returned me to the command prompt, but multiple Julia processes kept running and had to be force killed. No other major programmes were running at the same time demanding resources.
In https://github.com/JuliaLang/julia/issues/6631 there was a discussion of processes that kept running, which resulted in a fix. So I guess one question would be how to quit cleanly from the bash CLI, where it is not possible to send an "interrupt()" as one can from the Julia CLI.
Using 4 cores and 16 GB RAM.
julia> versioninfo()
Julia Version 0.6.0-dev.252
Commit f5418ac* (2016-08-17 04:16 UTC)
Platform Info:
  System: Linux (x86_64-suse-linux)
  CPU: Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
I've been looking for an example of distributing a Random Forest run with a reasonably large dataset (120 million observations). Can you share your code and save me a few long nights? Thanks,
Is there a simple way to turn off the @parallel for building a random forest model (line 243 of classification.jl)? I am attempting to do a parameter grid search in parallel. Because my dataset is relatively small, the parallelization of the forest creation is increasing the overhead.
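One possible shape for such a switch, as a rough sketch only: the `parallel` keyword, `build_forest_sketch`, and `toy_tree` below are all hypothetical names for illustration and are not part of DecisionTree's API. The idea is to pick between a plain `map` and a distributed `pmap` over the same per-tree closure, so that an outer grid search can own the parallelism.

```julia
using Distributed

# Hypothetical stand-in for the package's build_tree, defined on every process.
# It just returns the fraction of label 1 in the bootstrap sample.
@everywhere toy_tree(labels, features) = sum(labels .== 1) / length(labels)

function build_forest_sketch(labels, features, ntrees; parallel::Bool=false)
    nlabels = length(labels)
    nsamples = round(Int, 0.7 * nlabels)
    # Grow one "tree" on a bootstrap sample of 70% of the rows.
    function grow(_)
        inds = rand(1:nlabels, nsamples)
        return toy_tree(labels[inds], features[inds, :])
    end
    # Serial map avoids shipping the data to workers; pmap distributes the trees.
    return parallel ? pmap(grow, 1:ntrees) : map(grow, 1:ntrees)
end

labels = rand(1:2, 100)
features = rand(100, 3)
forest = build_forest_sketch(labels, features, 5)   # serial by default
```

With `parallel=false` nothing is serialized to other processes, which is the overhead a small dataset would want to avoid.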
I noticed that the random forest classifier is intended to build trees in parallel. However, we must manually add Julia processes, either by invoking the -p option or by using the addprocs function. I am wondering if the classifier could add processes automatically, perhaps through an option indicating how many processes the user wants to use, with a default value of 1.
Caleb