azmfaridee / mothur

This is a GSoC 2012 fork of 'Mothur'. We are trying to implement a number of 'Feature Selection' algorithms for microbial ecology data and incorporate them into mothur's main codebase.
https://github.com/mothur/mothur
GNU General Public License v3.0

Investigate The Possible Places of Parallelization in the Random Forest Classifier #8

Open azmfaridee opened 12 years ago

azmfaridee commented 12 years ago

Parent Issue: #7

The obvious place to apply parallelization in the new Random Forest module is the tree creation process. If we have N trees and T training cases, random forest creates these N trees by random sampling with replacement, each time drawing t cases from T where t < T. Each tree is created independently of the others, so the process can be parallelized with ease. As per @mothur-westcott's recommendation, there should be fork(), CreateThread(), and MPI implementations of this. The fork() and CreateThread() versions would be most helpful for mothur's current user base.
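To illustrate the idea, here is a minimal sketch of per-tree bootstrap sampling and thread-based parallel tree construction. This is not mothur's actual code: `bootstrapSample`, `buildForest`, and the use of `std::thread` (rather than fork()/CreateThread()) are all hypothetical stand-ins, and "training" a tree is stubbed out.

```cpp
#include <random>
#include <thread>
#include <vector>

// Draw t indices from [0, T) with replacement -- one bootstrap
// sample per tree. Seeding per tree keeps runs reproducible.
std::vector<int> bootstrapSample(int T, int t, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> pick(0, T - 1);
    std::vector<int> sample(t);
    for (int& idx : sample) idx = pick(rng);
    return sample;
}

// Build N trees in parallel: each thread draws its own bootstrap
// sample and "trains" a tree (stubbed here as recording the sample
// size). Because the trees are independent, no locking is needed
// beyond each thread writing to its own slot.
std::vector<int> buildForest(int N, int T, int t) {
    std::vector<int> treeSizes(N, 0);
    std::vector<std::thread> workers;
    for (int i = 0; i < N; ++i) {
        workers.emplace_back([&treeSizes, i, T, t] {
            std::vector<int> sample = bootstrapSample(T, t, /*seed=*/i);
            treeSizes[i] = static_cast<int>(sample.size()); // stand-in for tree training
        });
    }
    for (std::thread& w : workers) w.join(); // wait for all trees to finish
    return treeSizes;
}
```

The same structure maps directly onto fork() (one child process per tree, or per batch of trees) or MPI (one rank per batch of trees), since no communication is needed until all trees are built.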

Another place to apply parallelization is the reading of input/shortcut files.

The relevant portion of @mothur-westcott's mail follows:

The classify.seqs command uses MPI in several places. In creating and writing the shortcut files, we only allow the main process to do this. Everyone reads the shortcut file, but only the main process gets the positions in the file and then sends those to the child processes (bayesian.cpp). In the main command (classifyseqscommand.cpp, line 532), the fasta file is divided and each process does a piece. This parallelization takes advantage of MPI's ability to jointly write to files.
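The "fasta file is divided and each process does a piece" step boils down to partitioning the sequences into near-equal contiguous chunks, one per rank. A sketch of that partitioning arithmetic is below; `divideSeqs` is a hypothetical name, not the function mothur uses, and the MPI send/receive of the resulting positions is omitted.

```cpp
#include <utility>
#include <vector>

// Divide numSeqs sequences among numProcs processes as evenly as
// possible; returns a (start, count) pair per process. The first
// (numSeqs % numProcs) processes each get one extra sequence, so
// the chunks are contiguous and cover every sequence exactly once.
std::vector<std::pair<long, long>> divideSeqs(long numSeqs, int numProcs) {
    std::vector<std::pair<long, long>> pieces;
    long base = numSeqs / numProcs;
    long extra = numSeqs % numProcs;
    long start = 0;
    for (int p = 0; p < numProcs; ++p) {
        long count = base + (p < extra ? 1 : 0);
        pieces.push_back({start, count});
        start += count; // next chunk begins where this one ends
    }
    return pieces;
}
```

In the MPI version, the main process would compute these (start, count) pairs once, send each pair to the corresponding child rank, and every rank would then seek to its start offset and process its count sequences, writing results jointly.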

azmfaridee commented 12 years ago

@mothur-westcott wrote in Issue #9:

This looks like a good start to the design. Thinking about the issue of parallelization, where do you see that coming in? Maybe in several places? Without knowing the complexity and time required for each of the tasks, it's hard to see where it would be most valuable. Possible places include creating the trees, calculating the error rates, calculating individual attribute importances, and possibly the bootstrapping? What are your thoughts?

You are right, these places would be of the most interest. In the beginning, what I'd like to do is identify the independent tasks where we can implement parallelization without worrying too much about deadlock/dependency issues. A single barrier mechanism would be all we would need in these cases: when all the child threads/processes reach that point, the parent thread/process would continue execution. This type of mechanism would be simpler to implement.
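The single-barrier pattern described above can be sketched as follows. This is an illustrative stand-alone example, not mothur code: joining every child thread serves as the one barrier, so no locks or condition variables are needed as long as the child tasks are fully independent.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Parent launches nWorkers child threads, then blocks until every
// one of them has finished -- the join loop is the single barrier.
// The parent resumes execution only after the barrier is passed.
int runWithBarrier(int nWorkers) {
    std::atomic<int> completed(0);
    std::vector<std::thread> children;
    for (int i = 0; i < nWorkers; ++i)
        children.emplace_back([&completed] {
            completed.fetch_add(1); // stand-in for an independent task
        });
    for (std::thread& c : children) c.join(); // barrier: parent waits here
    return completed.load(); // parent continues; all children are done
}
```

A fork()-based version would be structurally identical, with waitpid() on each child process playing the role of join().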

Later, if we can spare enough time and realize that some complex parts beg parallelization for the overall speedup of the algorithm, without which total performance would suffer a lot, then we can try implementing parallelization in those parts as well.

Btw, please do follow Issue #3; there is a good chance that we might have to forget about Random Forest altogether and think about an alternative algorithm. More details are on that issue's page.