Open azmfaridee opened 12 years ago
@mothur-westcott wrote in Issue #9:
This looks like a good start to the design. Thinking about the issue of paralellization, where do you see that coming in? Maybe in several places? Without knowing the complexity, and time required for each of the tasks it's hard to see where it would be most valuable. Possible places include for creating the trees, calculating the error rates, calculating individual attributes importance and possibly the bootstrapping? What are your thoughts?
You are right, these places would of the most interest. In the beginning, what I'd like to do is identify the independent tasks where we can implement the parallelization without worrying too much about deadlock/dependency issues. A single barrier mechanism would all we would need in these cases when all the child thread/processes reach that point, the parent thread/process would continue execution. This type of mechanism would a simpler to implement.
Later if we can spare enough time and in case also realize that some complex parts begs parallelization for the overall speedup of the algorithm, without which total performance would suffer a lot, then we can try implementing parallelization in those complex parts.
Btw, please do follow Issue #3, there might be a good chance that we might have to forget about Random Forest altogether and think about another alternative algorithm. More details are on that Issue's page.
Parent Issue: #7
The obvious place to apply parallelization for the new Random Forest module is the tree creation process. If we have N number of trees and T training sets, random forest creates these N trees by Random Sampling by Replacement, each time taking t cases from T where t < T. Creating these trees are independent from each other, so can be parallelized with ease. As per @mothur-westcott's recommendation, there should be
fork()
,CreateThread()
, andMPI
implementation of this. Thefork()
andCreateThread()
would be most helpful for mothur's current user base.Another place to apply parallelization is the reading of input/shortcut files.
References of @mothur-westcott's mail follows: