Closed azmfaridee closed 11 years ago
This looks like a good start to the design. Thinking about the issue of paralellization, where do you see that coming in? Maybe in several places? Without knowing the complexity, and time required for each of the tasks it's hard to see where it would be most valuable. Possible places include for creating the trees, calculating the error rates, calculating individual attributes importance and possibly the bootstrapping? What are your thoughts?
@mothur-westcott : I've created a separate Issue #8 to address this, I'd be adding my thoughts there :)
Here we want to discuss about the code-level details of the Random Forest implementation. I'm copy-pasting our initial code schematics from the application. In the upcoming days, we'd try to elaborate each functions with discussions and pseudocode.
There will be three basic classes that will encompass most the code. They are
RandomForest
- The outer interface for outside, mothur will have access to this class only, this class will useDecisionTree
class for the computation associated with the algorithmDecisionTree
- The core and heart of the algorithm, this class is the heart of everythingDecisionTreeNode
- A helper class/data structure used by DecisionTree class.Of course additional helper classes will also be needed as we progress through our project schedule. As well as other linking classes that will be used together with mothur. They will be created as necessary. Major Functions will be:
calculateTreeErrorRate()
- Will calculate the error rate of a random treecalculateAttributeImportance()
- Will calculate the importance of a particular attributerandomlyShuffleAttribute()
- Will be used bycalculateAttributeImportance()
, the idea is to shuffle a particular attribute and check of that kind of perturbation has any effect on the error/classification, if so, that attribute is valuablecreateTree()
- It will create the decision tree, one of the main entry points of the codepartitionRecursively()
- The main function for the tree split, will do all the work by recursion. This will be the main tree creation algorithm and it will take a while to implement effectivelygetHighestCountCategory()
- Given the data matrix the function will find the most popular category - will be used bypartitionRecursively()
functioncalculateEntropy()
- Given a probability mass function, it will calculate the entropy. We’ll need to R&D and select a value for this.createBootStrappedSample()
- Create a bootstrapped sample of a input dataselectAttributesToInclude()
- Of the total N attributes, it will select the P attributes at random, the value for this random selection is from Breiman’s paper, will be used bypartitionRecursively()
functiongetCategoryProbability()
- Given a data matrix, it will return a probability mass function representing the frequencies of categoriesupdateOutOfBagEstimate()
- Will update the error map by recording a class prediction for a given data recordcalculateForestWideAttributeImportance()
- This will calculate the forest-wide importance levels for all attributes.calculateForestWideErrorRate()
- This will calculate the forest-wide error rate