Create Code Schematics/Pseudocode for the Random Forest Implementation

azmfaridee commented 12 years ago

Here we want to discuss about the code-level details of the Random Forest implementation. I'm copy-pasting our initial code schematics from the application. In the upcoming days, we'd try to elaborate each functions with discussions and pseudocode.

There will be three basic classes that will encompass most the code. They are

RandomForest - The outer interface for outside, mothur will have access to this class only, this class will use DecisionTree class for the computation associated with the algorithm
DecisionTree - The core and heart of the algorithm, this class is the heart of everything
DecisionTreeNode - A helper class/data structure used by DecisionTree class.

Of course additional helper classes will also be needed as we progress through our project schedule. As well as other linking classes that will be used together with mothur. They will be created as necessary. Major Functions will be:

calculateTreeErrorRate() - Will calculate the error rate of a random tree
calculateAttributeImportance() - Will calculate the importance of a particular attribute
randomlyShuffleAttribute() - Will be used by calculateAttributeImportance(), the idea is to shuffle a particular attribute and check of that kind of perturbation has any effect on the error/classification, if so, that attribute is valuable
createTree() - It will create the decision tree, one of the main entry points of the code
partitionRecursively() - The main function for the tree split, will do all the work by recursion. This will be the main tree creation algorithm and it will take a while to implement effectively
getHighestCountCategory() - Given the data matrix the function will find the most popular category - will be used by partitionRecursively() function
calculateEntropy() - Given a probability mass function, it will calculate the entropy. We’ll need to R&D and select a value for this.
createBootStrappedSample() - Create a bootstrapped sample of a input data
selectAttributesToInclude() - Of the total N attributes, it will select the P attributes at random, the value for this random selection is from Breiman’s paper, will be used by partitionRecursively() function
getCategoryProbability() - Given a data matrix, it will return a probability mass function representing the frequencies of categories
updateOutOfBagEstimate() - Will update the error map by recording a class prediction for a given data record
calculateForestWideAttributeImportance() - This will calculate the forest-wide importance levels for all attributes.
calculateForestWideErrorRate() - This will calculate the forest-wide error rate

mothur-westcott commented 12 years ago

This looks like a good start to the design. Thinking about the issue of paralellization, where do you see that coming in? Maybe in several places? Without knowing the complexity, and time required for each of the tasks it's hard to see where it would be most valuable. Possible places include for creating the trees, calculating the error rates, calculating individual attributes importance and possibly the bootstrapping? What are your thoughts?

azmfaridee commented 12 years ago

@mothur-westcott : I've created a separate Issue #8 to address this, I'd be adding my thoughts there :)

azmfaridee / mothur

Create Code Schematics/Pseudocode for the Random Forest Implementation #9