Week 7: Finish the Implementation of Random Forest Algorithm

azmfaridee commented 12 years ago

Related Issues: #3, #14, #15

As the per the initial proposal, code the remaining part of the Random Forest algorithm and finish the implementations. Most of this weeks task week be wrapping up the works that was done in Week 5 and Week 6. Since we are now able to create decision trees from the bootstrapped data, it's now time to aggregate the results and build the ensemble classifier feature selector. Specific functions of interests are:

calculateTreeErrorRate()
calculateAttributeImportance()
randomlyShuffleAttribute()
updateOutOfBagEstimate()
calculateForestWideAttributeImportance()
calculateForestWideErrorRate()

End of Week Deliverable

A finished code for the feature selection method that can work with microbial ecology data.

azmfaridee commented 12 years ago

Right now the most pressing issue is the handing of the sparse data that we have, a lot of features are zero filled, while others are just same data repeated over. This features need to be avoided, but this might also hurt the decision tree. I'm currently investigating ways to find the tradeoff between them

azmfaridee commented 12 years ago

Here are two sample tree generated for inpatient.final.an.0.03.subsample.avg.matrix file. The file contains 187 training samples and 845 features (OTUs).

Tree 1

root [ gen: 0 ] ( 536 < X2 )
     leftChild [ gen: 1 ] ( 42 < X22 )
          leftChild [ gen: 2 ] ( 9 < X31 )
               leftChild [ gen: 3 ] ( 2 < X36 )
                    leftChild [ gen: 4 ] ( 8 < X33 )
                         leftChild [ gen: 5 ] ( 3 < X37 )
                              leftChild [ gen: 6 ] ( 1 < X2 )
                                   leftChild [ gen: 7 ] ( classified to: 0, samples: 13 )
                                   rightChild [ gen: 7 ] ( 131 < X7 )
                                        leftChild [ gen: 8 ] ( 2 < X24 )
                                             leftChild [ gen: 9 ] ( 1 < X3 )
                                                  leftChild [ gen: 10 ] ( classified to: 0, samples: 2 )
                                                  rightChild [ gen: 10 ] ( 1056 < X1 )
                                                       leftChild [ gen: 11 ] ( classified to: 1, samples: 18 )
                                                       rightChild [ gen: 11 ] ( classified to: 0, samples: 1 )
                                             rightChild [ gen: 9 ] ( 302 < X1 )
                                                  leftChild [ gen: 10 ] ( classified to: 1, samples: 3 )
                                                  rightChild [ gen: 10 ] ( classified to: 0, samples: 6 )
                                        rightChild [ gen: 8 ] ( classified to: 0, samples: 6 )
                              rightChild [ gen: 6 ] ( classified to: 1, samples: 10 )
                         rightChild [ gen: 5 ] ( classified to: 1, samples: 14 )
                    rightChild [ gen: 4 ] ( 61 < X27 )
                         leftChild [ gen: 5 ] ( 997 < X7 )
                              leftChild [ gen: 6 ] ( classified to: 0, samples: 25 )
                              rightChild [ gen: 6 ] ( classified to: 1, samples: 1 )
                         rightChild [ gen: 5 ] ( classified to: 1, samples: 3 )
               rightChild [ gen: 3 ] ( 27 < X31 )
                    leftChild [ gen: 4 ] ( classified to: 1, samples: 21 )
                    rightChild [ gen: 4 ] ( 9 < X22 )
                         leftChild [ gen: 5 ] ( 52 < X34 )
                              leftChild [ gen: 6 ] ( classified to: 1, samples: 8 )
                              rightChild [ gen: 6 ] ( classified to: 0, samples: 2 )
                         rightChild [ gen: 5 ] ( classified to: 0, samples: 3 )
          rightChild [ gen: 2 ] ( 1 < X21 )
               leftChild [ gen: 3 ] ( classified to: 1, samples: 4 )
               rightChild [ gen: 3 ] ( classified to: 0, samples: 29 )
     rightChild [ gen: 1 ] ( classified to: 0, samples: 18 )

Tree 2

root [ gen: 0 ] ( 492 < X2 )
     leftChild [ gen: 1 ] ( 1 < X22 )
          leftChild [ gen: 2 ] ( 10 < X3 )
               leftChild [ gen: 3 ] ( 249 < X2 )
                    leftChild [ gen: 4 ] ( 43 < X27 )
                         leftChild [ gen: 5 ] ( classified to: 1, samples: 31 )
                         rightChild [ gen: 5 ] ( 676 < X1 )
                              leftChild [ gen: 6 ] ( classified to: 1, samples: 4 )
                              rightChild [ gen: 6 ] ( classified to: 0, samples: 1 )
                    rightChild [ gen: 4 ] ( 286 < X2 )
                         leftChild [ gen: 5 ] ( classified to: 0, samples: 2 )
                         rightChild [ gen: 5 ] ( classified to: 1, samples: 4 )
               rightChild [ gen: 3 ] ( 2 < X2 )
                    leftChild [ gen: 4 ] ( classified to: 0, samples: 9 )
                    rightChild [ gen: 4 ] ( 2 < X33 )
                         leftChild [ gen: 5 ] ( 583 < X1 )
                              leftChild [ gen: 6 ] ( classified to: 0, samples: 4 )
                              rightChild [ gen: 6 ] ( 852 < X1 )
                                   leftChild [ gen: 7 ] ( classified to: 1, samples: 4 )
                                   rightChild [ gen: 7 ] ( classified to: 0, samples: 1 )
                         rightChild [ gen: 5 ] ( classified to: 1, samples: 10 )
          rightChild [ gen: 2 ] ( 223 < X7 )
               leftChild [ gen: 3 ] ( 13 < X33 )
                    leftChild [ gen: 4 ] ( 1 < X7 )
                         leftChild [ gen: 5 ] ( 797 < X1 )
                              leftChild [ gen: 6 ] ( classified to: 1, samples: 7 )
                              rightChild [ gen: 6 ] ( classified to: 0, samples: 1 )
                         rightChild [ gen: 5 ] ( 23 < X22 )
                              leftChild [ gen: 6 ] ( 3 < X33 )
                                   leftChild [ gen: 7 ] ( 1 < X24 )
                                        leftChild [ gen: 8 ] ( classified to: 0, samples: 6 )
                                        rightChild [ gen: 8 ] ( 105 < X24 )
                                             leftChild [ gen: 9 ] ( 117 < X3 )
                                                  leftChild [ gen: 10 ] ( classified to: 1, samples: 8 )
                                                  rightChild [ gen: 10 ] ( classified to: 0, samples: 1 )
                                             rightChild [ gen: 9 ] ( classified to: 0, samples: 4 )
                                   rightChild [ gen: 7 ] ( 657 < X1 )
                                        leftChild [ gen: 8 ] ( classified to: 0, samples: 11 )
                                        rightChild [ gen: 8 ] ( classified to: 1, samples: 1 )
                              rightChild [ gen: 6 ] ( classified to: 0, samples: 31 )
                    rightChild [ gen: 4 ] ( 196 < X2 )
                         leftChild [ gen: 5 ] ( classified to: 1, samples: 7 )
                         rightChild [ gen: 5 ] ( 243 < X1 )
                              leftChild [ gen: 6 ] ( classified to: 0, samples: 1 )
                              rightChild [ gen: 6 ] ( classified to: 1, samples: 1 )
               rightChild [ gen: 3 ] ( 141 < X21 )
                    leftChild [ gen: 4 ] ( classified to: 1, samples: 12 )
                    rightChild [ gen: 4 ] ( 1 < X3 )
                         leftChild [ gen: 5 ] ( classified to: 0, samples: 1 )
                         rightChild [ gen: 5 ] ( classified to: 1, samples: 1 )
     rightChild [ gen: 1 ] ( 17 < X27 )
          leftChild [ gen: 2 ] ( 963 < X2 )
               leftChild [ gen: 3 ] ( classified to: 0, samples: 11 )
               rightChild [ gen: 3 ] ( 976 < X2 )
                    leftChild [ gen: 4 ] ( classified to: 1, samples: 2 )
                    rightChild [ gen: 4 ] ( classified to: 0, samples: 10 )
          rightChild [ gen: 2 ] ( classified to: 1, samples: 1 )

I'm trying to verify the validity of these generated trees.

azmfaridee commented 12 years ago

This calculation is wrong, the entropy values that it is calculating is wrong, as none of these split provides total classification, yet one of the split entropies is listed at 0. Need to fix this.

tryIndex 36
getMinEntropyOfFeature()
featureOutputPair [[3, 0], [3, 1], [3, 0], [3, 1], [3, 1], [3, 1], [3, 1], [3, 0], [3, 0], [4, 0], [4, 0], [4, 0], [4, 0], [4, 0], [4, 0], [5, 0], [5, 0], [6, 0], [6, 0], [7, 0], [9, 0], [9, 0], [11, 0], [13, 0], [13, 0], [13, 0], [13, 0], [45, 0], [45, 0]]
splitPoints: [1, 2, 3, 7]
getBestSplitAndMinEntropy()
upperEntropyOfSplit: 0.0 lowerEntropyOfSplit: 0.676941869781
numLessThanValueAtSplitPoint: 0 numGreaterThanValueAtSplitPoint: 29
upperEntropyOfSplit: 1.0 lowerEntropyOfSplit: 0.605186576633
numLessThanValueAtSplitPoint: 0 numGreaterThanValueAtSplitPoint: 29
upperEntropyOfSplit: 0.918295834054 lowerEntropyOfSplit: 0.619382194679
numLessThanValueAtSplitPoint: 0 numGreaterThanValueAtSplitPoint: 29
upperEntropyOfSplit: 0.863120568567 lowerEntropyOfSplit: 0.0
numLessThanValueAtSplitPoint: 0 numGreaterThanValueAtSplitPoint: 29
entropies: [0.676941869780886, 0.6051865766334206, 0.6193821946787638, 0.0]

azmfaridee commented 12 years ago

This calculation is wrong, the entropy values that it is calculating is wrong, as none of these split provides total classification, yet one of the split entropies is listed at 0. Need to fix this.

tryIndex 36
getMinEntropyOfFeature()
featureOutputPair [[3, 0], [3, 1], [3, 0], [3, 1], [3, 1], [3, 1], [3, 1], [3, 0], [3, 0], [4, 0], [4, 0], [4, 0], [4, 0], [4, 0], [4, 0], [5, 0], [5, 0], [6, 0], [6, 0], [7, 0], [9, 0], [9, 0], [11, 0], [13, 0], [13, 0], [13, 0], [13, 0], [45, 0], [45, 0]]
splitPoints: [1, 2, 3, 7]
getBestSplitAndMinEntropy()
upperEntropyOfSplit: 0.0 lowerEntropyOfSplit: 0.676941869781
numLessThanValueAtSplitPoint: 0 numGreaterThanValueAtSplitPoint: 29
upperEntropyOfSplit: 1.0 lowerEntropyOfSplit: 0.605186576633
numLessThanValueAtSplitPoint: 0 numGreaterThanValueAtSplitPoint: 29
upperEntropyOfSplit: 0.918295834054 lowerEntropyOfSplit: 0.619382194679
numLessThanValueAtSplitPoint: 0 numGreaterThanValueAtSplitPoint: 29
upperEntropyOfSplit: 0.863120568567 lowerEntropyOfSplit: 0.0
numLessThanValueAtSplitPoint: 0 numGreaterThanValueAtSplitPoint: 29
entropies: [0.676941869780886, 0.6051865766334206, 0.6193821946787638, 0.0]

The implementation is wrong, because the split points are calculated prior and these entropies are calculated at that split point, but the value of split point is calculated later, and if some of the values are same that belong to different split points, then the algorithm generates wrong answer. The implementation is wrong, because similar values would belong to a single split point in the first place.

azmfaridee commented 12 years ago

The implementation is wrong, because the split points are calculated prior and these entropies are calculated at that split point, but the value of split point is calculated later, and if some of the values are same that belong to different split points, then the algorithm generates wrong answer. The implementation is wrong, because similar values would belong to a single split point in the first place.

This issue has been solved in the last commit.

azmfaridee commented 12 years ago

Here is an sample tree generated from the updated code that has an error rate of 26.15%. i.e. it can successfully classify roughly 74% of the cases.

root [ gen: 0 ] ( 3 < X74 )
    leftChild [ gen: 1 ] ( 1 < X207 )
        leftChild [ gen: 2 ] ( 16 < X117 )
            leftChild [ gen: 3 ] ( 13 < X33 )
                leftChild [ gen: 4 ] ( 2 < X89 )
                    leftChild [ gen: 5 ] ( 3 < X677 )
                        leftChild [ gen: 6 ] ( 22 < X11 )
                            leftChild [ gen: 7 ] ( 1 < X9 )
                                leftChild [ gen: 8 ] ( 492 < X2 )
                                    leftChild [ gen: 9 ] ( 14 < X273 )
                                        leftChild [ gen: 10 ] ( 63 < X41 )
                                            leftChild [ gen: 11 ] ( 10 < X10 )
                                                leftChild [ gen: 12 ] ( 1 < X154 )
                                                    leftChild [ gen: 13 ] ( 1 < X409 )
                                                        leftChild [ gen: 14 ] ( classified to: 0, samples: 30 )
                                                        rightChild [ gen: 14 ] ( classified to: 1, samples: 1 )
                                                    rightChild [ gen: 13 ] ( 2 < X22 )
                                                        leftChild [ gen: 14 ] ( classified to: 1, samples: 1 )
                                                        rightChild [ gen: 14 ] ( classified to: 0, samples: 5 )
                                                rightChild [ gen: 12 ] ( classified to: 1, samples: 2 )
                                            rightChild [ gen: 11 ] ( classified to: 1, samples: 1 )
                                        rightChild [ gen: 10 ] ( classified to: 1, samples: 1 )
                                    rightChild [ gen: 9 ] ( classified to: 0, samples: 16 )
                                rightChild [ gen: 8 ] ( 24 < X100 )
                                    leftChild [ gen: 9 ] ( 1 < X672 )
                                        leftChild [ gen: 10 ] ( 3 < X144 )
                                            leftChild [ gen: 11 ] ( 845 < X1 )
                                                leftChild [ gen: 12 ] ( classified to: 1, samples: 33 )
                                                rightChild [ gen: 12 ] ( 1 < X24 )
                                                    leftChild [ gen: 13 ] ( classified to: 1, samples: 5 )
                                                    rightChild [ gen: 13 ] ( classified to: 0, samples: 1 )
                                            rightChild [ gen: 11 ] ( 2 < X31 )
                                                leftChild [ gen: 12 ] ( 60 < X9 )
                                                    leftChild [ gen: 13 ] ( classified to: 0, samples: 1 )
                                                    rightChild [ gen: 13 ] ( classified to: 1, samples: 1 )
                                                rightChild [ gen: 12 ] ( classified to: 0, samples: 1 )
                                        rightChild [ gen: 10 ] ( classified to: 0, samples: 1 )
                                    rightChild [ gen: 9 ] ( classified to: 0, samples: 2 )
                            rightChild [ gen: 7 ] ( classified to: 1, samples: 5 )
                        rightChild [ gen: 6 ] ( classified to: 1, samples: 4 )
                    rightChild [ gen: 5 ] ( 1 < X27 )
                        leftChild [ gen: 6 ] ( classified to: 1, samples: 2 )
                        rightChild [ gen: 6 ] ( classified to: 0, samples: 10 )
                rightChild [ gen: 4 ] ( 29 < X338 )
                    leftChild [ gen: 5 ] ( 6 < X44 )
                        leftChild [ gen: 6 ] ( 30 < X9 )
                            leftChild [ gen: 7 ] ( 245 < X24 )
                                leftChild [ gen: 8 ] ( 43 < X27 )
                                    leftChild [ gen: 9 ] ( classified to: 1, samples: 14 )
                                    rightChild [ gen: 9 ] ( classified to: 0, samples: 1 )
                                rightChild [ gen: 8 ] ( classified to: 0, samples: 1 )
                            rightChild [ gen: 7 ] ( classified to: 1, samples: 9 )
                        rightChild [ gen: 6 ] ( 36 < X34 )
                            leftChild [ gen: 7 ] ( classified to: 0, samples: 2 )
                            rightChild [ gen: 7 ] ( classified to: 1, samples: 1 )
                    rightChild [ gen: 5 ] ( classified to: 0, samples: 1 )
            rightChild [ gen: 3 ] ( classified to: 1, samples: 8 )
        rightChild [ gen: 2 ] ( classified to: 1, samples: 3 )
    rightChild [ gen: 1 ] ( 1 < X2 )
        leftChild [ gen: 2 ] ( 7 < X31 )
            leftChild [ gen: 3 ] ( classified to: 0, samples: 6 )
            rightChild [ gen: 3 ] ( 1 < X29 )
                leftChild [ gen: 4 ] ( classified to: 1, samples: 4 )
                rightChild [ gen: 4 ] ( classified to: 0, samples: 1 )
        rightChild [ gen: 2 ] ( classified to: 0, samples: 13 )
treeErrorRate: 0.261538461538

azmfaridee / mothur