Closed azmfaridee closed 12 years ago
Right now the most pressing issue is the handing of the sparse data that we have, a lot of features are zero filled, while others are just same data repeated over. This features need to be avoided, but this might also hurt the decision tree. I'm currently investigating ways to find the tradeoff between them
Here are two sample tree generated for inpatient.final.an.0.03.subsample.avg.matrix
file. The file contains 187
training samples and 845
features (OTUs).
root [ gen: 0 ] ( 536 < X2 )
leftChild [ gen: 1 ] ( 42 < X22 )
leftChild [ gen: 2 ] ( 9 < X31 )
leftChild [ gen: 3 ] ( 2 < X36 )
leftChild [ gen: 4 ] ( 8 < X33 )
leftChild [ gen: 5 ] ( 3 < X37 )
leftChild [ gen: 6 ] ( 1 < X2 )
leftChild [ gen: 7 ] ( classified to: 0, samples: 13 )
rightChild [ gen: 7 ] ( 131 < X7 )
leftChild [ gen: 8 ] ( 2 < X24 )
leftChild [ gen: 9 ] ( 1 < X3 )
leftChild [ gen: 10 ] ( classified to: 0, samples: 2 )
rightChild [ gen: 10 ] ( 1056 < X1 )
leftChild [ gen: 11 ] ( classified to: 1, samples: 18 )
rightChild [ gen: 11 ] ( classified to: 0, samples: 1 )
rightChild [ gen: 9 ] ( 302 < X1 )
leftChild [ gen: 10 ] ( classified to: 1, samples: 3 )
rightChild [ gen: 10 ] ( classified to: 0, samples: 6 )
rightChild [ gen: 8 ] ( classified to: 0, samples: 6 )
rightChild [ gen: 6 ] ( classified to: 1, samples: 10 )
rightChild [ gen: 5 ] ( classified to: 1, samples: 14 )
rightChild [ gen: 4 ] ( 61 < X27 )
leftChild [ gen: 5 ] ( 997 < X7 )
leftChild [ gen: 6 ] ( classified to: 0, samples: 25 )
rightChild [ gen: 6 ] ( classified to: 1, samples: 1 )
rightChild [ gen: 5 ] ( classified to: 1, samples: 3 )
rightChild [ gen: 3 ] ( 27 < X31 )
leftChild [ gen: 4 ] ( classified to: 1, samples: 21 )
rightChild [ gen: 4 ] ( 9 < X22 )
leftChild [ gen: 5 ] ( 52 < X34 )
leftChild [ gen: 6 ] ( classified to: 1, samples: 8 )
rightChild [ gen: 6 ] ( classified to: 0, samples: 2 )
rightChild [ gen: 5 ] ( classified to: 0, samples: 3 )
rightChild [ gen: 2 ] ( 1 < X21 )
leftChild [ gen: 3 ] ( classified to: 1, samples: 4 )
rightChild [ gen: 3 ] ( classified to: 0, samples: 29 )
rightChild [ gen: 1 ] ( classified to: 0, samples: 18 )
root [ gen: 0 ] ( 492 < X2 )
leftChild [ gen: 1 ] ( 1 < X22 )
leftChild [ gen: 2 ] ( 10 < X3 )
leftChild [ gen: 3 ] ( 249 < X2 )
leftChild [ gen: 4 ] ( 43 < X27 )
leftChild [ gen: 5 ] ( classified to: 1, samples: 31 )
rightChild [ gen: 5 ] ( 676 < X1 )
leftChild [ gen: 6 ] ( classified to: 1, samples: 4 )
rightChild [ gen: 6 ] ( classified to: 0, samples: 1 )
rightChild [ gen: 4 ] ( 286 < X2 )
leftChild [ gen: 5 ] ( classified to: 0, samples: 2 )
rightChild [ gen: 5 ] ( classified to: 1, samples: 4 )
rightChild [ gen: 3 ] ( 2 < X2 )
leftChild [ gen: 4 ] ( classified to: 0, samples: 9 )
rightChild [ gen: 4 ] ( 2 < X33 )
leftChild [ gen: 5 ] ( 583 < X1 )
leftChild [ gen: 6 ] ( classified to: 0, samples: 4 )
rightChild [ gen: 6 ] ( 852 < X1 )
leftChild [ gen: 7 ] ( classified to: 1, samples: 4 )
rightChild [ gen: 7 ] ( classified to: 0, samples: 1 )
rightChild [ gen: 5 ] ( classified to: 1, samples: 10 )
rightChild [ gen: 2 ] ( 223 < X7 )
leftChild [ gen: 3 ] ( 13 < X33 )
leftChild [ gen: 4 ] ( 1 < X7 )
leftChild [ gen: 5 ] ( 797 < X1 )
leftChild [ gen: 6 ] ( classified to: 1, samples: 7 )
rightChild [ gen: 6 ] ( classified to: 0, samples: 1 )
rightChild [ gen: 5 ] ( 23 < X22 )
leftChild [ gen: 6 ] ( 3 < X33 )
leftChild [ gen: 7 ] ( 1 < X24 )
leftChild [ gen: 8 ] ( classified to: 0, samples: 6 )
rightChild [ gen: 8 ] ( 105 < X24 )
leftChild [ gen: 9 ] ( 117 < X3 )
leftChild [ gen: 10 ] ( classified to: 1, samples: 8 )
rightChild [ gen: 10 ] ( classified to: 0, samples: 1 )
rightChild [ gen: 9 ] ( classified to: 0, samples: 4 )
rightChild [ gen: 7 ] ( 657 < X1 )
leftChild [ gen: 8 ] ( classified to: 0, samples: 11 )
rightChild [ gen: 8 ] ( classified to: 1, samples: 1 )
rightChild [ gen: 6 ] ( classified to: 0, samples: 31 )
rightChild [ gen: 4 ] ( 196 < X2 )
leftChild [ gen: 5 ] ( classified to: 1, samples: 7 )
rightChild [ gen: 5 ] ( 243 < X1 )
leftChild [ gen: 6 ] ( classified to: 0, samples: 1 )
rightChild [ gen: 6 ] ( classified to: 1, samples: 1 )
rightChild [ gen: 3 ] ( 141 < X21 )
leftChild [ gen: 4 ] ( classified to: 1, samples: 12 )
rightChild [ gen: 4 ] ( 1 < X3 )
leftChild [ gen: 5 ] ( classified to: 0, samples: 1 )
rightChild [ gen: 5 ] ( classified to: 1, samples: 1 )
rightChild [ gen: 1 ] ( 17 < X27 )
leftChild [ gen: 2 ] ( 963 < X2 )
leftChild [ gen: 3 ] ( classified to: 0, samples: 11 )
rightChild [ gen: 3 ] ( 976 < X2 )
leftChild [ gen: 4 ] ( classified to: 1, samples: 2 )
rightChild [ gen: 4 ] ( classified to: 0, samples: 10 )
rightChild [ gen: 2 ] ( classified to: 1, samples: 1 )
I'm trying to verify the validity of these generated trees.
This calculation is wrong, the entropy values that it is calculating is wrong, as none of these split provides total classification, yet one of the split entropies is listed at 0. Need to fix this.
tryIndex 36
getMinEntropyOfFeature()
featureOutputPair [[3, 0], [3, 1], [3, 0], [3, 1], [3, 1], [3, 1], [3, 1], [3, 0], [3, 0], [4, 0], [4, 0], [4, 0], [4, 0], [4, 0], [4, 0], [5, 0], [5, 0], [6, 0], [6, 0], [7, 0], [9, 0], [9, 0], [11, 0], [13, 0], [13, 0], [13, 0], [13, 0], [45, 0], [45, 0]]
splitPoints: [1, 2, 3, 7]
getBestSplitAndMinEntropy()
upperEntropyOfSplit: 0.0 lowerEntropyOfSplit: 0.676941869781
numLessThanValueAtSplitPoint: 0 numGreaterThanValueAtSplitPoint: 29
upperEntropyOfSplit: 1.0 lowerEntropyOfSplit: 0.605186576633
numLessThanValueAtSplitPoint: 0 numGreaterThanValueAtSplitPoint: 29
upperEntropyOfSplit: 0.918295834054 lowerEntropyOfSplit: 0.619382194679
numLessThanValueAtSplitPoint: 0 numGreaterThanValueAtSplitPoint: 29
upperEntropyOfSplit: 0.863120568567 lowerEntropyOfSplit: 0.0
numLessThanValueAtSplitPoint: 0 numGreaterThanValueAtSplitPoint: 29
entropies: [0.676941869780886, 0.6051865766334206, 0.6193821946787638, 0.0]
This calculation is wrong, the entropy values that it is calculating is wrong, as none of these split provides total classification, yet one of the split entropies is listed at 0. Need to fix this.
tryIndex 36 getMinEntropyOfFeature() featureOutputPair [[3, 0], [3, 1], [3, 0], [3, 1], [3, 1], [3, 1], [3, 1], [3, 0], [3, 0], [4, 0], [4, 0], [4, 0], [4, 0], [4, 0], [4, 0], [5, 0], [5, 0], [6, 0], [6, 0], [7, 0], [9, 0], [9, 0], [11, 0], [13, 0], [13, 0], [13, 0], [13, 0], [45, 0], [45, 0]] splitPoints: [1, 2, 3, 7] getBestSplitAndMinEntropy() upperEntropyOfSplit: 0.0 lowerEntropyOfSplit: 0.676941869781 numLessThanValueAtSplitPoint: 0 numGreaterThanValueAtSplitPoint: 29 upperEntropyOfSplit: 1.0 lowerEntropyOfSplit: 0.605186576633 numLessThanValueAtSplitPoint: 0 numGreaterThanValueAtSplitPoint: 29 upperEntropyOfSplit: 0.918295834054 lowerEntropyOfSplit: 0.619382194679 numLessThanValueAtSplitPoint: 0 numGreaterThanValueAtSplitPoint: 29 upperEntropyOfSplit: 0.863120568567 lowerEntropyOfSplit: 0.0 numLessThanValueAtSplitPoint: 0 numGreaterThanValueAtSplitPoint: 29 entropies: [0.676941869780886, 0.6051865766334206, 0.6193821946787638, 0.0]
The implementation is wrong, because the split points are calculated prior and these entropies are calculated at that split point, but the value of split point is calculated later, and if some of the values are same that belong to different split points, then the algorithm generates wrong answer. The implementation is wrong, because similar values would belong to a single split point in the first place.
The implementation is wrong, because the split points are calculated prior and these entropies are calculated at that split point, but the value of split point is calculated later, and if some of the values are same that belong to different split points, then the algorithm generates wrong answer. The implementation is wrong, because similar values would belong to a single split point in the first place.
This issue has been solved in the last commit.
Here is an sample tree generated from the updated code that has an error rate of 26.15%. i.e. it can successfully classify roughly 74% of the cases.
root [ gen: 0 ] ( 3 < X74 )
leftChild [ gen: 1 ] ( 1 < X207 )
leftChild [ gen: 2 ] ( 16 < X117 )
leftChild [ gen: 3 ] ( 13 < X33 )
leftChild [ gen: 4 ] ( 2 < X89 )
leftChild [ gen: 5 ] ( 3 < X677 )
leftChild [ gen: 6 ] ( 22 < X11 )
leftChild [ gen: 7 ] ( 1 < X9 )
leftChild [ gen: 8 ] ( 492 < X2 )
leftChild [ gen: 9 ] ( 14 < X273 )
leftChild [ gen: 10 ] ( 63 < X41 )
leftChild [ gen: 11 ] ( 10 < X10 )
leftChild [ gen: 12 ] ( 1 < X154 )
leftChild [ gen: 13 ] ( 1 < X409 )
leftChild [ gen: 14 ] ( classified to: 0, samples: 30 )
rightChild [ gen: 14 ] ( classified to: 1, samples: 1 )
rightChild [ gen: 13 ] ( 2 < X22 )
leftChild [ gen: 14 ] ( classified to: 1, samples: 1 )
rightChild [ gen: 14 ] ( classified to: 0, samples: 5 )
rightChild [ gen: 12 ] ( classified to: 1, samples: 2 )
rightChild [ gen: 11 ] ( classified to: 1, samples: 1 )
rightChild [ gen: 10 ] ( classified to: 1, samples: 1 )
rightChild [ gen: 9 ] ( classified to: 0, samples: 16 )
rightChild [ gen: 8 ] ( 24 < X100 )
leftChild [ gen: 9 ] ( 1 < X672 )
leftChild [ gen: 10 ] ( 3 < X144 )
leftChild [ gen: 11 ] ( 845 < X1 )
leftChild [ gen: 12 ] ( classified to: 1, samples: 33 )
rightChild [ gen: 12 ] ( 1 < X24 )
leftChild [ gen: 13 ] ( classified to: 1, samples: 5 )
rightChild [ gen: 13 ] ( classified to: 0, samples: 1 )
rightChild [ gen: 11 ] ( 2 < X31 )
leftChild [ gen: 12 ] ( 60 < X9 )
leftChild [ gen: 13 ] ( classified to: 0, samples: 1 )
rightChild [ gen: 13 ] ( classified to: 1, samples: 1 )
rightChild [ gen: 12 ] ( classified to: 0, samples: 1 )
rightChild [ gen: 10 ] ( classified to: 0, samples: 1 )
rightChild [ gen: 9 ] ( classified to: 0, samples: 2 )
rightChild [ gen: 7 ] ( classified to: 1, samples: 5 )
rightChild [ gen: 6 ] ( classified to: 1, samples: 4 )
rightChild [ gen: 5 ] ( 1 < X27 )
leftChild [ gen: 6 ] ( classified to: 1, samples: 2 )
rightChild [ gen: 6 ] ( classified to: 0, samples: 10 )
rightChild [ gen: 4 ] ( 29 < X338 )
leftChild [ gen: 5 ] ( 6 < X44 )
leftChild [ gen: 6 ] ( 30 < X9 )
leftChild [ gen: 7 ] ( 245 < X24 )
leftChild [ gen: 8 ] ( 43 < X27 )
leftChild [ gen: 9 ] ( classified to: 1, samples: 14 )
rightChild [ gen: 9 ] ( classified to: 0, samples: 1 )
rightChild [ gen: 8 ] ( classified to: 0, samples: 1 )
rightChild [ gen: 7 ] ( classified to: 1, samples: 9 )
rightChild [ gen: 6 ] ( 36 < X34 )
leftChild [ gen: 7 ] ( classified to: 0, samples: 2 )
rightChild [ gen: 7 ] ( classified to: 1, samples: 1 )
rightChild [ gen: 5 ] ( classified to: 0, samples: 1 )
rightChild [ gen: 3 ] ( classified to: 1, samples: 8 )
rightChild [ gen: 2 ] ( classified to: 1, samples: 3 )
rightChild [ gen: 1 ] ( 1 < X2 )
leftChild [ gen: 2 ] ( 7 < X31 )
leftChild [ gen: 3 ] ( classified to: 0, samples: 6 )
rightChild [ gen: 3 ] ( 1 < X29 )
leftChild [ gen: 4 ] ( classified to: 1, samples: 4 )
rightChild [ gen: 4 ] ( classified to: 0, samples: 1 )
rightChild [ gen: 2 ] ( classified to: 0, samples: 13 )
treeErrorRate: 0.261538461538
Related Issues: #3, #14, #15
As the per the initial proposal, code the remaining part of the Random Forest algorithm and finish the implementations. Most of this weeks task week be wrapping up the works that was done in Week 5 and Week 6. Since we are now able to create decision trees from the bootstrapped data, it's now time to aggregate the results and build the ensemble classifier feature selector. Specific functions of interests are:
End of Week Deliverable