azmfaridee / mothur

This is a GSoC 2012 fork of 'Mothur'. We are trying to implement a number of 'Feature Selection' algorithms for microbial ecology data and incorporate them into mothur's main codebase.
https://github.com/mothur/mothur
GNU General Public License v3.0

Week 9: Implement Regularized Random Forest Framework on top of Current Random Forest Implementation and Tune Parameters #19

Closed azmfaridee closed 11 years ago

azmfaridee commented 12 years ago

Related Issues: #3, #14, #15, #16, #17

As per issue #17, we already have a Random Forest implementation that can classify incoming data. We also have error rate calculation and an error-rate-based variable importance measure in place. Now all that remains is to select the subset of features that are most important to us. A detailed investigation of the Regularized Random Forest framework was done in issue #13, and there is also an R package that implements it on top of the Random Forest framework. Implementing this is one of this week's major tasks.
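For context, the error-rate-based importance measure mentioned above is presumably in the spirit of Breiman's permutation importance: permute one feature's values across the out-of-bag samples and see how much the error rate rises. Below is a minimal self-contained sketch of that idea; all names here (`Classifier`, `permutationImportance`, etc.) are hypothetical stand-ins, not the actual classes in this fork.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

// Hypothetical stand-in for a trained tree's prediction function:
// takes one sample's feature vector, returns a class label.
using Classifier = std::function<int(const std::vector<double>&)>;

static double errorRate(const Classifier& classify,
                        const std::vector<std::vector<double>>& samples,
                        const std::vector<int>& labels) {
    std::size_t wrong = 0;
    for (std::size_t i = 0; i < samples.size(); i++)
        if (classify(samples[i]) != labels[i]) wrong++;
    return samples.empty() ? 0.0 : static_cast<double>(wrong) / samples.size();
}

// importance(j) = errorRate(OOB with feature j permuted) - errorRate(OOB)
double permutationImportance(const Classifier& classify,
                             std::vector<std::vector<double>> oobSamples,  // by value: we mutate a copy
                             const std::vector<int>& oobLabels,
                             std::size_t featureIndex,
                             std::mt19937& rng) {
    double baseError = errorRate(classify, oobSamples, oobLabels);

    // Shuffle one feature column across the out-of-bag samples, breaking its
    // association with the class labels while keeping its marginal distribution.
    std::vector<double> column(oobSamples.size());
    for (std::size_t i = 0; i < oobSamples.size(); i++) column[i] = oobSamples[i][featureIndex];
    std::shuffle(column.begin(), column.end(), rng);
    for (std::size_t i = 0; i < oobSamples.size(); i++) oobSamples[i][featureIndex] = column[i];

    // An informative feature should noticeably raise the error when permuted.
    return errorRate(classify, oobSamples, oobLabels) - baseError;
}
```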

We'd also need to do associated parameter tuning and other performance improvements.

End of Week Deliverable

From the initial proposal, this week we were supposed to be doing performance tuning on the Random Forest framework. But during issue #3 we discovered that what we are looking for is not a classification problem but a feature selection problem. Therefore we'd also be using this week to implement the Regularized Random Forest framework for feature selection.
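The core of the RRF framework (Deng and Runger's regularized feature selection) reduces to one change in the node-splitting rule: gains of features not yet in the selected set F are penalized by a coefficient lambda, so a new feature enters F only if it still beats every already-selected feature. A minimal sketch of that rule follows, assuming the per-feature gains are already computed by the existing split criterion; the function and variable names are illustrative, not this fork's actual code.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// gainR(i) = lambda * gain(i)  if feature i is NOT yet in the selected set F
//          = gain(i)           if feature i is already in F
// With 0 < lambda <= 1, a feature outside F must beat every feature already
// in F by a margin before a node is allowed to split on it.
std::size_t chooseSplitFeature(const std::vector<double>& gain,          // split gain per feature at this node
                               std::set<std::size_t>& selectedFeatures,  // F, shared across the whole forest
                               double lambda) {
    std::size_t best = 0;
    double bestGainR = -1.0;
    for (std::size_t i = 0; i < gain.size(); i++) {
        double gainR = selectedFeatures.count(i) ? gain[i] : lambda * gain[i];
        if (gainR > bestGainR) { bestGainR = gainR; best = i; }
    }
    selectedFeatures.insert(best);  // F grows only when a penalized feature still wins
    return best;
}
```

After the forest is grown, `selectedFeatures` holds the selected feature subset; lambda is the main parameter to tune, with smaller values yielding more aggressive (smaller) subsets.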

azmfaridee commented 12 years ago

Although this week was dedicated to the implementation of the RRF algorithm, I have also spent a considerable amount of time reading the literature to further understand how to improve our current implementation. The key points of interest when reading the literature were:

I have been reading these papers during the week:

azmfaridee commented 12 years ago

Here is a summary of the runs we are getting from the plain Random Forest implementation:

Dataset: outin.final.an.0.03.subsample.avg.shared
Number of Training Samples: 341
Number of Features (OTUs): 4350
Number of Trees: 100
Average ForestWideErrorRate: 6%
Time: 91 minutes

Dataset: inpatient.final.an.0.03.subsample.avg.shared
Number of Training Samples: 187
Number of Features (OTUs): 1653
Number of Trees: 100
Average ForestWideErrorRate: 27%
Time: 10 minutes

The same data was run with 10000 trees by Kathryn on the Linux cluster; the average error came down to 22%, which is an improvement, but not a dramatic one.
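For reference, here is a minimal sketch of one plausible reading of the ForestWideErrorRate figure reported above: each tree votes on each sample, the forest predicts by majority vote, and the error rate is the fraction of misclassified samples. This is an assumption about the metric, not the fork's actual measurement code.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <vector>

// Hypothetical stand-in for one trained tree's prediction function.
using TreeClassifier = std::function<int(const std::vector<double>&)>;

double forestWideErrorRate(const std::vector<TreeClassifier>& trees,
                           const std::vector<std::vector<double>>& samples,
                           const std::vector<int>& labels) {
    std::size_t wrong = 0;
    for (std::size_t i = 0; i < samples.size(); i++) {
        // Each tree casts one vote; the forest predicts the majority class.
        std::map<int, int> votes;
        for (const auto& tree : trees) votes[tree(samples[i])]++;
        int prediction = votes.begin()->first;
        for (const auto& kv : votes)
            if (kv.second > votes.at(prediction)) prediction = kv.first;
        if (prediction != labels[i]) wrong++;
    }
    return samples.empty() ? 0.0 : static_cast<double>(wrong) / samples.size();
}
```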

This gives us some insights:

These are some of the reasons why I've been going through the papers that I mentioned in the previous comments.