azmfaridee / mothur

This is a GSoC 2012 fork of 'Mothur'. We are trying to implement a number of 'Feature Selection' algorithms for microbial ecology data and incorporate them into mothur's main codebase.
https://github.com/mothur/mothur
GNU General Public License v3.0

Week 9: Implement Regularized Random Forest Framework on top of Current Random Forest Implementation and Tune Parameters #19

Closed azmfaridee closed 11 years ago

azmfaridee commented 12 years ago

Related Issues: #3, #14, #15, #16, #17

As per issue #17, we already have a Random Forest implementation that can classify incoming data. We also have error rate calculation and an error-rate-based variable importance measure in place. Now all that remains is to select the subset of features that are most important to us. A detailed investigation of the Regularized Random Forest framework was done in issue #13, and there is also an R package that implements it on top of the Random Forest framework. Implementing this is one of this week's major tasks.
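For context, the error-rate-based importance measure mentioned above is presumably in the spirit of Breiman's permutation importance: permute one feature's values across the out-of-bag samples and see how much the error rate rises. Below is a minimal self-contained sketch of that idea; all names here (`Classifier`, `permutationImportance`, etc.) are hypothetical stand-ins, not the actual classes in this fork.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

// Hypothetical stand-in for a trained tree's prediction function:
// takes one sample's feature vector, returns a class label.
using Classifier = std::function<int(const std::vector<double>&)>;

static double errorRate(const Classifier& classify,
                        const std::vector<std::vector<double>>& samples,
                        const std::vector<int>& labels) {
    std::size_t wrong = 0;
    for (std::size_t i = 0; i < samples.size(); i++)
        if (classify(samples[i]) != labels[i]) wrong++;
    return samples.empty() ? 0.0 : static_cast<double>(wrong) / samples.size();
}

// importance(j) = errorRate(OOB with feature j permuted) - errorRate(OOB)
double permutationImportance(const Classifier& classify,
                             std::vector<std::vector<double>> oobSamples,  // by value: we mutate a copy
                             const std::vector<int>& oobLabels,
                             std::size_t featureIndex,
                             std::mt19937& rng) {
    double baseError = errorRate(classify, oobSamples, oobLabels);

    // Shuffle one feature column across the out-of-bag samples, breaking its
    // association with the class labels while keeping its marginal distribution.
    std::vector<double> column(oobSamples.size());
    for (std::size_t i = 0; i < oobSamples.size(); i++) column[i] = oobSamples[i][featureIndex];
    std::shuffle(column.begin(), column.end(), rng);
    for (std::size_t i = 0; i < oobSamples.size(); i++) oobSamples[i][featureIndex] = column[i];

    // An informative feature should noticeably raise the error when permuted.
    return errorRate(classify, oobSamples, oobLabels) - baseError;
}
```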

We'd also need to do associated parameter tuning and other performance improvements.

End of Week Deliverable

From the initial proposal, this week we were supposed to be doing performance tuning on the Random Forest framework. But during issue #3 we discovered that what we are looking for is not a classification problem but a feature selection problem. Therefore we'd also be using this week to implement the Regularized Random Forest framework for feature selection.
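The core of the RRF framework (Deng and Runger's regularized feature selection) reduces to one change in the node-splitting rule: gains of features not yet in the selected set F are penalized by a coefficient lambda, so a new feature enters F only if it still beats every already-selected feature. A minimal sketch of that rule follows, assuming the per-feature gains are already computed by the existing split criterion; the function and variable names are illustrative, not this fork's actual code.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// gainR(i) = lambda * gain(i)  if feature i is NOT yet in the selected set F
//          = gain(i)           if feature i is already in F
// With 0 < lambda <= 1, a feature outside F must beat every feature already
// in F by a margin before a node is allowed to split on it.
std::size_t chooseSplitFeature(const std::vector<double>& gain,          // split gain per feature at this node
                               std::set<std::size_t>& selectedFeatures,  // F, shared across the whole forest
                               double lambda) {
    std::size_t best = 0;
    double bestGainR = -1.0;
    for (std::size_t i = 0; i < gain.size(); i++) {
        double gainR = selectedFeatures.count(i) ? gain[i] : lambda * gain[i];
        if (gainR > bestGainR) { bestGainR = gainR; best = i; }
    }
    selectedFeatures.insert(best);  // F grows only when a penalized feature still wins
    return best;
}
```

After the forest is grown, `selectedFeatures` holds the selected feature subset; lambda is the main parameter to tune, with smaller values yielding more aggressive (smaller) subsets.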

azmfaridee commented 12 years ago

Although this week was dedicated to the implementation of the RRF algorithm, I have also spent a considerable amount of time reading the literature to further understand how to improve our current implementation. The key points of interest when reading the literature were:

I have been reading these papers during the week:

azmfaridee commented 12 years ago

Here is a summary of the runs we are getting from the plain Random Forest implementation:

Dataset: outin.final.an.0.03.subsample.avg.shared
Number of Training Samples: 341
Number of Features (OTUs): 4350
Number of Trees: 100
Average ForestWideErrorRate: 6%
Time: 91 minutes

Dataset: inpatient.final.an.0.03.subsample.avg.shared
Number of Training Samples: 187
Number of Features (OTUs): 1653
Number of Trees: 100
Average ForestWideErrorRate: 27%
Time: 10 minutes

The same data was run with 10000 trees by Kathryn on the Linux cluster; the average error came down to 22%, which is an improvement, but not a dramatic one.
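For reference, here is a minimal sketch of one plausible reading of the ForestWideErrorRate figure reported above: each tree votes on each sample, the forest predicts by majority vote, and the error rate is the fraction of misclassified samples. This is an assumption about the metric, not the fork's actual measurement code.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <vector>

// Hypothetical stand-in for one trained tree's prediction function.
using TreeClassifier = std::function<int(const std::vector<double>&)>;

double forestWideErrorRate(const std::vector<TreeClassifier>& trees,
                           const std::vector<std::vector<double>>& samples,
                           const std::vector<int>& labels) {
    std::size_t wrong = 0;
    for (std::size_t i = 0; i < samples.size(); i++) {
        // Each tree casts one vote; the forest predicts the majority class.
        std::map<int, int> votes;
        for (const auto& tree : trees) votes[tree(samples[i])]++;
        int prediction = votes.begin()->first;
        for (const auto& kv : votes)
            if (kv.second > votes.at(prediction)) prediction = kv.first;
        if (prediction != labels[i]) wrong++;
    }
    return samples.empty() ? 0.0 : static_cast<double>(wrong) / samples.size();
}
```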

This gives us some insights:

These are some of the reasons why I've been going through the papers that I mentioned in the previous comments.