Closed azmfaridee closed 11 years ago
Although this week was dedicated to the implementation of the RRF algorithm, I have also spent a considerable amount of time reading the literature to further understand how to improve our current implementation. Key points of interest when reading the literature were:
I have been reading these papers during the week:
Here is a summary of the runs from the normal Random Forest implementation that we are getting:
Dataset: outin.final.an.0.03.subsample.avg.shared
- Number of Training Samples: 341
- Number of Features (OTUs): 4350
- Number of Trees: 100
- Average ForestWideErrorRate: 6%
- Time: 91 minutes

Dataset: inpatient.final.an.0.03.subsample.avg.shared
- Number of Training Samples: 187
- Number of Features (OTUs): 1653
- Number of Trees: 100
- Average ForestWideErrorRate: 27%
- Time: 10 minutes
The same data was run with 10,000 trees by Kathyn on the Linux cluster; the average error came down to 22%, which is an improvement, but not a dramatic one.
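The diminishing return from adding trees matches what you would expect even in an idealized model. As a rough illustration (not mothur code), assume each tree errs independently with probability p and the forest takes a majority vote; the forest error is then a binomial tail that shrinks ever more slowly as trees are added. Real trees are correlated, which Breiman showed bounds the achievable error well above zero, so a plateau like 27% → 22% is not surprising:

```python
from math import comb

def majority_vote_error(p, n_trees):
    """Error of a majority vote over n_trees independent trees,
    each individually wrong with probability p (n_trees odd,
    so there are no ties)."""
    assert n_trees % 2 == 1
    return sum(comb(n_trees, k) * p**k * (1 - p) ** (n_trees - k)
               for k in range(n_trees // 2 + 1, n_trees + 1))

# Idealized illustration: individual trees at 40% error.
# The drop from 11 to 101 trees is large; from 101 to 1001 it is tiny.
for n in (1, 11, 101, 1001):
    print(n, majority_vote_error(0.4, n))
```

Under the (unrealistic) independence assumption the error goes to zero; the correlation between real trees is what keeps our observed error stuck near 22%.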
This gives us some insights:
The outin.final.an.0.03.subsample.avg.shared dataset returns an error rate of 6%, which looks impressive, but at the same time I'm skeptical that the trees are over-fitting the data, so I must take a deeper look at this over-fitting idea. These are some of the reasons why I've been going through the papers that I mentioned in the previous comments.
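One concrete way to probe that over-fitting worry without touching a held-out test set is the out-of-bag (OOB) estimate: each tree is trained on a bootstrap sample of the training rows, and the rows it never saw (about 1/e ≈ 36.8% of them) act as that tree's private validation set. If a forest-wide error is computed in-bag rather than OOB, it will look optimistically low. A minimal sketch of the bookkeeping, in Python rather than mothur's C++, reusing the 341-sample outin training-set size purely for illustration:

```python
import random

def oob_indices(n_samples, rng):
    """Draw one bootstrap sample of size n_samples and return
    the out-of-bag row indices (rows never drawn)."""
    in_bag = {rng.randrange(n_samples) for _ in range(n_samples)}
    return [i for i in range(n_samples) if i not in in_bag]

rng = random.Random(42)
n = 341  # size of the outin training set, for illustration only
fractions = [len(oob_indices(n, rng)) / n for _ in range(200)]
avg = sum(fractions) / len(fractions)
# Each row is out-of-bag with probability (1 - 1/n)**n -> 1/e ~= 0.368,
# so roughly a third of the training set is free validation data per tree.
```

Averaging each sample's error over only the trees for which it was out-of-bag gives an almost unbiased generalization estimate, which would tell us whether the 6% figure is real or an artifact of over-fitting.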
Related Issues: #3, #14, #15, #16, #17
As per issue #17 we already have a Random Forest implementation that can classify incoming data. We also have error-rate calculation and an error-rate-based variable importance measure in place. Now all that remains is to select the subset of features that are most important to us. A detailed investigation of the Regularized Random Forest framework was done in issue #13. There is also an R package that implements it on top of the Random Forest framework. Implementing this is one of this week's major tasks.
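For reference, the heart of the RRF idea, as I understand it from the paper and the R package, is a one-line change to the node-splitting criterion: the information gain of a feature that is not yet in the selected set F is multiplied by a penalty λ with 0 < λ ≤ 1, so a new feature only enters F if it beats the already-selected ones even after the penalty. A minimal Python sketch (the OTU names and gain values are made up for illustration; our actual implementation will live in the C++ forest code):

```python
def regularized_gain(gain, feature, selected, lam=0.8):
    """RRF rule: a feature already in the selected set F keeps its
    raw gain; a new feature is penalized by lam (0 < lam <= 1)."""
    return gain if feature in selected else lam * gain

def choose_split(gains, selected, lam=0.8):
    """Pick the best feature by regularized gain at one tree node,
    adding the winner to F. `gains` maps feature -> raw gain."""
    best = max(gains, key=lambda f: regularized_gain(gains[f], f, selected, lam))
    selected.add(best)
    return best

# Hypothetical example: OTU_77 has slightly higher raw gain, but with
# lam = 0.8 its penalized gain (0.264) loses to the already-selected
# OTU_12 (0.30), so the feature set stays small.
selected = {"OTU_12"}
gains = {"OTU_12": 0.30, "OTU_77": 0.33}
winner = choose_split(gains, selected)
```

The choice of λ controls how aggressively the feature set is compressed: λ = 1 recovers the ordinary forest, while smaller values force the trees to reuse features already in F, which is exactly the feature-selection behaviour we are after.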
We'll also need to do the associated parameter tuning and other performance improvements.
End of Week Deliverable
Notes:
From the initial proposal, this week we were supposed to be doing performance tuning on the Random Forest framework. But during issue #3 we discovered that what we are looking for is not a classification problem but a feature-selection problem. Therefore we'll also be using this week to implement the Regularized Random Forest framework for the feature selection.