This is the GSoC 2012 fork of 'Mothur'. We are trying to implement a number of 'Feature Selection' algorithms for microbial ecology data and incorporate them into Mothur's main codebase.
Once the code has been implemented, we will need to tune its parameters to find the sweet spot between speed and predictive performance. The major parameters that we might need to tune are:
Number of trees
Number of variables in the subspace: if our training data has N variables/attributes, each split of the tree considers only a subset of them, of size P. We'd need to tune P.
Number of nodes in each tree
Entropy (information gain) criteria: Tan-Steinbach-Kumar's Data Mining textbook has a decent guide for setting this. We'd want to follow that lead.
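To make two of these parameters concrete, here is a minimal sketch of how a split might be scored by information gain and how a subspace of P out of N variables might be drawn. It is written in Python for brevity and is not the code in this fork; the function names (entropy, information_gain, sample_subspace) are our own illustrative choices, and the entropy definition is the standard Shannon entropy in bits.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels).values()
    return -sum((c / total) * math.log2(c / total) for c in counts)


def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child partitions."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


def sample_subspace(n_variables, p, rng):
    """Draw P distinct variable indices out of N to consider at one split.

    Hypothetical helper: how the subset is drawn in the real
    implementation is an assumption here.
    """
    return sorted(rng.sample(range(n_variables), p))


if __name__ == "__main__":
    parent = ["a", "a", "b", "b"]
    split = [["a", "a"], ["b", "b"]]  # a split that separates the classes
    print("gain:", information_gain(parent, split))
    print("subspace:", sample_subspace(10, 3, random.Random(0)))
```

A split that separates the classes perfectly recovers all of the parent's entropy, while a split that leaves each child as mixed as the parent has a gain of zero. Tuning P then trades the cost of scoring more candidate variables per split against the chance of finding a high-gain one.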
We'd like to gather more resources from the web and investigate the best practices out there for tuning these parameters.