azmfaridee / mothur

This is a GSoC 2012 fork of 'Mothur'. We are trying to implement a number of 'Feature Selection' algorithms for microbial ecology data and incorporate them into mothur's main codebase.
https://github.com/mothur/mothur
GNU General Public License v3.0

Review the Literature of 'Parameter Selection' and Create a wiki Page to Document the Findings #11

Closed azmfaridee closed 11 years ago

azmfaridee commented 12 years ago

Parent Issue: #3

azmfaridee commented 12 years ago

Wiki page created with title Literature Review on Parameter Selection

azmfaridee commented 12 years ago

Added a summary of the article titled A Review of Feature Selection Techniques in Bioinformatics. Among the three types of feature selection methods (Filter, Wrapper and Embedded), we need to find the proper one for us.

Some of the criteria for selecting the best algorithm from this:

The survey paper mentioned above is a bit old (2007), so we might need to concentrate on a more recently published paper. We also need to find a paper that draws a statistical comparison among all the approaches; that would make things easier for us.

kdiverson commented 12 years ago

I think we need to make a decision about our feature selection algo in the next couple of days. The coding period starts next Monday, 21 May and we'll probably need at least a few days to outline the algo implementation. There's a literature review page on the wiki and we've talked about adapting an algo from microarray analysis. One other thing to consider is our algo will have to be good with sparse matrices.

azmfaridee commented 12 years ago

@kdiverson: I agree on that. I have already come up with a plan.

I have been reviewing the paper titled Feature Selection via Regularized Trees by Houtao Deng and George Runger, which was published very recently (2012). They use Regularized Random Forests (a specialization of vanilla Random Forests) as an Embedded Method for Feature Selection. Their evaluation statistics look pretty promising. We are also reviewing the paper Gene Selection with Regularized Random Forest, where the authors applied this technique empirically. So I think this is a good place to start. It has the added benefits of:

I'll be updating the wiki with this idea; meanwhile I'd like to know what you are thinking. Does it all make sense?

@mothur-westcott what do you think?
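
As a rough illustration of the regularized-tree idea mentioned above: at each node, a feature that has not been used in any earlier split has its gain multiplied by a penalty factor between 0 and 1, so an already selected feature wins unless a new one is clearly better. This is a minimal sketch based on one reading of the Deng and Runger paper, not their implementation. The function names, the OTU names and the gain numbers are all made up for illustration.

```python
def regularized_gain(gain, feature, selected, lam):
    """Return the gain unchanged for already selected features,
    and scaled by lam (0 < lam <= 1) for features not yet selected."""
    return gain if feature in selected else lam * gain


def choose_split(raw_gains, selected, lam):
    """Pick the feature with the highest regularized gain at one node.

    raw_gains: dict mapping feature name -> unpenalized gain at this node.
    The chosen feature is added to `selected` (a no-op if already present).
    """
    best = max(
        raw_gains,
        key=lambda f: regularized_gain(raw_gains[f], f, selected, lam),
    )
    selected.add(best)
    return best


# Hypothetical per-node gains for three OTUs (numbers are made up).
node_gains = [
    {"otu_a": 0.30, "otu_b": 0.28, "otu_c": 0.05},
    {"otu_a": 0.20, "otu_b": 0.22, "otu_c": 0.04},
    {"otu_a": 0.02, "otu_b": 0.03, "otu_c": 0.10},
]

# With a penalty, otu_b never enters the selected set.
selected = set()
choices = [choose_split(g, selected, lam=0.8) for g in node_gains]

# With lam = 1.0 there is no penalty and otu_b is picked at the second node.
selected_plain = set()
choices_plain = [choose_split(g, selected_plain, lam=1.0) for g in node_gains]
```

In this toy run the penalized version ends with a smaller feature subset than the unpenalized one, which is the behaviour that makes the method usable for feature selection.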

azmfaridee commented 12 years ago

I went through the paper An Introduction to Variable and Feature Selection by Isabelle Guyon and André Elisseeff. Among the many things they discuss, one of particular interest was choosing between Variable Ranking and Variable Subset Selection. Sometimes Variable Ranking might not be a good idea, because there might be interdependence between the variables, and individual ranking does not respect these interdependences. In those cases Variable Clustering could be a better option.
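
A small sketch of why individual ranking can miss interdependent variables, using a made-up XOR-style toy dataset (not our data). Each feature on its own has zero correlation with the label, so a univariate ranking would score both as useless, yet the two together determine the label exactly. The helper names are hypothetical.

```python
from itertools import product


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


# Toy data: two binary features, label is their XOR.
samples = list(product([0, 1], repeat=2))
labels = [a ^ b for a, b in samples]

# Univariate ranking: score each feature independently of the other.
ranking_scores = [
    abs(pearson([s[i] for s in samples], labels)) for i in range(2)
]


def determines_label(columns):
    """True if samples that agree on `columns` always share the same label."""
    seen = {}
    for s, y in zip(samples, labels):
        key = tuple(s[i] for i in columns)
        if seen.setdefault(key, y) != y:
            return False
    return True


single_is_enough = determines_label([0])   # one feature alone
pair_is_enough = determines_label([0, 1])  # both features together
```

Here both ranking scores come out as zero while the pair is fully predictive, so a subset-based or embedded method would keep features that a pure ranking would discard.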

For our Microbial Ecology data there is a lot of interdependence between the microbes, so Variable Clustering could be a better option for us. We might need to choose a dedicated algorithm to discover the interdependence between the microbes.

@mothur-westcott @kdiverson Do you know of any such algorithms, that can cluster microbes according to their behavior?

azmfaridee commented 12 years ago

Is the ordering of the selected features important in Random Forest? In the end we are just giving out a certain subset of features, but the ordering among them might be important for a Decision Tree based framework. We'd need to dig into this too.

mothur-westcott commented 12 years ago

@kdiverson Your outline looks good to me.
@darthxaher I don't know of any other algorithms. Kathryn is probably a better resource in that respect.

kdiverson commented 12 years ago

@darthxaher none that come to mind at the moment but I'll look into it. We don't really know the interdependence going into the analysis. We don't know what is dependent on what but from a biological standpoint we do know that there is something going on. Figuring out what is dependent on what would probably be something that came out of the feature selection.

kdiverson commented 12 years ago

I've just found the paper Supervised classification of human microbiota [0] and it's made me reconsider the best machine learning approach. They still suggest random forest (RF) as the best algo for classification, but for feature selection they suggest combining RF with elastic net (ENET). I'm not familiar with ENET but it seems to be a good performer. It was also fairly good at feature reduction, although not the best. For feature reduction SVM-RFE was recommended, as it had a similar accuracy to RF but used fewer features for the classification. This would be useful for defining a 'core' set of OTUs.

I'm still in favor of using RF since we have put a lot of work into that algo already (and it seems to be the darling child as evidenced by some recent papers) but I think we should at least look into ENET as a possible enhancement for feature selection.

Some more info on elastic net [1]. It looks like there's an R package [2] we can look at as well, if we decide to go this direction.

[0] http://onlinelibrary.wiley.com/doi/10.1111/j.1574-6976.2010.00251.x/full
[1] https://en.wikipedia.org/wiki/Elastic_Net
[2] http://www.jstatsoft.org/v33/i01
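
For orientation, a minimal sketch of what the elastic net penalty does. It mixes an L1 (lasso) term and an L2 (ridge) term, and the L1 part can shrink small coefficients to exactly zero, which is why it can act as a feature selector. The parameterization below (lam for overall strength, alpha for the L1/L2 mix, standardized predictors) follows the formulation in [2] as best I understand it; the function names are hypothetical and this is not the R package's API.

```python
def elastic_net_penalty(beta, lam, alpha):
    """lam * (alpha * ||beta||_1 + (1 - alpha) / 2 * ||beta||_2^2)."""
    l1 = sum(abs(b) for b in beta)
    l2 = sum(b * b for b in beta)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2)


def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def coordinate_update(z, lam, alpha):
    """One coordinate-descent update for a single coefficient.

    z is the (assumed precomputed) average of feature * partial residual
    for a standardized feature.
    """
    return soft_threshold(z, lam * alpha) / (1.0 + lam * (1.0 - alpha))


# A weakly associated feature is dropped entirely under a pure L1 penalty...
weak = coordinate_update(0.05, lam=0.1, alpha=1.0)
# ...while a strongly associated one is kept but shrunk.
strong = coordinate_update(0.5, lam=0.1, alpha=1.0)
# With alpha = 0 (pure ridge) coefficients shrink but never hit exactly zero.
ridge = coordinate_update(0.05, lam=0.1, alpha=0.0)
```

Features whose coefficient ends at exactly zero are effectively removed from the model, so the surviving nonzero coefficients form the selected feature subset.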

mothur-westcott commented 12 years ago

@kdiverson for link [0], I am getting a login screen. Is there another way to read it?

"Options for accessing this content: If you have access to this content through a society membership, please first log in to your society website. If you would like institutional access to this content, please recommend the title to your librarian. Login via Athens http://onlinelibrary.wiley.com/athens or other institutional login options http://onlinelibrary.wiley.com/login-options . You can purchase online access to this Article for a 24-hour period (price varies by title) "

kdiverson commented 12 years ago

@mothur-westcott oh, I guess you need to be on a university network, sorry about that. I can email you the pdf.

mothur-westcott commented 12 years ago

thanks, :)

azmfaridee commented 12 years ago

@kdiverson: I created a folder called ENETandOtherRelatedPapers in the dropbox folder, could you download the pdf of [0] and drop it there? It would be nice if we could add @mothur-westcott to our shared folder too.

I just skimmed through [2]. Most of the paper talks about a regression model, whereas the problem we are dealing with is feature selection in an embedded classifier method. I'm not completely sure how this fits our problem set; maybe I can judge better once I have read [0].

kdiverson commented 12 years ago

@darthxaher the paper was in the biology directory, moved to ENET.