azmfaridee / mothur

This is the GSoC 2012 fork of 'Mothur'. We are trying to implement a number of 'Feature Selection' algorithms for microbial ecology data and incorporate them into mothur's main codebase.
https://github.com/mothur/mothur
GNU General Public License v3.0

Classification/Regression or Both! #3

Closed azmfaridee closed 11 years ago

azmfaridee commented 12 years ago

Depending on the question we wish to ask of our knowledge base after training on our dataset, the problem can be categorized as either a Classification or a Regression problem.

For example, after training on the AmazonDataset using the Shared File in the example, if we ask for the ratio of OTU1 to OTU7 given the ratios of the other OTUs, it would be a Regression problem.

On the other hand, given the ratios of the OTUs, if we ask whether the dataset belongs to pasture or rainforest and expect one of two discrete answers, this becomes a Classification problem.

We need to decide whether we approach this problem as classification or regression, or whether we need both types of questions answered; if so, what would be the best way to design the algorithm? Do we train two models, or is there a special way to run Classification and Regression at the same time?
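
Here is a rough sketch of the "train two times" option, using scikit-learn and made-up numbers purely for illustration (this is not mothur code, and the column indices standing in for OTU 7 and OTU 8 are arbitrary): the same samples-by-OTUs table feeds one model per question.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Made-up stand-in for a shared-file-style table: rows are samples,
# columns are the abundances of 20 OTUs.
rng = np.random.default_rng(0)
otus = rng.poisson(10, size=(30, 20)).astype(float) + 1.0
site = np.array(["pasture"] * 15 + ["rainforest"] * 15)

# Question 1 (classification): which site does a sample come from?
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(otus, site)

# Question 2 (regression): what is the OTU7:OTU8 ratio, given the others?
# Columns 6 and 7 (0-based) stand in for OTU 7 and OTU 8 here.
ratio = otus[:, 6] / otus[:, 7]
others = np.delete(otus, [6, 7], axis=1)
reg = RandomForestRegressor(n_estimators=200, random_state=0)
reg.fit(others, ratio)
```

Under this design the answer to "do we train two times" would simply be yes: two separate fits over the same feature table, one per question type.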

Child Issues: #11

mothur-westcott commented 12 years ago

If we ask for the ratio of OTU1 to OTU7 given the ratios of the other OTUs, it would be a classification problem.

I am not sure I understand what you are asking. This may be better answered by Kathryn, but could you try explaining it a little more?

azmfaridee commented 12 years ago

If we ask for the ratio of OTU1 to OTU7 given the ratios of the other OTUs, it would be a classification problem.

I am not sure I understand what you are asking. This may be better answered by Kathryn, but could you try explaining it a little more?

@mothur-westcott: Ah, sorry for the typo; it should be "If we ask for the ratio of OTU1 to OTU7 given the ratios of the other OTUs, it would be a Regression problem." I've corrected this in the issue.

I've already talked with @kdiverson about this and she agreed that it might be a pressing issue. Say, for example, we have a training set T with 20 OTUs and two regions, Forest and Pasture. Now, given the values/ratios of OTUs 1 to 6 and 9 to 20, say 2 : 4 : 9 : 20 : ... ... ... : 10, we'd like to know the ratio between OTU 7 and OTU 8. Since that ratio can be anything and does not come from a finite set, asking this kind of question of the system effectively makes it a Regression problem.

On the other hand, given all the ratios among the OTUs, if you ask whether a sample comes from Forest or Pasture, the answer is effectively locked to one of these two, so we are now addressing a Classification problem.

kdiverson commented 12 years ago

I thought about this a little more last night. The regression question would only be useful for making comparisons if the user can guarantee the OTUs are the same. OTU1 could be anything, and it can be different for different datasets. For example, if I generate a list of OTUs from some site, let's say the ocean, and then a few months later I get a new sample from a lake and generate OTUs from that, OTU1 from the ocean will be different from OTU1 in the lake sample unless both ocean and lake are re-classified together. If we are going to try the regression method, we must make sure OTU1 means the same thing in all the data that's being compared.

If you wanted to get really crazy, one way around this could be: say we have two datasets already classified separately into OTUs. Could the algorithm look at the ratios in both datasets and predict that what is called OTU1 in dataset one is most likely called OTU4 in the other? One thing that would be really cool from a biological standpoint is if OTU1 and OTU4 from the above example had different sequences but the algorithm says they could be the same thing. This may suggest a similar ecological role for both OTUs.

Sorry if this is really confusing, I was thinking as I was typing. I'll try to clean this up as it gets clearer in my mind.
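
Just to make the crazy idea concrete, here is a very rough numerical heuristic, not a proposal: build a crude per-OTU profile (mean relative abundance and prevalence) in each dataset and pair OTUs up by minimizing the total profile distance. The profile choice is arbitrary, and comparing the actual sequences would obviously be the more principled route.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def otu_profile(table):
    """Crude per-OTU profile from a samples-by-OTUs abundance table:
    mean relative abundance and prevalence across samples."""
    rel = table / table.sum(axis=1, keepdims=True)
    return np.column_stack([rel.mean(axis=0), (table > 0).mean(axis=0)])

def match_otus(table_a, table_b):
    """Propose a one-to-one pairing of OTUs between two independently
    classified datasets by minimizing the total profile distance."""
    cost = cdist(otu_profile(table_a), otu_profile(table_b))
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))  # (index in A, index in B)
```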

mothur-westcott commented 12 years ago

The output from the classify.otu command might be helpful with this idea, or maybe the new create.database command.

kdiverson commented 12 years ago

I talked to Pat about this and the idea is to train the algorithm on a dataset, pasture for example, and then run the classifier on the same data again (hopefully having it classify as pasture). What we want to get out of it is "what features were important in making the classification?" These features would then be characteristic of pasture, and that's what ecologists want to know. So the goal from a biological standpoint is not to predict for unknown samples, because you'll know where the sample came from. The idea is that if we train the algorithm to make these kinds of predictions, we want to know what features are most important in making these predictions. The classification is just a validation that the features used were correct.

One other example for clarity: say we have a bunch of samples from a cancerous microbiome. We then randomly split that data in half and use the first half for training. Then we attempt to classify the second half. If the algorithm correctly identifies the second half of the dataset as cancerous then we will know that it has identified features characteristic of a cancerous microbiome. If it doesn't then the features it found in the first dataset are not the correct ones and it will need to re-train until it is able to find the correct features. So we're not interested in the classification itself, we're more interested in the list of features that were used to make the classification. Does this make sense?
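
As a sketch of that workflow, with made-up data and scikit-learn standing in for whatever we end up implementing: split the labelled samples in half, train on one half, check that the other half classifies correctly, and then read off which features the forest leaned on.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Made-up stand-in data: 40 samples x 100 OTUs, half labelled "cancerous".
rng = np.random.default_rng(0)
X = rng.poisson(5, size=(40, 100)).astype(float)
y = np.array(["cancerous"] * 20 + ["healthy"] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X_train, y_train)

# The held-out accuracy only validates the model; the ranked importances
# are the output we actually care about (which OTUs drove the call).
print("held-out accuracy:", forest.score(X_test, y_test))
top = np.argsort(forest.feature_importances_)[::-1][:10]
print("top OTU indices by importance:", top)
```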

azmfaridee commented 12 years ago

So we're not interested in the classification itself, we're more interested in the list of features that were used to make the classification. Does this make sense?

Random Forest has a concept of Variable Importance, where we find the relative importance of each variable in the decision-making process. For example, if we have 10 features but only 5 of them actually contribute to the decisions, those 5 will be identified by placing the features in different parts of the decision trees and measuring whether the score is perturbed. If it is, the variable is important.
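
A hand-rolled sketch of that "perturb and measure" idea (this is the permutation flavour of importance, not Breiman's exact out-of-bag procedure, and it assumes any fitted estimator with a `.score` method, e.g. a scikit-learn forest):

```python
import numpy as np

def permutation_importance_sketch(model, X, y, n_repeats=10, seed=0):
    """Shuffle one feature column at a time and record how much the model's
    score drops. A large drop means the model was relying on that feature.
    Usage with a fitted forest: drops = permutation_importance_sketch(forest, X, y)
    """
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break feature/label link
            drops[j] += baseline - model.score(X_perm, y)
    return drops / n_repeats  # higher = more important
```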

Would this help? I mentioned two methods related to this issue in the application, they are:

Let me know if this is what you are looking for. I'll give this more thought later when my head is clearer; I just got up from sleep :)

azmfaridee commented 12 years ago

One other example for clarity: say we have a bunch of samples from a cancerous microbiome. We then randomly split that data in half and use the first half for training. Then we attempt to classify the second half. If the algorithm correctly identifies the second half of the dataset as cancerous then we will know that it has identified features characteristic of a cancerous microbiome. If it doesn't then the features it found in the first dataset are not the correct ones and it will need to re-train until it is able to find the correct features.

@kdiverson @mothur-westcott I've been thinking a bit more about the issue. If our problem is basically identifying the important features rather than predicting outcomes for new data, then I think we have been addressing the problem in the wrong way. Although Feature Selection is a sub-problem in the Machine Learning domain, it's a bit different from the typical Classification/Regression problems that we've been addressing so far.

Take a look at the Wikipedia page and let me know: http://en.wikipedia.org/wiki/Feature_selection

I might be entirely wrong, so feel free to correct me. This is very important, as we might need to change our game plan entirely depending on what we REALLY want from our program.

kdiverson commented 12 years ago

If we're going to go with feature selection, here's a link to a paper on feature selection via regularized random forest [0]. This is one possible algorithm for feature selection. There is also an R package that implements it [1].

[0] http://enpub.fulton.asu.edu/hdeng3/FSRegularizedTrees.pdf
[1] http://cran.r-project.org/web/packages/RRF/index.html

azmfaridee commented 12 years ago

If we're going to go with feature selection, here's a link to a paper on feature selection via regularized random forest [0]. This is one possible algorithm for feature selection. There is also an R package that implements it [1].

[0] http://enpub.fulton.asu.edu/hdeng3/FSRegularizedTrees.pdf
[1] http://cran.r-project.org/web/packages/RRF/index.html

I noticed this one in the morning while digging through the Wikipedia entry. It looks interesting, and it extends the idea of Variable Importance in vanilla Random Forest. While this would be the easiest algorithm for us to adopt because of the obvious similarity, I'd like to know more about the alternatives. It would be nice if we came across a survey paper that compares the different approaches to Feature Selection.
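
The core of the Deng & Runger paper is a regularized gain: when a tree considers a split, features not yet in the selected set have their gain down-weighted by a factor lambda, so a new feature only enters if it clearly beats the ones already in use. A toy illustration of just that criterion follows (the median-threshold split and Gini impurity here are simplifications of mine, not the paper's implementation):

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gain(x_col, y):
    """Impurity decrease for a simple median-threshold split on one feature."""
    left = x_col <= np.median(x_col)
    if left.all() or (~left).all():
        return 0.0
    w = left.mean()
    return gini(y) - w * gini(y[left]) - (1.0 - w) * gini(y[~left])

def regularized_choice(X, y, selected, lam=0.8):
    """RRF-style split choice: a feature outside `selected` only wins if its
    gain, down-weighted by lam, still beats every already-used feature."""
    scores = [gain(X[:, j], y) * (1.0 if j in selected else lam)
              for j in range(X.shape[1])]
    best = int(np.argmax(scores))
    selected.add(best)  # the chosen feature joins the selected set
    return best, selected

# Hypothetical usage: calling regularized_choice at each tree node keeps the
# selected set small, and that set is the feature-selection output.
```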

kdiverson commented 12 years ago

I think there are good parallels between the feature selection we're trying to do and feature selection in microarray data, specifically the high dimensionality and small sample size. We may want to look into what's in the literature on microarray feature selection.

azmfaridee commented 12 years ago

I think there are good parallels between the feature selection we're trying to do and feature selection in microarray data, specifically the high dimensionality and small sample size. We may want to look into what's in the literature on microarray feature selection.

@kdiverson Created Issue #11 for this. The first article has a section on Feature Selection for Microarray Analysis, which could be helpful.
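
For comparison with those microarray-style methods, the classic univariate filter baseline is easy to prototype. A minimal sketch with scikit-learn's ANOVA-F ranking on made-up high-dimensional, small-sample data (again, not mothur code, and not an endorsement over embedded methods like RRF):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Made-up microarray-like table: 20 samples, 1000 OTUs/genes, two classes.
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 1000))
y = np.array([0] * 10 + [1] * 10)

# Univariate ANOVA-F filter: score each feature independently, keep the top k.
selector = SelectKBest(f_classif, k=25).fit(X, y)
top_features = np.flatnonzero(selector.get_support())
print("selected feature indices:", top_features[:10])
```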