azmfaridee / mothur

This is the GSoC 2012 fork of 'Mothur'. We are trying to implement a number of 'Feature Selection' algorithms for microbial ecology data and incorporate them into mothur's main codebase.
https://github.com/mothur/mothur
GNU General Public License v3.0

Implement classification #40

Open · kdiverson opened this issue 11 years ago

kdiverson commented 11 years ago

We focused mostly on feature selection, but it would be nice to have a straight classification option. An example I was thinking of: say we have 4 groups in the shared file. Can we train on two of them and then see what the other two classify as? So if we have samples from persons A, B, C, and D, and we train on persons A and B, can we classify C and D to see if they are more like A, B, or neither?

kdiverson commented 11 years ago

@rafi-kamal would you be interested in implementing this?

rafi-kamal commented 11 years ago

Yeah, I'm interested. But I have exams 3 weeks from now, so it might take some time. How about an SVM implementation?

kdiverson commented 11 years ago

Great. We have someone working on SVM right now. I think random forest classification might be a good starting place since the algo is already there. Right now all classify.shared does is output feature selection. It would be nice if we could implement a way to specify certain groups in the shared file as training data and test data and then actually do the classification on the test data with random forest.

rafi-kamal commented 11 years ago

I think I can do this :) I will start working after the exams. Before that I will try to learn more about random forest and explore the code base.

rafi-kamal commented 11 years ago

I have some free time now, so I'm ready to start working on classification.

rafi-kamal commented 11 years ago

@kdiverson can you please explain the example a little bit? Do you want to measure the similarity between different classes? When training on the samples, do I have the design file, or only the shared file?

kdiverson commented 11 years ago

The example I was thinking of is, say we have groups A, B, C, D, E in the design file. It would be really great if the user could issue a command like classify.shared(train="A,B", test="C, D, E", ...) which would train the algo on groups A and B and then try to classify the test data as either A or B. I'm thinking of ways to answer the question "are the other groups more similar to A or B?" It would also be nice to have the option to have it just try and classify all the groups to see how well they separate from each other. The proximity matrix would be helpful for this.
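For concreteness, an invocation under that proposed syntax might look something like this (the file names and every parameter besides train/test are just placeholders, not settled syntax):

```
mothur > classify.shared(shared=example.shared, design=example.design, train="A,B", test="C,D,E")
```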

This also means we'll have to alter the way classify.shared outputs data. It would be nice to show a confusion matrix and have the option to write out the proximity matrix, which could then be plotted. Right now, all that's output is the feature rank. It would be great to have some way to output the classification as well. I think the confusion matrix is a good way to go. Have a look at the way R outputs confusion matrices for an example. There's a good example here: http://mkseo.pe.kr/stats/?p=220
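For illustration, an R-style confusion matrix for two training groups might look like this (the counts are invented, just to show the shape of the output):

```
          predicted
actual     A    B   class.error
    A     47    3          0.06
    B      4   46          0.08
```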

rafi-kamal commented 11 years ago

Thanks for the example, I got it.

rafi-kamal commented 11 years ago

@kdiverson if only the confusion matrix is needed, I think this can be done in a very simple way. As far as I've understood, what the R package does is:

  1. Train the algorithm on samples from all classes
  2. For each pair (i, j) of classes, count how many samples in class i have been misclassified as class j

We already have 1; implementing 2 will be an easy task (a sketch is below).
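A minimal sketch of step 2 in C++, assuming we already have each test sample's true and predicted class indices (the function and variable names here are mine, not mothur's):

```cpp
#include <vector>

// Build an n x n confusion matrix: matrix[i][j] counts the samples whose
// true class is i but which the classifier predicted as class j.
std::vector<std::vector<int> > buildConfusionMatrix(const std::vector<int>& actual,
                                                    const std::vector<int>& predicted,
                                                    int numClasses) {
    std::vector<std::vector<int> > matrix(numClasses, std::vector<int>(numClasses, 0));
    for (size_t k = 0; k < actual.size(); k++) {
        matrix[actual[k]][predicted[k]]++;
    }
    return matrix;
}
```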
kdiverson commented 11 years ago

@rafi-kamal you got it. Here's a basic example: https://en.wikipedia.org/wiki/Confusion_matrix

rafi-kamal commented 11 years ago

Here is my implementation: #43

rafi-kamal commented 11 years ago

@kdiverson and @darthxaher, I couldn't find any resource that explains in depth how to generate a predictive model using random forest and use it for classification, so I came up with an idea. Can you tell me if I'm on the right track?

Let me explain. Say we need to do this: classify.shared(train="A,B", test="C, D, E", ...)

  1. Train the algorithm on samples from classes A and B
  2. Generate a predictive model during the training process
  3. Classify the samples from classes C, D, and E using that predictive model

We already have 1; we need 2 and 3. Now, what the Berkeley article on random forests says is:

To classify a new object from an input vector, put the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest).
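In code, that voting step might look roughly like this (the Tree interface is a stand-in I made up; mothur's actual tree classes differ):

```cpp
#include <vector>

// Stand-in interface: a trained tree that can assign a class to one sample.
struct Tree {
    virtual ~Tree() {}
    virtual int classify(const std::vector<int>& sample) const = 0;
};

// Run the sample down every tree, tally one vote per tree, and return the
// class with the most votes.
int classifyWithForest(const std::vector<Tree*>& forest,
                       const std::vector<int>& sample,
                       int numClasses) {
    std::vector<int> votes(numClasses, 0);
    for (size_t t = 0; t < forest.size(); t++) {
        votes[forest[t]->classify(sample)]++;
    }
    int bestClass = 0;
    for (int c = 1; c < numClasses; c++) {
        if (votes[c] > votes[bestClass]) { bestClass = c; }
    }
    return bestClass;
}
```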

So it seems to me that, while growing the trees, we need to save the following information for each node of the tree:

  1. The feature to split on (RFTreeNode::splitFeatureIndex)
  2. The value where we are going to make that split (RFTreeNode::splitFeatureValue)
  3. If the node is a leaf node, its output class (RFTreeNode::outputClass)

This information comprises the predictive model. When classifying the test data (C, D, E), we will use the model to rebuild the forest and put each test sample down every tree in the forest, as the Berkeley article says. I've sketched the per-tree step below.
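Here is how I imagine classifying one sample with those three saved fields (the node struct below is a simplified stand-in; the real RFTreeNode carries more state than this):

```cpp
#include <vector>

// Simplified stand-in for the saved node state.
struct RFTreeNode {
    bool isLeaf;
    int splitFeatureIndex;   // 1. the feature to split on
    int splitFeatureValue;   // 2. the value where we make that split
    int outputClass;         // 3. the output class, valid only at leaves
    RFTreeNode* leftChild;   // samples with feature value <= split value
    RFTreeNode* rightChild;  // samples with feature value >  split value
};

// Walk one test sample down a rebuilt tree until it reaches a leaf,
// then report that leaf's class.
int evaluateSample(const RFTreeNode* node, const std::vector<int>& sample) {
    while (!node->isLeaf) {
        if (sample[node->splitFeatureIndex] <= node->splitFeatureValue) {
            node = node->leftChild;
        } else {
            node = node->rightChild;
        }
    }
    return node->outputClass;
}
```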

Is this the right way for classification?