kdiverson opened 11 years ago
@rafi-kamal would you be interested in implementing this?
Yeah, I'm interested. But I have exams 3 weeks from now, so it might take some time. How about an SVM implementation?
Great. We have someone working on SVM right now. I think the random forest classification might be a good starting place since the algo is already there. Right now all classify.shared does is output feature selection. It would be nice if we could implement a way to specify certain groups in the shared file as training data and test data and then actually do the classification on the test data with random forest.
I think I can do this :) I will start working after the exams. Before that I will try to learn more about random forest and explore the code base.
I've some free time now, so I'm ready to start working on classification.
@kdiverson can you please explain the example a little bit? Do you want to measure the similarity between different classes? When training on the samples, do I have the design file, or only the shared file?
The example I was thinking of is, say we have groups A, B, C, D, E in the design file. It would be really great if the user could issue a command like classify.shared(train="A,B", test="C, D, E", ...) which would train the algo on groups A and B and then try to classify the test data as either A or B. I'm thinking of ways to answer the question "are the other groups more similar to A or B?" It would also be nice to have the option to have it just try and classify all the groups to see how well they separate from each other. The proximity matrix would be helpful for this.
This also means we'll have to alter the way classify.shared outputs data. It would be nice to show a confusion matrix and have the option to write out the proximity matrix, which could then be plotted. Right now, all that's output is the feature rank. It would be great to have some way to output the classification as well. I think the confusion matrix is a good way to go. Have a look at the way R outputs confusion matrices for an example. There's a good example here: http://mkseo.pe.kr/stats/?p=220
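On the proximity matrix: as I understand Breiman's definition, proximity(i, j) is just the fraction of trees in which samples i and j land in the same terminal node. Here's a rough sketch of what computing it could look like (computeProximity and leafOfSampleByTree are made-up names for illustration, not anything in mothur yet):

```cpp
#include <vector>

// Sketch only: prox[i][j] = fraction of trees in which samples i and j
// end up in the same terminal (leaf) node.
// leafOfSampleByTree[t][s] = index of the leaf that sample s reaches in tree t.
std::vector<std::vector<double>> computeProximity(
        const std::vector<std::vector<int>>& leafOfSampleByTree) {
    size_t numTrees = leafOfSampleByTree.size();
    size_t numSamples = numTrees > 0 ? leafOfSampleByTree[0].size() : 0;
    std::vector<std::vector<double>> prox(numSamples,
                                          std::vector<double>(numSamples, 0.0));
    // count, for every pair of samples, how many trees put them in the same leaf
    for (size_t t = 0; t < numTrees; t++)
        for (size_t i = 0; i < numSamples; i++)
            for (size_t j = 0; j < numSamples; j++)
                if (leafOfSampleByTree[t][i] == leafOfSampleByTree[t][j])
                    prox[i][j] += 1.0;
    // normalize counts to fractions
    for (auto& row : prox)
        for (double& p : row) p /= numTrees;
    return prox;
}
```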
Thanks for the example, I got it.
@kdiverson if only the confusion matrix is needed, I think this can be done in a very simple way. As far as I've understood, what the R package does is:

1. Classify each test sample.
2. For each pair (i, j) of classes, measure how many samples in class i have been misclassified as class j.

We already have 1, implementing 2 will be an easy task.

@rafi-kamal you got it. Here's a basic example: https://en.wikipedia.org/wiki/Confusion_matrix
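If it helps, a minimal sketch of step 2 (buildConfusionMatrix and its parameter names are illustrative, not existing mothur code): rows are actual classes, columns are predicted classes, so cell (i, j) counts samples of class i classified as class j.

```cpp
#include <vector>

// Sketch only: build a confusion matrix from actual vs. predicted labels.
// confusion[i][j] = number of samples of class i that were classified as class j.
std::vector<std::vector<int>> buildConfusionMatrix(const std::vector<int>& actual,
                                                   const std::vector<int>& predicted,
                                                   int numClasses) {
    std::vector<std::vector<int>> confusion(numClasses,
                                            std::vector<int>(numClasses, 0));
    for (size_t s = 0; s < actual.size(); s++) {
        confusion[actual[s]][predicted[s]]++;
    }
    return confusion;
}
```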
Here is my implementation: #43
@kdiverson and @darthxaher, I couldn't find any resource which explains in depth how to generate a predictive model using random forest and use it for classification, so I came up with an idea. Can you tell me if I'm going about this the right way?
Let me explain it. Say we need to do this:

classify.shared(train="A,B", test="C, D, E", ...)

1. Grow the forest with the training data (A, B).
2. Generate a predictive model from the forest.
3. Classify the test data (C, D, E) using that predictive model.

We already have 1, we need 2 and 3. Now, what the Berkeley article on random forest says is:
To classify a new object from an input vector, put the input vector down each of the trees in the forest. Each tree gives a classification, and we say the tree "votes" for that class. The forest chooses the classification having the most votes (over all the trees in the forest)
So it seems to me that, while growing the trees, we need to save the following information for each node of the tree:

- RFTreeNode::splitFeatureIndex (the index of the feature this node splits on)
- RFTreeNode::splitFeatureValue (the threshold value used for the split)
- RFTreeNode::outputClass (the predicted class, for leaf nodes)

This information comprises the predictive model. While classifying the test data (C, D, E), we will use this model to rebuild the forest and put each test sample down each of the trees in the forest, as the Berkeley article says.
Is this the right way for classification?
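To make the idea concrete, here is a rough sketch of how those three saved fields could drive classification (the struct layout, child pointers, and function names are my assumptions for illustration, not the actual mothur code):

```cpp
#include <map>
#include <vector>

// Assumed node layout: only the three fields discussed above come from
// the thread; the child pointers are an assumption.
struct RFTreeNode {
    int splitFeatureIndex;             // feature this node splits on
    double splitFeatureValue;          // threshold for the split
    int outputClass;                   // predicted class (leaf nodes only)
    RFTreeNode* leftChild = nullptr;   // taken when feature value <  threshold
    RFTreeNode* rightChild = nullptr;  // taken when feature value >= threshold
};

// Put one sample down one tree: follow the splits until a leaf is reached.
int classifyWithTree(const RFTreeNode* node, const std::vector<double>& sample) {
    while (node->leftChild != nullptr) {  // internal node
        node = (sample[node->splitFeatureIndex] < node->splitFeatureValue)
                   ? node->leftChild
                   : node->rightChild;
    }
    return node->outputClass;  // leaf reached
}

// Forest vote: each tree classifies the sample, the majority class wins.
int classifyWithForest(const std::vector<RFTreeNode*>& forest,
                       const std::vector<double>& sample) {
    std::map<int, int> votes;
    for (const RFTreeNode* tree : forest)
        votes[classifyWithTree(tree, sample)]++;
    int bestClass = -1, bestCount = 0;
    for (const auto& v : votes)
        if (v.second > bestCount) { bestClass = v.first; bestCount = v.second; }
    return bestClass;
}
```

Each test sample from C, D, and E would then go through classifyWithForest(), and the resulting predictions could also feed the confusion matrix discussed earlier.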
We focused mostly on feature selection but it would be nice to have a straight classification option. An example I was thinking of would be, say we have 4 groups in the shared file. Can we train on two of them and then see what the other two classify as? So if we have samples from persons A, B, C, and D, and we train on persons A and B, can we classify C and D to see if they are more like A, B, or neither?