accord-net / framework

Machine learning, computer vision, statistics and general scientific computing for .NET
http://accord-framework.net
GNU Lesser General Public License v2.1

Selecting best subset(s) of all features for learning #836

Open ConductedClever opened 7 years ago

ConductedClever commented 7 years ago

Issue description

Hi. I need to figure out which of my features are best to use (for example, with the NaiveBayesLearning algorithm). I am thinking of evaluating many (maybe all) subsets of the features with k-fold cross-validation to see which subset performs best. One option is to leave one feature out at a time; another is to check every subset.

Does Accord.NET implement this feature internally?

Thanks in advance.

cesarsouza commented 7 years ago

Hi @ConductedClever,

Many thanks for opening the issue! The framework does not offer feature selection in the way you describe, but you can achieve it in a slightly different (and more efficient) way with L1-regularized logistic support vector machines. The L1 penalty tends to drive the weights of uninformative features to zero, so the features left with non-zero weights are the selected ones. For an example, please take a look at the feature selection sample application, where features are selected using L1-regularized logistic SVMs.

Otherwise, if you would really like to perform feature selection in the way you have just described, it might be possible to generate all feature combinations using Accord.NET's Combinatorics.Combinations method coupled with k-fold cross validation as you have just mentioned. If you would like to proceed this way, please consider sending us a pull request with this functionality if it works for you! :-)

Regards, Cesar

ConductedClever commented 7 years ago

Thanks @cesarsouza,

I think it will be useful to try both of the solutions you mentioned (although, as you said, L1-regularized logistic support vector machines should be the better option).

I took a look at Accord.NET's Combinatorics.Combinations method and found it very useful for this purpose, but I have one more question about it. If I apply this method to the input data, how do I know which features each combination contains?

A simple solution that comes to mind is to add the feature titles as the first row of the matrix before generating the combinations, and then remove them before learning, but maybe better ways exist.

Thanks a lot.

cesarsouza commented 7 years ago

Hi @ConductedClever,

The Combinations method accepts a generic argument, meaning it should work with a list of anything you pass to it. You can pass a vector of integers and interpret each integer as the index of the column that contains the feature, or you could pass strings with the names of the features, and so on.
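
To illustrate the index-based approach, here is a minimal sketch in plain C# that maps a subset of column indices back to feature names. The FeatureLabels class, the Describe method and the feature names in the usage note are hypothetical and not part of Accord.NET; the only assumption is that names[i] labels column i of the data matrix.

```csharp
using System;
using System.Linq;

static class FeatureLabels
{
    // Hypothetical helper (not an Accord.NET API): turns column indices
    // into a readable label, where names[i] is the name of column i.
    public static string Describe(string[] names, int[] featureIndices)
    {
        return string.Join(", ", featureIndices.Select(i => names[i]));
    }
}
```

For example, FeatureLabels.Describe(new[] { "age", "height", "weight", "income" }, new[] { 0, 2, 3 }) returns "age, weight, income". Keeping the names in a separate array means the data matrix itself stays purely numeric, so no title row has to be added and removed.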

Also, since we are talking about selecting the best features, the order of the features should not matter. As such, it may be better to use the Combinatorics.Subsets method, which should be able to give you all possible subsets of your set of features.

Regards, Cesar

cesarsouza commented 7 years ago

An example would be:

// Let's say you have 4 features in total and you would like to consider
// all possible combinations without repetition and where order does
// not matter of those features:

double[][] matrix = // ... this is a N x 4 matrix where N is the number of samples in your training problem

// Let's consider all 4 possible features:
ISet<int> set = new HashSet<int> { 0, 1, 2, 3 }; 

// The number of possible subsets might be too large
// to fit in memory. For this reason, we can compute
// them on-the-fly using foreach:

foreach (SortedSet<int> subset in Combinatorics.Subsets(set))
{
     int[] featureIndices = subset.ToArray();
     double[][] subMatrix = matrix.Get(featureIndices);

     // Learn using subMatrix ...
}
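
The "Learn using subMatrix" step could be filled in along the following lines. This is only a sketch under stated assumptions: it is written in plain C# so that no library signature is assumed, the subsets are enumerated with a bitmask rather than with Combinatorics.Subsets, the columns are selected by hand, and the score callback is a hypothetical placeholder for whatever evaluation you choose (for example, mean k-fold cross-validation accuracy). The SubsetSearch class and its method names are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SubsetSearch
{
    // Keeps only the requested columns of a jagged N x D matrix.
    public static double[][] SelectColumns(double[][] matrix, int[] columns)
    {
        return matrix.Select(row => columns.Select(c => row[c]).ToArray()).ToArray();
    }

    // Lazily enumerates every non-empty subset of {0, ..., count - 1}.
    // Uses an int bitmask, so count must be below 31.
    public static IEnumerable<int[]> NonEmptySubsets(int count)
    {
        for (int mask = 1; mask < (1 << count); mask++)
            yield return Enumerable.Range(0, count)
                                   .Where(i => (mask & (1 << i)) != 0)
                                   .ToArray();
    }

    // Returns the subset whose reduced matrix gets the highest score.
    // 'score' is a hypothetical callback, e.g. mean k-fold accuracy.
    public static int[] Best(double[][] matrix, int featureCount,
                             Func<double[][], double> score)
    {
        int[] best = null;
        double bestScore = double.NegativeInfinity;

        foreach (int[] subset in NonEmptySubsets(featureCount))
        {
            double s = score(SelectColumns(matrix, subset));
            if (s > bestScore)
            {
                bestScore = s;
                best = subset;
            }
        }
        return best;
    }
}
```

With 4 features this evaluates 2^4 - 1 = 15 subsets, but the count doubles with every added feature (20 features already give 1,048,575 subsets, each needing a full cross-validation run), so exhaustive search is only practical for small feature sets.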

ConductedClever commented 7 years ago

Hi @cesarsouza,

Your pointer to the Combinatorics.Subsets method is very useful.

Your solution with the example is also very good. I didn't know the Get method existed, so I find it very helpful.

Thanks a lot.