datadave / GADS9-NYC-Spring2014-Students

Student repo for Spring 2014 Data Science Course at GA NYC

itertools with more than a few features? #134

Closed omicron-theta closed 10 years ago

omicron-theta commented 10 years ago

I tried using itertools to generate all the different combinations of features that I could possibly throw into a classifier, and I ran into some scalability problems. With 33 features, running through them and creating combinations of length 1 to 33 yields roughly 8.6 billion sets of features. I ran a regression on each of the features individually and weeded out 9 of them, but the 24 remaining features still translate into ~17 million sets.

Just running itertools to find the sets of features crashed my computer, so I shudder to think what trying to fit a classifier on that list would do to my system. Is there a better way to go about picking which combination of features would yield the best results? Obviously I could explore graphs and tables of the datasets and omit features where there is no clear relationship, but I'd prefer a more systematic approach.
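
For scale, the number of nonempty feature subsets is 2**n - 1, so the counts can be sanity-checked without materializing a single combination. A minimal check (math.comb needs Python 3.8+):

from math import comb

# Nonempty subsets of an n-feature set: sum over k = 1..n of C(n, k) == 2**n - 1
for n in (24, 33):
    total = sum(comb(n, k) for k in range(1, n + 1))
    print(n, 'features ->', total, 'subsets')  # 24 -> 16,777,215; 33 -> 8,589,934,591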

podopie commented 10 years ago
  1. If you're doing Random Forests, generating feature combinations is unnecessary, because the forests do that for you automatically. (Learn how to work with feature importances in Random Forests here or in the labs I included for last night's class; there's also a quick sketch just below!)
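
A minimal sketch of reading those importances, assuming scikit-learn's RandomForestClassifier on the same iris data used further down:

from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier

iris = datasets.load_iris()
# Fit a forest, then inspect how much each feature contributed to the splits.
forest = RandomForestClassifier(n_estimators=100).fit(iris.data, iris.target)
for name, importance in zip(iris.feature_names, forest.feature_importances_):
    print(name, round(importance, 3))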

One other common approach, which is a bit more programmatic, is the drop-one approach:

  1. Run all features through the algorithm.
  2. Record the AUC and accuracy scores (your goal is to optimize for one of these!).
  3. Iteratively drop one feature at a time and refit. Append the AUC to one list and the accuracy to another, with the index representing the feature that was dropped.
  4. Once you've dropped each feature once, determine which model had the best AUC or accuracy. Make its features your new master list.
  5. Repeat steps 1-4 until you've optimized for whatever your goal is (AUC or accuracy).

You can do this with some recursive magic! And I bet you can come up with a much cleaner solution than what I wrote.

from sklearn import datasets
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score  # not used below; handy if you optimize for AUC instead of accuracy

iris = datasets.load_iris()

max_features = [0, 1, 2, 3]

def find_optimization(dataX, dataY, max_features):
    # Baseline: fit on the current feature set and record its accuracy.
    max_fit = MultinomialNB().fit(dataX[:, max_features], dataY)
    init_acc_score = max_fit.score(dataX[:, max_features], dataY)
    accs = []
    print('TESTING AROUND:', max_features, init_acc_score)
    if len(max_features) == 1:
        # Base case: only one feature left, so there's nothing more to drop.
        print('Best Features found!', max_features, init_acc_score)
    else:
        # Drop each feature in turn, refit, and keep the accuracy only if it beats the baseline.
        for i in range(len(max_features)):
            features = list(max_features)
            features.pop(i)
            clf = MultinomialNB().fit(dataX[:, features], dataY)
            acc = clf.score(dataX[:, features], dataY)
            print(features, acc)
            if acc > init_acc_score:
                accs.append(acc)
            else:
                accs.append(0)
        print(accs)
        # If no drop improved on the baseline (all entries equal), stop;
        # otherwise drop the feature whose removal helped most and recurse.
        if accs[1:] == accs[:-1]:
            print('Best Features found!', max_features)
        else:
            index_for_removed_feature = accs.index(max(accs))
            max_features.pop(index_for_removed_feature)
            find_optimization(dataX, dataY, max_features)

find_optimization(iris.data, iris.target, max_features)

podopie commented 10 years ago

@omicron-theta did this help at all? If so I can close this request. Thanks!

omicron-theta commented 10 years ago

Yes. I talked through the steps you provided with Joe on Saturday and it all makes sense. Thanks.