Closed by omicron-theta 10 years ago
One other common way to do this a bit more programmatically is the drop-one approach:
You can do this with some recursive magic! And I bet you can come up with a much cleaner solution than what I wrote.
```python
from sklearn import datasets
from sklearn.naive_bayes import MultinomialNB

iris = datasets.load_iris()
max_features = [0, 1, 2, 3]

def find_optimization(dataX, dataY, max_features):
    # Baseline: fit and score on the current feature set.
    max_fit = MultinomialNB().fit(dataX[:, max_features], dataY)
    init_acc_score = max_fit.score(dataX[:, max_features], dataY)
    accs = []
    print('TESTING AROUND:', max_features, init_acc_score)
    if len(max_features) == 1:
        print('Best features found!', max_features, init_acc_score)
    else:
        # Drop each feature in turn and re-score.
        for i in range(len(max_features)):
            features = list(max_features)
            features.pop(i)
            clf = MultinomialNB().fit(dataX[:, features], dataY)
            acc = clf.score(dataX[:, features], dataY)
            print(features, acc)
            # Record the score only if dropping this feature improved on the baseline.
            accs.append(acc if acc > init_acc_score else 0)
        print(accs)
        if accs[1:] == accs[:-1]:
            # All entries equal (typically all zero): no single drop helped, so stop.
            print('Best features found!', max_features)
        else:
            # Remove the feature whose removal improved the score most, then recurse.
            index_for_removed_feature = accs.index(max(accs))
            max_features.pop(index_for_removed_feature)
            find_optimization(dataX, dataY, max_features)

find_optimization(iris.data, iris.target, max_features)
```
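For what it's worth, scikit-learn ships a built-in version of this drop-one idea: recursive feature elimination (`sklearn.feature_selection.RFE`). A minimal sketch on the same iris data; note I'm swapping in `LogisticRegression` as the base estimator (RFE needs `coef_` or `feature_importances_`), and `n_features_to_select=2` is just an illustrative choice:

```python
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()

# RFE repeatedly fits the estimator and drops the weakest feature
# until only n_features_to_select remain.
selector = RFE(LogisticRegression(max_iter=200), n_features_to_select=2)
selector.fit(iris.data, iris.target)

print(selector.support_)   # boolean mask of the kept features
print(selector.ranking_)   # 1 = selected; higher = eliminated earlier
```

Unlike the hand-rolled recursion above, RFE ranks features by the fitted model's coefficients rather than by held-out accuracy, but it scales the same way: linear in the number of features instead of exponential.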
@omicron-theta did this help at all? If so I can close this request. Thanks!
Yes. I talked through the steps you provided with Joe on Saturday and it all makes sense. Thanks.
I tried using itertools to generate all the different combinations of features that I could possibly throw into a classifier, and I ran into some scalability problems. With 33 features, running through combinations of length 1 to 33 yields roughly 8.6 billion sets of features. I ran a regression on each feature individually and weeded out 9 of them, but the 24 remaining features still translate into ~17 million sets.
Just running itertools to find the sets of features crashed my computer, so I shudder to think what trying to iterate a classifier over that list would do to my system. Is there a better way to go about picking which combination of features would yield the best results? Obviously I could explore graphs and tables of the dataset and omit features where there is no clear relationship, but I'd prefer a more systematic approach.
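For scale, the subset counts here can be checked without materializing anything: the number of non-empty subsets of an n-feature set is 2^n - 1, which is why brute-force enumeration blows up.

```python
from math import comb

# Sum of C(n, k) over k = 1..n equals 2**n - 1 (all non-empty subsets).
n = 33
total = sum(comb(n, k) for k in range(1, n + 1))
print(total)  # 8589934591, i.e. 2**33 - 1, roughly 8.6 billion
assert total == 2**n - 1
```

The same formula gives 2^24 - 1 ≈ 16.8 million subsets for the 24 surviving features, which matches the ~17 million figure above.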