EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

Using the tpot object for prediction #67

Closed · kadarakos closed this issue 8 years ago

kadarakos commented 8 years ago

Error with .predict for iris example

from tpot import TPOT
from sklearn.datasets import load_iris
from sklearn.cross_validation import train_test_split

digits = load_iris()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

tpot = TPOT(generations=10)
tpot.fit(X_train, y_train)
print(tpot.score(X_train, y_train, X_test, y_test))

But when I try to use the pipeline as a predictor

tpot.predict(X_train, y_train, X_test)

this is the error I get (IPython traceback):

TypeError                                 Traceback (most recent call last)
<ipython-input-8-74abe9ee292a> in <module>()
----> 1 tpot.predict(X_train, y_train, X_test)

/usr/local/lib/python2.7/dist-packages/tpot/tpot.pyc in predict(self, training_features, training_classes, testing_features)
    290 
    291         result = func(training_testing_data)
--> 292         return result[result['group'] == 'testing', 'guess'].values
    293 
    294     def score(self, training_features, training_classes, testing_features, testing_classes):

/usr/lib/python2.7/dist-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1656             return self._getitem_multilevel(key)
   1657         else:
-> 1658             return self._getitem_column(key)
   1659 
   1660     def _getitem_column(self, key):

/usr/lib/python2.7/dist-packages/pandas/core/frame.pyc in _getitem_column(self, key)
   1663         # get column
   1664         if self.columns.is_unique:
-> 1665             return self._get_item_cache(key)
   1666 
   1667         # duplicate columns & possible reduce dimensionaility

/usr/lib/python2.7/dist-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
   1001     def _get_item_cache(self, item):
   1002         cache = self._item_cache
-> 1003         res = cache.get(item)
   1004         if res is None:
   1005             values = self._data.get(item)

/usr/lib/python2.7/dist-packages/pandas/core/generic.pyc in __hash__(self)
    623     def __hash__(self):
    624         raise TypeError('{0!r} objects are mutable, thus they cannot be'
--> 625                         ' hashed'.format(self.__class__.__name__))
    626 
    627     def __iter__(self):

TypeError: 'Series' objects are mutable, thus they cannot be hashed

Interpreting generated code

Running the iris example generated this piece of code

import numpy as np
import pandas as pd

from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR')
training_indeces, testing_indeces = next(iter(StratifiedShuffleSplit(tpot_data['class'].values, 
                                                                     n_iter=1, 
                                                                     train_size=0.75)))
result1 = tpot_data.copy()

# Perform classification with a decision tree classifier
dtc1 = DecisionTreeClassifier(max_features=min(83, len(result1.columns) - 1), max_depth=19)
dtc1.fit(result1.loc[training_indeces].drop('class', axis=1).values, result1.loc[training_indeces, 'class'].values)
result1['dtc1-classification'] = dtc1.predict(result1.drop('class', axis=1).values)

# Perform classification with a decision tree classifier
dtc2 = DecisionTreeClassifier(max_features='auto', max_depth=56)
dtc2.fit(result1.loc[training_indeces].drop('class', axis=1).values, result1.loc[training_indeces, 'class'].values)
result2 = result1
result2['dtc2-classification'] = dtc2.predict(result2.drop('class', axis=1).values)

I struggle a bit to understand the intended idea behind providing this result2 dataframe. There are two classification results in the above example, both from decision trees with different hyper-parameters, but how do they get combined?

rhiever commented 8 years ago

Hi @kadarakos!

Error with .predict for iris example

Please check whether the iris features and class labels are encoded numerically. This is likely the source of your error. We've raised issue #61 to address this problem in the near future.
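For anyone who does hit this with string labels, a minimal sketch of what "encoded numerically" means here (purely illustrative; TPOT does not do this conversion for you in this version):

import numpy as np
from sklearn.preprocessing import LabelEncoder

# Class labels given as strings ...
y_raw = np.array(['setosa', 'versicolor', 'virginica', 'setosa'])
# ... converted to integer codes before calling tpot.fit(X, y)
y = LabelEncoder().fit_transform(y_raw)
print(y)  # [0 1 2 0]

# The features should likewise be a purely numeric array, e.g. float64
X = np.array([[5.1, 3.5], [6.2, 2.9], [7.7, 3.0], [4.9, 3.1]], dtype=np.float64)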

Interpreting generated code

Happy to see feedback about the generated code! The following is occurring in the pipeline you posted:

1) The training features and class labels are used to train the first decision tree

2) The class predictions from the first decision tree are then added as a new feature in the training features

3) A second decision tree is then trained on the training features (+ the predictions from the first decision tree) and class labels

4) result2['dtc2-classification'] contains the final classifications from the pipeline. These values should correspond to what you see when you call .predict() on the TPOT object.
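To make steps 1-4 concrete, here is a rough equivalent of that data flow in plain scikit-learn (a sketch with illustrative variable names and random placeholder data, not TPOT's exported code):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Illustrative stand-ins for the training/testing splits used by TPOT
rng = np.random.RandomState(0)
X_train, y_train = rng.rand(100, 4), rng.randint(0, 3, 100)
X_test = rng.rand(25, 4)

# 1) Train the first decision tree on the original features
dtc1 = DecisionTreeClassifier(max_depth=19).fit(X_train, y_train)

# 2) Append its class predictions as a new feature column
X_train_aug = np.column_stack([X_train, dtc1.predict(X_train)])
X_test_aug = np.column_stack([X_test, dtc1.predict(X_test)])

# 3) Train the second decision tree on the augmented features
dtc2 = DecisionTreeClassifier(max_depth=56).fit(X_train_aug, y_train)

# 4) Its predictions are the pipeline's final classifications
final_predictions = dtc2.predict(X_test_aug)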

If you have thoughts on how to make the generated code clearer or easier to use, please let me know.

Best,

Randy

kadarakos commented 8 years ago

Hi @rhiever ,

Both iris features and classes are encoded as floats.

Your explanation makes it clear how to interpret the generated code. It makes me wonder, however, whether this is the best way to ensemble models. IMHO, using the VotingClassifier object would be a more standard and straightforward way of ensembling different classifiers, plus it provides some additional flexibility.
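For instance, a minimal VotingClassifier sketch (purely illustrative, not something TPOT generated):

from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Majority ("hard") voting over three heterogeneous classifiers
voting_clf = VotingClassifier(
    estimators=[('lr', LogisticRegression(max_iter=1000)),
                ('dt', DecisionTreeClassifier(max_depth=5)),
                ('knn', KNeighborsClassifier(n_neighbors=5))],
    voting='hard')
voting_clf.fit(X, y)
print(voting_clf.predict(X[:5]))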

rhiever commented 8 years ago

Ah, I see what happened. The predict function is missing the .loc call at the end:

return result[result['group'] == 'testing', 'guess'].values

should be

return result.loc[result['group'] == 'testing', 'guess'].values

This has already been fixed in the development version, but I haven't rolled it out to pip yet. I will do this soon!
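For anyone curious why the original line blows up, here is a tiny standalone reproduction (the toy result DataFrame is purely illustrative):

import pandas as pd

result = pd.DataFrame({'group': ['training', 'testing'],
                       'guess': [0, 1]})

# result[result['group'] == 'testing', 'guess']
# -> TypeError: 'Series' objects are mutable, thus they cannot be hashed
#    (the tuple (boolean Series, 'guess') is treated as a single column key)

print(result.loc[result['group'] == 'testing', 'guess'].values)  # [1]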

rhiever commented 8 years ago

wrt ensembles of classifiers: I agree 100%! This is also something we're working on in the near future -- adding a pipeline operator that pools classifications from multiple classifiers in different ways (majority vote, etc.).
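The core idea is simple enough that a rough sketch fits in a few lines (illustrative only, not the actual TPOT operator):

import numpy as np

# Rows = classifiers, columns = samples; pool per-sample predictions by majority vote
predictions = np.array([[0, 1, 2, 2],
                        [0, 1, 1, 2],
                        [1, 1, 2, 2]])
pooled = np.apply_along_axis(lambda votes: np.bincount(votes).argmax(),
                             axis=0, arr=predictions)
print(pooled)  # [0 1 2 2]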

kadarakos commented 8 years ago

Thanks for the quick reply!

I evolved another piece of code that scores 1.0 on the iris data set, which is pretty impressive. However, it did raise some questions.

import numpy as np
import pandas as pd

from sklearn.cross_validation import StratifiedShuffleSplit
from itertools import combinations
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier # ME IMPORTING

# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = pd.read_csv('iris.csv', sep=',')
tpot_data['class'] = digits['target'] # ME CHANGING THE STRINGS TO INTEGERS

training_indeces, testing_indeces = next(iter(StratifiedShuffleSplit(tpot_data['class'].values, n_iter=1, train_size=0.75)))

result1 = tpot_data.copy()

# Perform classification with a k-nearest neighbor classifier
knnc1 = KNeighborsClassifier(n_neighbors=min(13, len(training_indeces)))
knnc1.fit(result1.loc[training_indeces].drop('class', axis=1).values, result1.loc[training_indeces, 'class'].values)
result1['knnc1-classification'] = knnc1.predict(result1.drop('class', axis=1).values)

# Perform classification with a logistic regression classifier
lrc2 = LogisticRegression(C=2.75)
lrc2.fit(result1.loc[training_indeces].drop('class', axis=1).values, result1.loc[training_indeces, 'class'].values)
result2 = result1
result2['lrc2-classification'] = lrc2.predict(result2.drop('class', axis=1).values)

# Decision-tree based feature selection
training_features = result2.loc[training_indeces].drop('class', axis=1)
training_class_vals = result2.loc[training_indeces, 'class'].values

pair_scores = dict()
for features in combinations(training_features.columns.values, 2):
    print features
    dtc = DecisionTreeClassifier()
    training_feature_vals = training_features[list(features)].values
    dtc.fit(training_feature_vals, training_class_vals)
    pair_scores[features] = (dtc.score(training_feature_vals, training_class_vals), list(features))

best_pairs = []
print pair_scores
for pair in sorted(pair_scores, key=pair_scores.get, reverse=True)[:1070]:
    best_pairs.extend(list(pair))
best_pairs = sorted(list(set(best_pairs)))

result3 = result2[sorted(list(set(best_pairs + ['class'])))]

# Perform classification with a k-nearest neighbor classifier
knnc4 = KNeighborsClassifier(n_neighbors=min(6, len(training_indeces)))
knnc4.fit(result3.loc[training_indeces].drop('class', axis=1).values, result3.loc[training_indeces, 'class'].values)
result4 = result3
result4['knnc4-classification'] = knnc4.predict(result4.drop('class', axis=1).values)

A minor issue was that DecisionTreeClassifier wasn't imported for the feature selection. Apart from that, I was a bit surprised by the way the feature selection part was implemented. I believe it could be replaced with the shorter - and maybe more general - code snippet from the sklearn documentation:

from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

feature_clf = DecisionTreeClassifier()
feature_clf = feature_clf.fit(training_features, training_class_vals)
feature_select = SelectFromModel(feature_clf, prefit=True)
training_features_new = feature_select.transform(training_features)

Is it just me or would this be a bit more concise?

Best, Ákos

kadarakos commented 8 years ago

Actually, looking at the code a bit more closely, it seems to me that "result3" is just a column-sorted version of the original features:

result3 = result2[sorted(list(set(best_pairs + ['class'])))]

and then the kNN is fitted to this sorted data frame

knnc4.fit(result3.loc[training_indeces].drop('class', axis=1).values, result3.loc[training_indeces, 'class'].values)

so, as far as I understand, the feature selection was not actually performed. Running this piece of code

feature_clf = DecisionTreeClassifier()
feature_clf = feature_clf.fit(training_features, training_class_vals)
feature_select = SelectFromModel(feature_clf, prefit=True)
training_features_new = feature_select.transform(training_features)

actually shows - unsurprisingly - that the most informative features are the decisions of the previous classifiers.
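One quick way to confirm that (assuming feature_clf, feature_select, and training_features from the snippets above are still in scope):

import pandas as pd

# Rank columns by the decision tree's importance scores; the
# 'knnc1-classification' and 'lrc2-classification' columns dominate
importances = pd.Series(feature_clf.feature_importances_,
                        index=training_features.columns).sort_values(ascending=False)
print(importances)

# Boolean mask of the columns SelectFromModel would keep
print(feature_select.get_support())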

rhiever commented 8 years ago

That's exactly right. It seems the feature selection in this case was "junk code" that wasn't pruned by the optimization process: because the feature selection didn't actually do anything, it also wasn't optimized away. I'm currently working on code that selects against bloat like that.

In the most recent version, we've actually removed the decision tree-based feature selection entirely and replaced it with more standard feature selection operators from sklearn: RFE, variance threshold, and various forms of univariate feature selection. Hopefully that will be out soon. You can check it out on the development version in the meantime.
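For reference, those operators look roughly like this in isolation (an illustrative sketch of plain sklearn usage, not TPOT's exported code):

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SelectKBest, VarianceThreshold, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Recursive feature elimination driven by a linear model
X_rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2).fit_transform(X, y)

# Drop near-constant features
X_var = VarianceThreshold(threshold=0.2).fit_transform(X)

# Univariate selection of the k best features by ANOVA F-score
X_uni = SelectKBest(f_classif, k=2).fit_transform(X, y)

print(X_rfe.shape, X_var.shape, X_uni.shape)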