EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

feature mismatch #1216

Open ud2195 opened 3 years ago

ud2195 commented 3 years ago

I used tpot==0.11.7 to train my model.

Code flow:

from tpot import TPOTClassifier
from sklearn.metrics import classification_report

pipeline_optimizer = TPOTClassifier(generations=5, population_size=50, cv=5,
                                    random_state=42, verbosity=2, scoring='f1')

pipeline_optimizer.fit(X_train, y_train)

predictions = pipeline_optimizer.predict(X_test)
print(classification_report(y_test, predictions))

# Extract the best model
exctracted_best_model = pipeline_optimizer.fitted_pipeline_.steps[-1][1]

Reference: https://stackoverflow.com/questions/57369927/getting-feature-importances-after-getting-optimal-tpot-pipeline

Now, when I run this exctracted_best_model on X_test with test_results = exctracted_best_model.predict(X_test), it says:

ValueError: Number of features of the model must match the input. Model n_features is 418 and input n_features is 417

X_test only has 417 features, so how can the model have been trained on more than that? Is exctracted_best_model = pipeline_optimizer.fitted_pipeline_.steps[-1][1] not the right way to get the best model from TPOT? Can I dump pipeline_optimizer directly as a pickle object to keep the best model found so far?

Also, how do you recreate the pipeline below in sklearn?

Best pipeline: DecisionTreeClassifier(LinearSVC(input_matrix, C=10.0, dual=False, loss=squared_hinge, penalty=l1, tol=0.1), criterion=gini, max_depth=10, min_samples_leaf=14, min_samples_split=7)

rachitk commented 3 years ago

Hello @ud2195,

Your issue may be similar to the one previously reported at #738, where someone attempted something similar. The information there may help you.

Is the full pipeline that is exported by TPOT when this error is thrown the same one you have at the end of your issue?

The code you have seems to pull the final, fitted operator in your pipeline and tries to have it predict on the input X_test (which presumably has the same number of features as X_train, though I would check and make sure X_train and X_test are the same shape).

TPOT sometimes includes operators that can stack the output of other classifiers/operators as an additional feature on top of the original input (so while X_train may have 417 features, the eventual input into the final operator of your pipeline might have more features if additional features are stacked by earlier operators).
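To make the 417 vs. 418 count concrete, here is a minimal sketch of what such a stacking step does to the shape of the data. The random data and the manual np.hstack are assumptions for illustration only; this is not TPOT's own code (I believe the exported file uses TPOT's StackingEstimator for this step, but the sketch does not depend on it).

```python
import numpy as np
from sklearn.svm import LinearSVC

# Synthetic stand-in for X_train / y_train: 417 features, binary target.
rng = np.random.RandomState(42)
X_train = rng.normal(size=(200, 417))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

# Same LinearSVC settings as in the "Best pipeline" line of the question.
svc = LinearSVC(C=10.0, dual=False, loss="squared_hinge", penalty="l1", tol=0.1)
svc.fit(X_train, y_train)

# Stacking: place the classifier's prediction next to the original columns.
stacked = np.hstack([svc.predict(X_train).reshape(-1, 1), X_train])

print(X_train.shape)  # (200, 417)
print(stacked.shape)  # (200, 418)
```

Whatever is trained on `stacked` expects 418 columns, which is why the final operator alone rejects a 417-column X_test.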

To recreate the pipeline in sklearn, you can export the pipeline by doing

pipeline_optimizer.export("exported_pipeline.py")

You can use the code found in this file to build and refit the best pipeline for your data, which may help with the problem that you're having.

As TPOT uses dynamic classes for its strongly-typed DEAP implementation, the TPOT object is not pickleable. As a workaround, you can pickle the fitted pipeline (pipeline_optimizer.fitted_pipeline_ in your code).
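A rough sketch of that workaround is below. The small scaler-plus-tree pipeline and the random data are placeholders standing in for pipeline_optimizer.fitted_pipeline_ and your data, and the round trip is done in memory rather than through a file.

```python
import pickle

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Placeholder data and pipeline; in the real case this would be
# pipeline_optimizer.fitted_pipeline_ after calling fit().
rng = np.random.RandomState(42)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)
fitted_pipeline = make_pipeline(
    StandardScaler(), DecisionTreeClassifier(random_state=42)
).fit(X, y)

# Serialize the whole pipeline (every step, not only the last one).
payload = pickle.dumps(fitted_pipeline)

# Restore it and check that it predicts the same as the original.
restored = pickle.loads(payload)
same = np.array_equal(restored.predict(X), fitted_pipeline.predict(X))
print(same)  # True
```

To write to disk instead, pickle.dump(fitted_pipeline, f) inside a with open(path, "wb") as f: block does the same job. If the pipeline contains TPOT-specific steps, tpot needs to be importable in the environment that loads the pickle.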

ud2195 commented 3 years ago

@rachitk - Actually, I didn't export the pipeline before closing the problem, since I don't work that much with sklearn pipelines. However, I think exctracted_best_model = pipeline_optimizer.fitted_pipeline_.steps[-1][1] is not the right way to get the best model and pickle it.

As you suggested, and after more googling, I'll just do:

import pickle

exctracted_best_model = pipeline_optimizer.fitted_pipeline_
pickle.dump(exctracted_best_model, open('/xyz/model', 'wb'))

And I think you're right in saying "TPOT sometimes includes operators that can stack the output of other classifiers/operators as an additional feature on top of the original input (so while X_train may have 417 features, the eventual input into the final operator of your pipeline might have more features if additional features are stacked by earlier operators)."

I believe ^ is what the best pipeline I mentioned in the question implies. TPOT is somehow using LinearSVC, stacking its output as a feature, and ultimately training DecisionTreeClassifier over it. Please correct me if I am wrong here.

rachitk commented 3 years ago

"I believe ^ is what the best pipeline I mentioned in the question implies. TPOT is somehow using LinearSVC, stacking its output as a feature, and ultimately training DecisionTreeClassifier over it. Please correct me if I am wrong here."

Your interpretation is my understanding as well. TPOT will use a special operator (not always obvious from the base TPOT output, but is usually clearer in the exported Python file) that will append the outputs of classifiers that are not the final classifier in the pipeline to the original input matrix to allow for nonlinear tree generation and for classifiers to be included elsewhere in the pipeline besides at the very end. That is likely what happened here if the pipeline at the end of your issue is the same one causing the error you see (the output of LinearSVC was appended to the original input matrix as a synthetic feature for DecisionTreeClassifier to incorporate).

Doing exctracted_best_model = pipeline_optimizer.fitted_pipeline_.steps[-1][1] only pulls out the final operator within the pipeline, rather than the entire best pipeline itself (that would be just pipeline_optimizer.fitted_pipeline_).
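The difference can be reproduced with a small sketch. AppendPrediction here is a hypothetical stand-in written for this example, not TPOT's operator, and the data is random; the hyperparameters are copied from the "Best pipeline" line in the issue. It assumes a scikit-learn version recent enough to expose n_features_in_.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier


class AppendPrediction(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for a stacking operator: fits an estimator
    and adds its predictions to the input as one extra column."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def transform(self, X):
        pred = self.estimator_.predict(X).reshape(-1, 1)
        return np.hstack([pred, X])


# Random data with 417 features, as in the issue.
rng = np.random.RandomState(0)
X = rng.normal(size=(120, 417))
y = (X[:, 0] > 0).astype(int)

pipeline = make_pipeline(
    AppendPrediction(LinearSVC(C=10.0, dual=False, penalty="l1", tol=0.1)),
    DecisionTreeClassifier(criterion="gini", max_depth=10, min_samples_leaf=14,
                           min_samples_split=7, random_state=42),
)
pipeline.fit(X, y)

# The whole pipeline accepts the original 417 columns.
full_predictions = pipeline.predict(X)

# The last step alone was trained on 418 columns and rejects the raw input.
final_step = pipeline.steps[-1][1]
print(final_step.n_features_in_)  # 418
try:
    final_step.predict(X)
    mismatch = None
except ValueError as err:
    mismatch = str(err)
print(mismatch)
```

So predict (and pickling) should go through the whole pipeline object, and steps[-1][1] should only be used when you feed it data that has already passed through the earlier steps.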

By only selecting the last operator in the sklearn pipeline, you're skipping the steps that would add that feature to the original input matrix, causing a feature mismatch. To fix this, you can try to evaluate feature performance for the entire pipeline (rather than the last operator) or try an alternative that was mentioned in #738 - using permutation importance via eli5 (https://eli5.readthedocs.io/en/latest/blackbox/permutation_importance.html)
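As a sketch of the whole-pipeline approach, the example below uses scikit-learn's own permutation_importance (from sklearn.inspection) rather than eli5, which is a substitution on my part; the idea is the same. The PCA-plus-tree pipeline and the random data are placeholders for your fitted pipeline and X_test / y_test.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.inspection import permutation_importance
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Placeholder data: 6 original features, binary target.
rng = np.random.RandomState(42)
X = rng.normal(size=(300, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Placeholder pipeline whose last step sees a different number of
# columns (3) than the original input (6).
pipeline = make_pipeline(
    PCA(n_components=3, random_state=42),
    DecisionTreeClassifier(max_depth=5, random_state=42),
).fit(X, y)

# Shuffle each ORIGINAL column in turn and measure the drop in F1
# for the whole pipeline.
result = permutation_importance(
    pipeline, X, y, scoring="f1", n_repeats=5, random_state=42
)

print(result.importances_mean.shape)                     # (6,)
print(pipeline.steps[-1][1].feature_importances_.shape)  # (3,)
```

The permutation scores line up with the columns of the original input, whereas feature_importances_ on the last step refers to whatever columns that step received, which may not match the original features one to one.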