EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0
9.68k stars 1.57k forks

Cannot reproduce pipeline results with sklearn pipeline #1289

Open DrRaja opened 1 year ago

DrRaja commented 1 year ago

For my data, I got the best pipeline by running TPOT training using the following parameters:

from tpot import TPOTClassifier

tpot = TPOTClassifier(generations=5,
                      population_size=100,
                      verbosity=2,
                      n_jobs=-1,
                      random_state=1)

The best pipeline was given as:

Best pipeline: MLPClassifier(GaussianNB(Binarizer(input_matrix, threshold=0.0)), alpha=0.001, learning_rate_init=0.001)
TPOTClassifier(generations=5, n_jobs=-1, random_state=1, verbosity=2)

The best CV score I achieved was 0.822.

Using the pipeline reported above, I built an ensemble with sklearn as follows:

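from sklearn.ensemble import StackingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

base_model = GaussianNB()

meta_model = MLPClassifier(random_state=1,
                           learning_rate_init=0.001,
                           alpha=0.001)

ensemble = StackingClassifier(estimators=[('base_model', base_model),
                                          ('meta_model', meta_model)],
                              final_estimator=meta_model,
                              n_jobs=-1)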

The score I get from this is 0.79.

Can you tell me why I am getting different scores when all my parameters are the same?

perib commented 1 year ago

The manual pipeline is not exactly identical to the TPOT output. It is missing the Binarizer step.

Also, TPOT wraps intermediate classifiers in a StackingEstimator, which passes its inputs through alongside the wrapped classifier's predictions (https://github.com/EpistasisLab/tpot/blob/master/tpot/builtins/stacking_estimator.py).
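To make the "pass-through" idea concrete, here is a minimal sketch of what the next step in the pipeline would receive. It uses synthetic data rather than the data from this issue, and the exact columns and their order (predicted label, then class probabilities, then the original features) are an assumption for illustration, not a statement of what TPOT's StackingEstimator emits.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (hypothetical; not the data from this issue).
X, y = make_classification(n_samples=200, n_features=6, random_state=1)

nb = GaussianNB().fit(X, y)

# Plain prediction: a single column of class labels; the features are gone.
preds_only = nb.predict(X).reshape(-1, 1)

# Pass-through stacking: the predictions (and, in this sketch, the class
# probabilities) are placed next to the original features, so the next
# estimator sees both.
stacked = np.hstack([preds_only, nb.predict_proba(X), X])

print(preds_only.shape)  # (200, 1)
print(stacked.shape)     # (200, 9): 1 label + 2 probabilities + 6 features
```

This is the main structural difference from sklearn's StackingClassifier, which by default (passthrough=False) feeds only the base estimators' outputs to the final estimator.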

Going off memory, I believe this is what the TPOT output would be equivalent to:

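from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Binarizer
from tpot.builtins import StackingEstimator

step1 = Binarizer(threshold=0.0)

base_model = StackingEstimator(estimator=GaussianNB())

meta_model = MLPClassifier(random_state=1,
                           learning_rate_init=0.001,
                           alpha=0.001)

# Pipeline takes a list of (name, step) pairs; the last step is the final estimator.
ensemble = Pipeline(steps=[('step1', step1),
                           ('base_model', base_model),
                           ('meta_model', meta_model)])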

The Binarizer transforms the data, so the flow is: input data -> Binarizer -> transformed data -> GaussianNB -> transformed data + predictions -> MLPClassifier
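The flow above can be sketched end to end without TPOT installed. `PassThroughStacker` below is a hypothetical stand-in written for this example, not TPOT's StackingEstimator (the real class may append more columns), and the data is synthetic. The evaluation uses 5-fold cross-validation with accuracy, which as far as I recall matches TPOTClassifier's defaults; treat that as an assumption and check it against your own settings before comparing numbers.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Binarizer


class PassThroughStacker(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for TPOT's StackingEstimator (not TPOT's code)."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        # Fit a fresh copy so the estimator passed in is left untouched.
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def transform(self, X):
        # Prepend the predictions to the features the step received.
        preds = self.estimator_.predict(X).reshape(-1, 1)
        return np.hstack([preds, X])


pipeline = Pipeline(steps=[
    ("binarizer", Binarizer(threshold=0.0)),
    ("stacked_nb", PassThroughStacker(GaussianNB())),
    ("mlp", MLPClassifier(alpha=0.001, learning_rate_init=0.001,
                          random_state=1)),
])

# Synthetic stand-in data (hypothetical; not the data from this issue).
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# Score the whole pipeline with cross-validation so the number is
# comparable to a CV score rather than to a single train/test split.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```

If the manual score was computed on a single hold-out split, part of the gap to TPOT's reported CV score may come from the evaluation procedure rather than from the pipeline itself.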