EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

Model with FeatureUnion with identical FunctionTransformers #807

Open austinpoulton opened 5 years ago

austinpoulton commented 5 years ago

My team uses TPOT for model selection as part of an ML pipeline that helps inform risk opinions for online transactions. We have been using the default classifier configuration provided by TPOT and have, for the most part, been getting good classifier performance. Our most recent model showed consistent performance against the held-out set and scoring distribution, but performed poorly when serving predictions on out-of-sample data. We noticed that this pipeline included a FeatureUnion of two identical copy FunctionTransformers, so the input to the RF classifier was double the original width.

fu = m37_pipe.steps[0][1]          # the FeatureUnion step
f0 = fu.transformer_list[0][1]     # first FunctionTransformer
f1 = fu.transformer_list[1][1]     # second FunctionTransformer
f0.func == f1.func                 # both wrap the same copy function

True
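The doubling can also be checked on the transformed output; a minimal sketch, where X stands in for a sample of our original feature matrix (an assumed name, not an object from the pipeline):

Xt = fu.transform(X)      # fu is the fitted FeatureUnion from above
print(X.shape, Xt.shape)  # Xt comes back twice as wide as X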

All of our prior classifiers have been some form of tree ensemble with no pre-processing steps.

We have been using TPOT 0.9.0, and the only change prior to this weird model was upgrading pandas to 0.23.
TPOT configuration used:

import tpot
from tpot import TPOTClassifier

# config_dict below is TPOT's built-in default classifier search space
tpot_clf = TPOTClassifier(generations=5,
                          population_size=20,
                          verbosity=0,
                          n_jobs=1,
                          config_dict=tpot.config.classifier_config_dict)
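For context, the fitted pipeline below was read off the trained TPOT object; a sketch of how that object is obtained, using synthetic stand-in data rather than our real transaction features:

from sklearn.datasets import make_classification

# Synthetic stand-in data, just to illustrate the API
X_train, y_train = make_classification(n_samples=200, n_features=10, random_state=0)

tpot_clf.fit(X_train, y_train)

# After fit, TPOT exposes the winning sklearn Pipeline; this is where the
# m37_pipe object inspected above comes from
m37_pipe = tpot_clf.fitted_pipeline_
print(m37_pipe.steps)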

The fitted TPOT pipeline in question was:

[('featureunion', FeatureUnion(n_jobs=1,
       transformer_list=[('functiontransformer-1', FunctionTransformer(accept_sparse=False, func=<function copy at 0x101d1f1e0>,
          inv_kw_args=None, inverse_func=None, kw_args=None,
          pass_y='deprecated', validate=True)), ('functiontransformer-2', FunctionTransformer(accept_sparse=False, func=<function copy at 0x101d1f1e0>,
          inv_kw_args=None, inverse_func=None, kw_args=None,
          pass_y='deprecated', validate=True))],
       transformer_weights=None)), 
 ('randomforestclassifier', RandomForestClassifier(bootstrap=False, class_weight=None,
            criterion='entropy', max_depth=None, max_features=0.2,
            max_leaf_nodes=None, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=3,
            min_samples_split=14, min_weight_fraction_leaf=0.0,
            n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False))]

I don't really understand how such a preprocessing step came about; it seems nugatory, and it produced a very skewed scoring distribution for out-of-sample predictions.

weixuanfu commented 5 years ago

The FeatureUnion with identical FunctionTransformers in that pipeline combines copies of the input matrix (related to this function). Because TPOT randomly generates tree-based pipelines, some pipelines need this FeatureUnion operator to combine transformed features or raw input features from two branches of the tree. But sometimes, as in this issue, the FeatureUnion just doubles the feature space of the raw input features.
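To make the degenerate case concrete, here is a minimal, self-contained sketch in plain scikit-learn (not TPOT-generated code) showing that a union of two copy transformers is just the input stacked side by side:

import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

X = np.arange(12).reshape(4, 3)

# Two identical copy transformers, as in the reported pipeline
fu = FeatureUnion([
    ('functiontransformer-1', FunctionTransformer(np.copy, validate=True)),
    ('functiontransformer-2', FunctionTransformer(np.copy, validate=True)),
])

Xt = fu.fit_transform(X)
print(Xt.shape)                                # (4, 6): width doubled
print(np.array_equal(Xt, np.hstack([X, X])))   # True: no new information

So the duplicated block adds no information by itself; it just widens the matrix that the downstream estimator sees.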