The FunctionTransformer(copy) object allows for a basic form of stacking when a classifier is present in the middle of a pipeline. FunctionTransformer(copy) makes a copy of the entire dataset, and that copy is merged with the predictions of a classifier on that dataset.
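Roughly, the pattern looks like this (just a minimal sketch, not an actual TPOT export; the iris data, the LogisticRegression inner classifier, and the variable names are placeholders). make_union concatenates the pass-through copy of the features with the output of the stacking branch, and TPOT's StackingEstimator adds the wrapped classifier's predictions (and class probabilities, when available) as new feature columns:

from copy import copy

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer
from tpot.builtins import StackingEstimator

X, y = load_iris(return_X_y=True)

stacked_pipeline = make_pipeline(
    make_union(
        FunctionTransformer(copy),                           # pass-through copy of the original features
        StackingEstimator(estimator=LogisticRegression()),   # adds the inner classifier's predictions as extra features
    ),
    RandomForestClassifier(n_estimators=100),
)
stacked_pipeline.fit(X, y)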
Nice, a feedback classifier. Is it mentioned somewhere in the sklearn docs? I couldn't find anything about this.
I don't think this is mentioned in the sklearn docs. We implemented this feature ourselves within the existing sklearn pipeline framework.
This can lead to weird pipelines though. Here is what I got:
from copy import copy

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer

# Score on the training set was: 0.333968253968
exported_pipeline = make_pipeline(
    make_union(
        FunctionTransformer(copy),
        FunctionTransformer(copy)
    ),
    RandomForestClassifier(bootstrap=False, criterion="gini", max_features=0.15, min_samples_leaf=10, min_samples_split=4, n_estimators=100)
)
In this case I doubt that the FunctionTransformer(copy) is useful. I guess adding copies is roughly equivalent to tweaking the max_features parameter of the random forest.
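As a quick sanity check with toy data (not my actual dataset), a union of two copy transformers really does just double every column, which is why it reads to me like an indirect way of changing how many features each split can see:

from copy import copy

import numpy as np
from sklearn.pipeline import make_union
from sklearn.preprocessing import FunctionTransformer

X = np.arange(12).reshape(4, 3)  # 4 samples, 3 features
doubled = make_union(FunctionTransformer(copy), FunctionTransformer(copy)).fit_transform(X)
print(doubled.shape)  # (4, 6): every original column appears twice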
That's interesting. How long did you run TPOT (population & generations) when it gave you this solution?
I have since deleted this example, but I got another one:
from copy import copy

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer, MaxAbsScaler, StandardScaler
from sklearn.svm import LinearSVC
from tpot.builtins import StackingEstimator

# Score on the training set was: 0.522222222222
exported_pipeline = make_pipeline(
    make_union(
        FunctionTransformer(copy),
        FunctionTransformer(copy)
    ),
    StandardScaler(),
    MaxAbsScaler(),
    StackingEstimator(estimator=LinearSVC(C=10.0, dual=False, loss="squared_hinge", penalty="l2", tol=0.01)),
    RandomForestClassifier(bootstrap=True, criterion="gini", max_features=0.5, min_samples_leaf=8, min_samples_split=18, n_estimators=100)
)
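If I understand the StackingEstimator step correctly, it fits the wrapped LinearSVC and then, at transform time, adds that classifier's predictions to the incoming features as extra columns, which is how a classifier ends up "in the middle" of the pipeline. A standalone sketch of that (placeholder iris data, not my dataset):

from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC
from tpot.builtins import StackingEstimator

X, y = load_iris(return_X_y=True)

stack = StackingEstimator(estimator=LinearSVC(C=10.0, dual=False, loss="squared_hinge", penalty="l2", tol=0.01))
X_new = stack.fit(X, y).transform(X)

print(X.shape)      # (150, 4)
print(X_new.shape)  # the predicted class is added as an extra column next to the original features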
Here is the TPOT classifier that I configured:
import tpot
from sklearn.model_selection import LeaveOneGroupOut

# 'experiment' and 'files' are helper modules from my own project
model = tpot.TPOTClassifier(
    cv=LeaveOneGroupOut(),
    scoring=experiment.build_scorer(),
    periodic_checkpoint_folder=files.create_abspath('models/multi_pca_usine_lcdv'),
    max_time_mins=11 * 60,
    max_eval_time_mins=10,
    n_jobs=10,
    verbosity=2
)
So the population size is 100 (the default, since I didn't set it). Not sure about the number of generations at this point; I'd guess at least 5, since there are 5 exported pipelines in the output folder before this one.
The optimizer ran for ~6 hours before reaching this intermediate result (better pipelines obtained later in the same run did not contain such artifacts).
Ah, 5 generations isn't very much time for TPOT to really refine the pipelines; at that point the GA has only gone through 5 rounds of selection. It's good to hear that pipelines from later in the run didn't retain this artifact.
The reason TPOT doesn't immediately get rid of pipelines like this is that the artifact is potentially useful for building more complex pipelines later in the optimization process. Either of those FunctionTransformers can be replaced with another pipeline operation in subsequent generations, which could potentially improve prediction performance. As such, our pipeline regularization process doesn't penalize pipelines that make two copies of the features like this, because it technically doesn't "hurt" the pipeline.
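As a hypothetical illustration (not output from any actual run), a later mutation might swap one of those copy branches for a real transformer, e.g. PCA, at which point the union structure starts to pay off:

from copy import copy

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer

# One copy branch has been "mutated" into PCA, so the downstream classifier
# now sees the raw features alongside the principal components.
mutated_pipeline = make_pipeline(
    make_union(
        FunctionTransformer(copy),
        PCA(n_components=2),
    ),
    RandomForestClassifier(n_estimators=100),
)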
We've discussed other approaches to pipeline regularization (#207) that would probably weed out pipelines like the one you showed above, but we haven't gotten around to implementing those ideas yet.
Interesting, thank you for the explanation. Overall I found TPOT to be very useful, well done!
In my best estimator I see a FunctionTransformer(copy). Is it useful? It just seems to copy the input to the output.