The FunctionTransformer(copy) object allows for a basic form of stacking when a classifier is present in the middle of a pipeline. FunctionTransformer(copy) makes a copy of the entire dataset, and that copy is merged with the predictions of a classifier on that dataset.
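Roughly, the pattern looks like this (just a minimal sketch, not an actual TPOT export; the iris data, the LogisticRegression inner classifier, and the variable names are placeholders). make_union concatenates the pass-through copy of the features with the output of the stacking branch, and TPOT's StackingEstimator adds the wrapped classifier's predictions (and class probabilities, when available) as new feature columns:

from copy import copy

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer
from tpot.builtins import StackingEstimator

X, y = load_iris(return_X_y=True)

stacked_pipeline = make_pipeline(
    make_union(
        FunctionTransformer(copy),                           # pass-through copy of the original features
        StackingEstimator(estimator=LogisticRegression()),   # adds the inner classifier's predictions as extra features
    ),
    RandomForestClassifier(n_estimators=100),
)
stacked_pipeline.fit(X, y)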
Nice, a feedback classifier. Is it mentioned somewhere in the sklearn docs? I couldn't find anything about this.
I don't think this is mentioned in the sklearn docs. We implemented this feature ourselves within the existing sklearn pipeline framework.
This can lead to weird pipelines though. Here is what I got:
from copy import copy

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer

# Score on the training set was: 0.333968253968
exported_pipeline = make_pipeline(
    make_union(
        FunctionTransformer(copy),
        FunctionTransformer(copy)
    ),
    RandomForestClassifier(bootstrap=False, criterion="gini", max_features=0.15, min_samples_leaf=10, min_samples_split=4, n_estimators=100)
)
In this case I doubt that the FunctionTransformer(copy) is useful. I guess adding copies is roughly equivalent to tweaking the max_features parameter of the random forest.
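As a quick sanity check with toy data (not my actual dataset), a union of two copy transformers really does just double every column, which is why it reads to me like an indirect way of changing how many features each split can see:

from copy import copy

import numpy as np
from sklearn.pipeline import make_union
from sklearn.preprocessing import FunctionTransformer

X = np.arange(12).reshape(4, 3)  # 4 samples, 3 features
doubled = make_union(FunctionTransformer(copy), FunctionTransformer(copy)).fit_transform(X)
print(doubled.shape)  # (4, 6): every original column appears twice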
That's interesting. How long did you run TPOT (population & generations) when it gave you this solution?
I have since deleted this example, but I got another one:
from copy import copy

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer, MaxAbsScaler, StandardScaler
from sklearn.svm import LinearSVC
from tpot.builtins import StackingEstimator

# Score on the training set was: 0.522222222222
exported_pipeline = make_pipeline(
    make_union(
        FunctionTransformer(copy),
        FunctionTransformer(copy)
    ),
    StandardScaler(),
    MaxAbsScaler(),
    StackingEstimator(estimator=LinearSVC(C=10.0, dual=False, loss="squared_hinge", penalty="l2", tol=0.01)),
    RandomForestClassifier(bootstrap=True, criterion="gini", max_features=0.5, min_samples_leaf=8, min_samples_split=18, n_estimators=100)
)
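If I understand the StackingEstimator step correctly, it fits the wrapped LinearSVC and then, at transform time, adds that classifier's predictions to the incoming features as extra columns, which is how a classifier ends up "in the middle" of the pipeline. A standalone sketch of that (placeholder iris data, not my dataset):

from sklearn.datasets import load_iris
from sklearn.svm import LinearSVC
from tpot.builtins import StackingEstimator

X, y = load_iris(return_X_y=True)

stack = StackingEstimator(estimator=LinearSVC(C=10.0, dual=False, loss="squared_hinge", penalty="l2", tol=0.01))
X_new = stack.fit(X, y).transform(X)

print(X.shape)      # (150, 4)
print(X_new.shape)  # the predicted class is added as an extra column next to the original features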
Here is the TPOT classifier that I configured:
import tpot
from sklearn.model_selection import LeaveOneGroupOut

# 'experiment' and 'files' are helper modules from my own project
model = tpot.TPOTClassifier(
    cv=LeaveOneGroupOut(),
    scoring=experiment.build_scorer(),
    periodic_checkpoint_folder=files.create_abspath('models/multi_pca_usine_lcdv'),
    max_time_mins=11 * 60,
    max_eval_time_mins=10,
    n_jobs=10,
    verbosity=2
)
So the population size is 100 (the default, since I didn't set it). Not sure about the number of generations at this point; I'd guess at least 5, since there are 5 exported pipelines in the output folder before this one.
The optimizer ran for ~6 hours before reaching this intermediate result (better pipelines obtained later in the same run did not contain such artifacts).
Ah, 5 generations isn't very much time for TPOT to really refine the pipelines; at that point the GA has only gone through 5 rounds of selection. It's good to hear that pipelines from later in the run didn't retain this artifact.
The reason TPOT doesn't immediately get rid of pipelines like this is that the artifact is potentially useful for building more complex pipelines later in the optimization process. Either of those FunctionTransformers can be replaced with another pipeline operation in subsequent generations, which could potentially improve prediction performance. As such, our pipeline regularization process doesn't penalize pipelines that make two copies of the features like this, because it technically doesn't "hurt" the pipeline.
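As a hypothetical illustration (not output from any actual run), a later mutation might swap one of those copy branches for a real transformer, e.g. PCA, at which point the union structure starts to pay off:

from copy import copy

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer

# One copy branch has been "mutated" into PCA, so the downstream classifier
# now sees the raw features alongside the principal components.
mutated_pipeline = make_pipeline(
    make_union(
        FunctionTransformer(copy),
        PCA(n_components=2),
    ),
    RandomForestClassifier(n_estimators=100),
)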
We've discussed other approaches to pipeline regularization (#207) that would probably weed out pipelines like the one you showed above, but we haven't gotten around to implementing those ideas yet.
Interesting, thank you for the explanation. Overall I found TPOT to be very useful, well done!
In my best estimator I see a FunctionTransformer(copy). Is it useful? It just seems to copy the input to the output.