Here are a few old discussions (#214, #360, and #457) about stacking features in TPOT. StackingEstimator in TPOT is a transformer that uses the results of some estimator as synthetic feature(s), so that the final X_transformed has n+1 or n+2 columns (where n is the number of original features). I think StackingEstimator may not need to use the final estimator, since it could be an intermediate step between two other transformers in a pipeline. @rhiever Any idea?
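For readers following along, here is a minimal sketch of a transformer along these lines, assuming the scikit-learn API (a simplified stand-in for illustration, not TPOT's actual StackingEstimator code):

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin, clone

class SimpleStackingEstimator(BaseEstimator, TransformerMixin):
    """Hypothetical simplified wrapper: appends the inner estimator's
    predictions to X as a synthetic feature, so X_transformed has n+1
    columns (n+2 if class probabilities were also appended)."""
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator_ = clone(self.estimator).fit(X, y)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # The appended column comes from a model fitted on the same rows it
        # now predicts during training -- exactly the leakage debated in
        # this issue.
        preds = self.estimator_.predict(X).reshape(-1, 1)
        return np.hstack([X, preds])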
I think Jonathan is right and there is data leakage. I actually ran into this problem myself when implementing something similar. I decided to ignore this for now for several reasons:
Others found it had little impact on overfitting (e.g. from a top Kaggler's blog post):
This is leakage and in theory S could deduce information about the target values from the meta features in a way that would cause it to overfit the training data and not generalize well to out-of-bag samples. However, you have to work hard to conjure up an example where this leakage is significant enough to cause the stacked model to overfit. In practice, everyone ignores this theoretical hole (and frankly I think most people are unaware it even exists!).
If you want to avoid it, the evaluation process requires (potentially a lot) more computation time. Consider training a StackingEstimator with the K-fold scheme you propose: the estimator generates out-of-sample predictions for each row in your data (so without data leakage), and the rest of the pipeline can then use these to make further predictions. This means that instead of training one model, you trained K models, and you need to do this for each fold in the cross-validation of your pipeline (by default, 5).
Then consider adding another StackingEstimator, SE A, before your first StackingEstimator, SE B (i.e. the pipeline is now (SE A)->(SE B)->(final estimator)). SE A needs to provide predictions for SE B, for each fold SE B uses to create predictions. SE A can't simply be trained in K folds on all of the data, because then it would leak data to SE B (i.e. if SE B makes predictions on a certain fold, SE A cannot have been trained on any data of that fold). This means that to provide SE B with leak-free predictions, SE A needs to be trained K times for each time SE B is trained. So we see exponential growth in the number of models to be trained to create a leak-free estimate, i.e. K^S, where K is the number of folds per stacking estimator and S is the number of sequential estimators (a rough fit count is sketched below). Especially in an evolutionary algorithm, where we want to evaluate hundreds to tens of thousands (or more) of pipelines, this is something to consider.
While the stacking estimator is trained with this leak, the evaluation of the entire pipeline will still be on out-of-sample rows. This means the StackingEstimator's final model will not actually be trained on data of that fold. The leakage might introduce a bias to use more of the StackingEstimator's predictions, but if they turn out not to generalize well, they will soon enough be dropped.
All in all, in theory, I do think you should avoid this issue. In practice, I am not convinced the benefits will outweigh the costs.
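To make that growth concrete, here is a rough fit count under the assumptions above (illustrative numbers only: the final estimator needs one fit, SE B needs K fits, and SE A needs K fits per SE B fit):

K = 5    # folds per stacking estimator
S = 2    # sequential stacking estimators (SE A and SE B)
CV = 5   # outer cross-validation folds used to evaluate the pipeline
fits_per_pipeline = sum(K ** level for level in range(S + 1))  # 1 + 5 + 25 = 31
total_fits = CV * fits_per_pipeline                            # 155 fits for one pipeline
print(fits_per_pipeline, total_fits)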
Please let me know if you see any issues with the statements above :)
I agree that there is some data leakage, but I don't think data leakage is the biggest problem. The biggest problem is overfitting.
Notes:
Stacking, as described in https://github.com/civisanalytics/civisml-extensions/blob/master/civismlext/stacking.py#L132 is:
Fit the base estimators on CV folds, then use their prediction on the validation folds to train the meta-estimator. Then re-fit base estimators on full training set.
The variation done by the Kaggle post is to forget about the validation folds, which is acceptable to me. This is plausible because, for example, the model trained on folds {2-5} that transforms fold 1 just averages the relationship between X_{2-5} and y_{2-5}, so the leakage of y_{2-5} can be neglected. Since the models are not trained end-to-end (as in a deep learning fashion), they won't learn to exploit the leakage.
Note that there are three steps: (1) fit the base estimators on the CV folds, (2) train the meta-estimator on their out-of-fold predictions on the validation folds, and (3) re-fit the base estimators on the full training set.
If there is any error in this part, please correct me.
I'm considering the following sample code:
make_pipeline(
    StackingEstimator(Model1()),
    StackingEstimator(Model2()),
    StackingEstimator(Model3()),
    EnsembleModel(),
)
StackingEstimator is just a wrapper to make scikit-learn think that all estimators but the last are transformers.
Now, there are only two steps in the pipeline:
Cross-validation: this corresponds to the "cross-validation" of the stacked estimators and the meta-estimator training. The StackingEstimators fit on folds {2-5}, then transform folds {1-5}, and the ensemble learns on folds {2-5} to predict fold 1.
Fit: each StackingEstimator learns on the whole dataset, then makes predictions on the whole dataset. Then the ensemble model is fitted again on those predictions.
The problem is not in the architecture or the prediction method (I'm totally fine with "adding" features sequentially), but in the training.
The model selection is done by a cross-validation (averaging some train/test splits), and the problem is that the stacked model is trained without cross-validation. The consequence is that you train the meta model to select the most overfitting base model. This problem happens during both the cross-validation and the fit phase. Doing a cross-validation on underperforming models is not a critical issue if all models are affected in the same way. The big problem is that the final models are not good.
The cross-validation is already a bad thing, as the ensemble will just select the StackingEstimator that overfits the most, but it is quite complicated to see how bad it is. Of course it will decrease the performance of the model, but at this step all that counts is how the pipelines are ranked, and I have no argument to prove that this ordering will be changed. I think it is not even as bad as it seems, because it will mostly degrade the performance of the pipelines that use StackingEstimators prone to overfitting.
The really bad thing is that it also happens when you fit the whole model (final step). Basically, you lose the whole point of stacking, because the meta-model is trained on the training predictions of the stacked models rather than on out-of-sample predictions. That is, you select the models that overfit the most, not the ones that generalize well.
I made an example notebook, hosted on colab: https://colab.research.google.com/drive/1mLGhBcshGp1bUbbo4va2RGP-4j0LWV6v
StackingEstimator looks a lot like this StackExchange answer, but with much improved code quality.
However, the stacking process is not perfectly compatible with a straightforward fit and transform API. When training a stacking model, one must ensure that the intermediate models make predictions on samples they were not trained on. Thus, it is not possible to simply modify the models to put them in a pipeline.
My solution only modifies the Pipeline object to override its fit function when there is more than one estimator. This will create two new objects: StackingPipeline and make_stacking_pipeline.
It is now important to distinguish the folds that are used in the cross-validation process, and the folds that are used in the stacking. They can be different.
In the Kaggle article, no cross-validation is done on the stacking estimator.
Here, I write K for the number of stacking folds and S for the number of estimators.
My solution only requires fitting N = K*(S-1) + S estimators.
Algorithm:
1. Split the training data into K stacking folds.
2. For each of the S-1 stacked estimators, in order: fit K copies (each on K-1 folds) to produce out-of-fold predictions for every row, append these predictions as a new feature column, and re-fit the estimator on the full augmented data for use at prediction time.
3. Fit the final (meta) estimator once on the fully augmented data.
This gives K*(S-1) + (S-1) + 1 = K*(S-1) + S fits.
If you want to add cross-validation with K_cv > 1 folds, then you have to fit K_cv * N estimators during the selection phase, and again N estimators on the full training dataset.
However, I think that in this case, taking K_cv = 1 is good enough, as we can suppose the meta estimator has a relatively stable performance.
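A minimal sketch of that fitting scheme, assuming scikit-learn estimators (fit_stacking and predict_stacking are hypothetical helpers for illustration, not the proposed StackingPipeline itself):

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import cross_val_predict

def fit_stacking(estimators, meta_estimator, X, y, k=5):
    """Fit S-1 stacked estimators plus a meta-estimator without leakage:
    K out-of-fold fits per stacked estimator to build the augmented features,
    one refit per stacked estimator on the full data, and one meta fit."""
    X_aug = np.asarray(X, dtype=float)
    fitted = []
    for est in estimators:
        # K fits: each row's synthetic feature is predicted by a model that
        # never saw that row during training.
        oof = cross_val_predict(clone(est), X_aug, y, cv=k)
        # One extra fit on the full augmented data, used at prediction time.
        fitted.append(clone(est).fit(X_aug, y))
        X_aug = np.column_stack([X_aug, oof])
    # Final fit: the meta-estimator only ever sees out-of-fold predictions.
    meta = clone(meta_estimator).fit(X_aug, y)
    return fitted, meta

def predict_stacking(fitted, meta, X):
    X_aug = np.asarray(X, dtype=float)
    for est in fitted:
        X_aug = np.column_stack([X_aug, est.predict(X_aug)])
    return meta.predict(X_aug)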
The stacking of the Kaggle article can successfully be implemented with code like this:
model = make_stacking_pipeline(
    make_union(KNeighborsClassifier(), SVC()),
    LogisticRegression()
)
with K=5 and K_cv=1.
I think that
model = make_stacking_pipeline(
    KNeighborsClassifier(),
    SVC(),
    LogisticRegression()
)
would have similar (if not greater) performance, because the data leakage is small. This approach would be equivalent to restacking, as in StackNet and Wolpert's stacked generalization.
K depends on the StackingPipeline and K_cv depends on the cross-validation function, so there is no way to set K_cv=1 when a StackingPipeline is used. On the other hand, it is a good thing to still have cross-validation if the meta model is prone to overfitting.
The two main problems are that K_cv cannot be forced to 1 when a StackingPipeline is used, and that the stacking folds add extra fitting cost. However, K can be restricted to reasonable values (less than 4) if one supposes that how the models will be assembled depends more on the pipeline than on the quantity of data they were trained on.
Note that in most use cases S is small, so the overload remains modest compared with the exponential K^S growth of a fully leak-free evaluation. Here is a table of the overload:
number of estimators: S | stacking folds: K | estimators: K * max(0, S-1) + S | overload: (K * max(0, S-1) + S) / S
---|---|---|---
1 | 2 | 1 | 1
2 | 2 | 4 | 2
3 | 2 | 7 | 2.33
1 | 3 | 1 | 1
2 | 3 | 5 | 2.5
3 | 3 | 9 | 3
1 | 4 | 1 | 1
2 | 4 | 6 | 3
3 | 4 | 11 | 3.67
Thus, in most cases, the overload remains acceptable (having more than 3 estimators is rare).
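For completeness, a tiny snippet reproducing the counts in the table above (n_fits is just an illustrative helper name):

def n_fits(S, K):
    # K out-of-fold fits for each of the S-1 stacked estimators,
    # plus S final fits on the full training set.
    return K * max(0, S - 1) + S

for K in (2, 3, 4):
    for S in (1, 2, 3):
        print(S, K, n_fits(S, K), round(n_fits(S, K) / S, 2))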
Following the discussion on #457, I would be glad to hear feedback from @rasbt about this "multilevel" restacking.
I would also like to have some feedback from the core maintainers as I am willing to implement the above.
I believe the current implementation of StackingEstimator is incorrect. There is a very large possibility of data leakage.
Let's say X has 100 rows and 10 columns. With the current implementation, a stacking estimator is trained on the full 100 rows, and an X_transformed is created by taking predictions from the stacking estimator and adding them as another column to X. The final result is a 100-row by 11-column matrix, which will be fed to the final estimator.
But this is incorrect. The problem comes down to this: the stacking estimator injects future knowledge into X, and the final estimator is trained using this future knowledge.
If you look at this implementation of stacking, they do something different:
https://github.com/civisanalytics/civisml-extensions/blob/master/civismlext/stacking.py
With their implementation, a stacking estimator is trained on only 50 rows, and the final estimator is trained on the remaining 50 rows with an additional column for the predictions of the stacking estimator. Finally, the stacking estimator is retrained on all 100 rows.
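As a toy illustration of that scheme (hypothetical estimators and random data, not civisml-extensions' actual code):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 10)), rng.normal(size=100)
base, final = RandomForestRegressor(random_state=0), LinearRegression()

# Train the stacking (base) estimator on the first 50 rows only...
base.fit(X[:50], y[:50])
# ...and train the final estimator on the other 50 rows, with an 11th column
# holding the base estimator's out-of-sample predictions.
X_final = np.column_stack([X[50:], base.predict(X[50:])])
final.fit(X_final, y[50:])

# Finally, retrain the base estimator on all 100 rows for prediction time.
base.fit(X, y)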
Let me know if I'm missing something. Thanks.