step predictions and data leakage

alegonz / baikal

A graph-based functional API for building complex scikit-learn pipelines.

https://baikal.readthedocs.io

BSD 3-Clause "New" or "Revised" License


step predictions and data leakage #15

Closed: bmreiniger closed this issue 4 years ago

bmreiniger commented 4 years ago

It looks like you fit every step on the same, entire dataset. The inputs to steps beyond the first layer are then the "predictions" that the first-layer models make on their own training data, which seems prone to data leakage and overfitting. See e.g. mlxtend's StackingClassifier vs. StackingCVClassifier.
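To make the concern concrete, here is a minimal standard-library sketch. It does not use baikal or mlxtend. The toy 1-nearest-neighbour model, the pure-noise dataset, and all function names are hypothetical choices made for illustration. It compares predictions a model makes on its own training rows with out-of-fold predictions, where each row is predicted by a model that never saw it.

```python
import random


def nearest_label(train, point):
    """Return the label of the training row closest to `point` (1-NN)."""
    closest = min(train, key=lambda row: (row[0] - point) ** 2)
    return closest[1]


def in_sample_predictions(data):
    """Fit on all rows, then predict those same rows (the leaky setup)."""
    return [nearest_label(data, x) for x, _ in data]


def out_of_fold_predictions(data, n_folds=5):
    """Predict each row with a model fit only on the other folds."""
    preds = [None] * len(data)
    for fold in range(n_folds):
        held_out = set(range(fold, len(data), n_folds))
        train = [row for i, row in enumerate(data) if i not in held_out]
        for i in held_out:
            preds[i] = nearest_label(train, data[i][0])
    return preds


def accuracy(preds, data):
    """Fraction of predictions that match the true labels."""
    return sum(p == y for p, (_, y) in zip(preds, data)) / len(data)


rng = random.Random(0)
# Pure-noise dataset: the feature carries no information about the label.
data = [(rng.random(), rng.randint(0, 1)) for _ in range(200)]

leaky_acc = accuracy(in_sample_predictions(data), data)
oof_acc = accuracy(out_of_fold_predictions(data), data)
print(f"in-sample accuracy:   {leaky_acc:.2f}")
print(f"out-of-fold accuracy: {oof_acc:.2f}")
```

Because every row is its own nearest neighbour, the in-sample predictions simply echo the labels, so a second-layer step trained on them would be handed the target itself. The out-of-fold predictions should stay near chance on this noise data, which is the honest signal a downstream step ought to see.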

alegonz commented 4 years ago

Yes, I'm already aware of this, thank you. I am working on updating the examples and on a new API that supports out-of-fold predictions in the first level. Please refer to the discussion on Issue #13 for more details.

alegonz commented 4 years ago

Stacks that follow the standard protocol (and are not prone to data leakage) are now possible with the new API released in version 0.3.0. Please refer here for an example.