flennerhag / mlens

ML-Ensemble – high performance ensemble learning
http://ml-ensemble.com
MIT License
843 stars 108 forks

Stacking of Classifiers that Operate on Different Feature Subsets #129

Closed m-mohsin-zafar closed 4 years ago

m-mohsin-zafar commented 4 years ago

I have a dataset with, say, 200 features. What I want is to give 30 features to one classifier, 90 to another, and 80 to another in one layer of ensembled classifiers, and then feed their outputs to a meta classifier. I believe this is achievable via the Subset class available in your library, but I can't figure out the right way. I have found a similar approach in another library, 'mlxtend', the code for which is available below. However, I'd like to do my work using your library. Thanking you in anticipation.

from sklearn.datasets import load_iris
from mlxtend.classifier import StackingCVClassifier
from mlxtend.feature_selection import ColumnSelector
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris.data
y = iris.target

pipe1 = make_pipeline(ColumnSelector(cols=(0, 2)),
                      LogisticRegression())
pipe2 = make_pipeline(ColumnSelector(cols=(1, 2, 3)),
                      LogisticRegression())

sclf = StackingCVClassifier(classifiers=[pipe1, pipe2], 
                            meta_classifier=LogisticRegression(),
                            random_state=42)

sclf.fit(X, y)
Akshay-Ijantkar commented 4 years ago

Same doubt!

Akshay-Ijantkar commented 4 years ago

@m-mohsin-zafar thank you for sharing the mlxtend code! Have you tried pystacknet, vecstack, and scikit-learn?

Akshay-Ijantkar commented 4 years ago

@m-mohsin-zafar The mlens documentation is seriously confusing and not up to the mark.

flennerhag commented 4 years ago

Hi there,

Thanks for reaching out!

We can achieve what you're looking for using dicts to specify pipelines when adding a layer to the ensemble:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from mlens.ensemble import SuperLearner
from mlens.preprocessing import Subset

iris = load_iris()
X = iris.data
y = iris.target

ens = SuperLearner()
ens.add(estimators={"pipe-1": [LogisticRegression()],
                    "pipe-2": [LogisticRegression()]},
        preprocessing={"pipe-1": [Subset([0, 2])],        # pipe-1 sees only columns 0 and 2
                       "pipe-2": [Subset([1, 2, 3])]})    # pipe-2 sees columns 1, 2 and 3
ens.add_meta(LogisticRegression())
ens.fit(X, y)

The key to note is that the values in these dicts should be lists:

ests = {pipe_1: [est_1, est_2, ...], pipe_2: [est_1, est_2, ...]}
prps = {pipe_1: [trans_1, trans_2, ...], pipe_2: [trans_1, trans_2, ...]}
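
For instance, a concrete version of that schematic, with two chained transformers and two estimators per pipeline, might look like this (GaussianNB and StandardScaler are just illustrative choices, not part of the example above):

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler
from mlens.ensemble import SuperLearner
from mlens.preprocessing import Subset

# each pipeline chains two transformers and feeds two estimators
ests = {"pipe-1": [LogisticRegression(), GaussianNB()],
        "pipe-2": [LogisticRegression(), GaussianNB()]}
prps = {"pipe-1": [Subset([0, 2]), StandardScaler()],
        "pipe-2": [Subset([1, 2, 3]), StandardScaler()]}

ens = SuperLearner()
ens.add(estimators=ests, preprocessing=prps)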

So if we feed an input X to this layer, it will get processed in parallel through pipe_1 and pipe_2. In each of these, we obtain preprocessed features X -> trans_1 -> trans_2 -> X_i, which we feed to the list of estimators in that pipeline. The output of a layer is the concatenation of all predictions:

P = [pipe_1_est_1(X_1), pipe_1_est_2(X_1), ..., pipe_2_est_1(X_2), pipe_2_est_2(X_2), ...]
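
As a quick sanity check (a sketch; it assumes that calling predict on a fitted ensemble whose last layer has no meta learner returns that layer's concatenated output), the prediction array should have one column per estimator:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from mlens.ensemble import SuperLearner
from mlens.preprocessing import Subset

iris = load_iris()
X = iris.data
y = iris.target

ens = SuperLearner()
ens.add(estimators={"pipe-1": [LogisticRegression()],
                    "pipe-2": [LogisticRegression()]},
        preprocessing={"pipe-1": [Subset([0, 2])],
                       "pipe-2": [Subset([1, 2, 3])]})
ens.fit(X, y)

P = ens.predict(X)
print(P.shape)  # expected (150, 2): one column per estimator in the layer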

Note that you can also propagate features from the input array X to the output array P by using the propagate_features argument when adding a layer to the ensemble:

ens.add(estimators=ests, preprocessing=prps, propagate_features=[0, 1, 2])
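
Put together with the iris example from above, that might look like the sketch below (the choice of columns [0, 1, 2] is arbitrary):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from mlens.ensemble import SuperLearner
from mlens.preprocessing import Subset

iris = load_iris()
X = iris.data
y = iris.target

ens = SuperLearner()
ens.add(estimators={"pipe-1": [LogisticRegression()],
                    "pipe-2": [LogisticRegression()]},
        preprocessing={"pipe-1": [Subset([0, 2])],
                       "pipe-2": [Subset([1, 2, 3])]},
        propagate_features=[0, 1, 2])  # copy input columns 0-2 into the layer output
ens.add_meta(LogisticRegression())     # the meta learner sees predictions + propagated columns
ens.fit(X, y)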

The reason for using this logic is that it allows us to run a preprocessing pipeline just once and then have many estimators using those features. The sklearn version would require us to re-run the preprocessing step for every estimator, which isn't efficient.
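
To make that concrete: in the dict form, a single pipeline key can carry several estimators, so the Subset transform below is fitted once per pipeline and all three estimators consume its output (the extra estimators are just illustrative choices):

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from mlens.ensemble import SuperLearner
from mlens.preprocessing import Subset

ens = SuperLearner()
ens.add(estimators={"pipe-1": [LogisticRegression(),
                               RandomForestClassifier(),
                               GaussianNB()]},
        preprocessing={"pipe-1": [Subset([0, 2])]})  # run once, shared by all three estimators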

Having said that, you can mix and match between mlxtend, mlens, and scikit-learn:

from sklearn.datasets import load_iris
from mlxtend.feature_selection import ColumnSelector
from sklearn.linear_model import LogisticRegression
from mlens.ensemble import SuperLearner
from sklearn.pipeline import make_pipeline

iris = load_iris()
X = iris.data
y = iris.target

pipe1 = make_pipeline(ColumnSelector(cols=(0, 2)),
                      LogisticRegression())
pipe2 = make_pipeline(ColumnSelector(cols=(1, 2, 3)),
                      LogisticRegression())

ens = SuperLearner()
ens.add([pipe1, pipe2])
ens.add_meta(LogisticRegression())
ens.fit(X, y)
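
Once fitted, the ensemble behaves like any scikit-learn estimator, e.g.:

preds = ens.predict(X)  # predictions from the meta learner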

Hope this helps! Feel free to reopen this issue otherwise :)