AxeldeRomblay / MLBox

MLBox is a powerful Automated Machine Learning Python library.
https://mlbox.readthedocs.io/en/latest/

Feedback Stacking #88

Closed brunosez closed 4 years ago

brunosez commented 4 years ago

Hi all,

Could you clarify stacking? I ran an experiment with this search space:

    space = {
        # 'ne__numerical_strategy': {"search": "choice",
        #                            "space": [0]},
        'ce__strategy': {"search": "choice",
                         "space": ["label_encoding", "random_projection", "entity_embedding"]},
        'fs__threshold': {"search": "uniform",
                          "space": [0.01, 0.4]},
        'est__max_depth': {"search": "choice",
                           "space": [3, 4, 5, 6, 7, 10, 15, 20, 30, 50]},
        'est__n_estimators': {"search": "choice",
                              "space": [200, 400, 800, 1000]},
        'est__baseestimators': {"search": "choice",
                                "space": ['LightGBM', 'RandomForest', 'ExtraTrees']},
        'est__levelestimator': {"search": "choice",
                                "space": ['LightGBM']}
    }

and obtained the best params with only one model as base estimator?

    {'ce__strategy': 'label_encoding', 'est__baseestimators': 'ExtraTrees',
     'est__levelestimator': 'LightGBM', 'est__max_depth': 50,
     'est__n_estimators': 1000, 'fs__threshold': 0.0349813407317037}

Thanks

Best Regards Bruno

AxeldeRomblay commented 4 years ago

Hello @brunosez,

Good question! At the moment it is not possible to optimise over the base estimators (the hyperopt code raises an error because the parameter is a list...). Nevertheless, you can optimise over the other stacking parameters (like n_folds, copy, ...). Also be careful here: you need to call the stacking step, so you have to replace 'est' with 'stck', like this:

    space = {
        'stck__copy': {"search": "choice", "space": [True, False]},
        'est__strategy': {"search": "choice", "space": ['LightGBM', 'RandomForest', 'ExtraTrees']}
    }

opt.optimise(space, df, 5)
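
For context, opt and df above come from the usual MLBox pipeline. A minimal sketch, with file paths and target name as placeholders:

    from mlbox.preprocessing import Reader, Drift_thresholder
    from mlbox.optimisation import Optimiser

    # read the raw csv files and split them into train/test (placeholder paths)
    df = Reader(sep=",").train_test_split(["train.csv", "test.csv"], "target")
    # remove features that drift between train and test
    df = Drift_thresholder().fit_transform(df)

    opt = Optimiser(scoring="accuracy", n_folds=5)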

If you want to try different stacking configurations (this is for classification, otherwise you'll need to replace Classifier by Regressor), you can evaluate each configuration like this:

    from mlbox.model.classification import Classifier

    params = {'stck__base_estimators': [Classifier(strategy="LightGBM"),
                                        Classifier(strategy="RandomForest"),
                                        Classifier(strategy="ExtraTrees")],
              'est__strategy': "Linear"}

opt.evaluate(params, df)
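
Since hyperopt cannot search over the list-valued 'stck__base_estimators' parameter, a simple workaround is to loop over a few hand-picked configurations and keep the best one. A minimal sketch (the candidate sets are arbitrary examples, and we assume the chosen scoring is one where higher is better):

    from mlbox.model.classification import Classifier

    # arbitrary candidate sets of base estimators to compare
    candidate_sets = [
        [Classifier(strategy="LightGBM"), Classifier(strategy="RandomForest")],
        [Classifier(strategy="LightGBM"), Classifier(strategy="ExtraTrees")],
        [Classifier(strategy="LightGBM"), Classifier(strategy="RandomForest"),
         Classifier(strategy="ExtraTrees")],
    ]

    best_score, best_params = None, None
    for base in candidate_sets:
        params = {'stck__base_estimators': base, 'est__strategy': "Linear"}
        score = opt.evaluate(params, df)  # mean CV score of this configuration
        if best_score is None or score > best_score:
            best_score, best_params = score, params

    print(best_score, best_params)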

I will work on improving this. I am aware it is a bit tricky... If you have any other questions, feel free to ask! Enjoy!

Axel

brunosez commented 4 years ago

Thanks @AxeldeRomblay! The process could be to optimize each base estimator individually, note the params, and then run the stacking with another small lib, like in this notebook: https://www.kaggle.com/jakelj/basic-ensemble-model

    from mlxtend.classifier import StackingCVClassifier

    # clf_knc, clf_hbc, clf_etc and clf_lgbm are the individually tuned base estimators
    ensemble = [('clf_knc', clf_knc),
                ('clf_hbc', clf_hbc),
                ('clf_etc', clf_etc),
                ('clf_lgbm', clf_lgbm)]

    # stack the base estimators with a LightGBM meta-classifier
    stack = StackingCVClassifier(classifiers=[clf for label, clf in ensemble],
                                 meta_classifier=clf_lgbm,
                                 cv=3,
                                 use_probas=True,
                                 use_features_in_secondary=True,
                                 verbose=-1,
                                 n_jobs=-1)

    stack = stack.fit(X, y)
    predictions = stack.predict(test)
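
For reference, the clf_* objects above would be the individually tuned base classifiers. A hypothetical sketch of what they could look like (placeholder model choices and hyperparameters; the linked notebook defines its own tuned versions):

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.ensemble import ExtraTreesClassifier, HistGradientBoostingClassifier
    from lightgbm import LGBMClassifier

    # placeholder hyperparameters; substitute the best params found during tuning
    clf_knc = KNeighborsClassifier(n_neighbors=5)
    clf_hbc = HistGradientBoostingClassifier()  # needs sklearn.experimental import on old scikit-learn
    clf_etc = ExtraTreesClassifier(n_estimators=400)
    clf_lgbm = LGBMClassifier(n_estimators=400)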

It could be useful to post an example with stacking.

Rgds Bruno