automl / auto-sklearn

Automated Machine Learning with scikit-learn
https://automl.github.io/auto-sklearn
BSD 3-Clause "New" or "Revised" License

include stacking #316

Open rspadim opened 7 years ago

rspadim commented 7 years ago

Hi guys, could stacking be included, instead of only ensemble selection? (Maybe from the mlxtend package.)

mfeurer commented 7 years ago

We're actually evaluating alternatives to ensemble selection. To help us a little bit, do you know any publication or website where it's shown that stacking outperforms ensemble selection? Also, I'm wondering how one would make sure that the hyperparameters of a stacking algorithm are set in a way that the stacking algorithm does not overfit. Maybe @rasbt can also give some insights here as his package was mentioned.

rasbt commented 7 years ago

Hi, I don't know of a direct comparison off the top of my head; there seems to be a relevant section in the Proceedings of the 15th International Joint Conference on Artificial Intelligence - Volume 1, but I currently can't access the full text.

In any case, I think the methods are, in a sense, so different that one probably can't universally say that one is better than the other. In majority vote ensembles, you typically work with class label information and combine the labels with arbitrary weighting (typically, the weights are additional hyperparameters). In stacking, you work with the outputs of the classifiers (typically the probabilities) to learn a meta-classifier on top. Also, in stacking you typically include the original features alongside the first-level classifier outputs. I would say that stacking is more of a meta-machine-learning task, whereas ensemble voting is more of an empirical combination (plus weight hyperparameter tuning). Intuitively, I would say that stacking should work better in many cases (unless it suffers from extreme overfitting), but again, there are probably many exceptions to that.
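To make the contrast concrete, here's a minimal sketch of the two schemes using mlxtend; the dataset, base learners, and meta-classifier are just illustrative choices, not a recommendation:

```python
# Illustrative comparison of weighted voting vs. stacking (mlxtend).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.classifier import EnsembleVoteClassifier, StackingClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base = [LogisticRegression(max_iter=1000),
        DecisionTreeClassifier(max_depth=3),
        KNeighborsClassifier()]

# Voting: combines the base classifiers' outputs directly,
# with the weights acting as extra hyperparameters.
vote = EnsembleVoteClassifier(clfs=base, voting='soft', weights=[2, 1, 1])

# Stacking: feeds the base classifiers' probability outputs (and optionally
# the original features) into a second-level meta-classifier that is trained.
stack = StackingClassifier(classifiers=base,
                           meta_classifier=LogisticRegression(max_iter=1000),
                           use_probas=True,
                           use_features_in_secondary=True)

for name, clf in [("voting", vote), ("stacking", stack)]:
    clf.fit(X_tr, y_tr)
    print(name, clf.score(X_te, y_te))
```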

mfeurer commented 7 years ago

I think there is some misconception about the ensemble method employed by Auto-sklearn. Auto-sklearn uses ensemble selection with replacement to select a subset of relevant models and weight them appropriately (using probabilities instead of the predicted class). In general, it would be great to have a paper comparing all these methods to one another, but I don't know of any.
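For reference, here's a compressed sketch of the idea behind ensemble selection with replacement; it's a simplification for illustration, not Auto-sklearn's actual implementation:

```python
# Greedy ensemble selection with replacement over held-out predictions.
import numpy as np
from sklearn.metrics import log_loss

def ensemble_selection(val_probas, y_val, n_rounds=50):
    """val_probas: list of (n_samples, n_classes) probability arrays,
    one per candidate model, computed on a validation set."""
    bag = []  # indices of selected models, with repetition
    current_sum = np.zeros_like(val_probas[0], dtype=float)
    for _ in range(n_rounds):
        scores = []
        for probas in val_probas:
            # Score the ensemble that results from adding this model once more.
            candidate = (current_sum + probas) / (len(bag) + 1)
            scores.append(log_loss(y_val, candidate))
        best = int(np.argmin(scores))
        bag.append(best)
        current_sum += val_probas[best]
    # Normalised repetition counts act as the ensemble weights.
    weights = np.bincount(bag, minlength=len(val_probas)) / len(bag)
    return weights
```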

jclevesque commented 7 years ago

The problem is that to do stacking properly, it might be important to consider the hyperparameters of your stacker (see where this is going?). A recent paper at ICDM stacked another loop of Bayesian hyperparameter optimization on top of the first, i.e. at the place where you would put the ensemble selection with replacement in AutoSklearn. It seems to work fine, although I'm not sure how they achieve such a great speedup over AutoSklearn in their experiments.
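To illustrate the point, here is a rough sketch where a plain loop over the meta-classifier's regularisation strength stands in for that second round of Bayesian optimization; the classes and values are just placeholders:

```python
# Tuning the stacker's own hyperparameters with an outer search.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from mlxtend.classifier import StackingClassifier

X, y = load_breast_cancer(return_X_y=True)
base = [LogisticRegression(max_iter=1000), DecisionTreeClassifier(max_depth=3)]

best_score, best_C = -1.0, None
for C in [0.01, 0.1, 1.0, 10.0]:
    stack = StackingClassifier(classifiers=base,
                               meta_classifier=LogisticRegression(C=C, max_iter=1000),
                               use_probas=True)
    score = cross_val_score(stack, X, y, cv=5).mean()
    if score > best_score:
        best_score, best_C = score, C

print("best meta-classifier C:", best_C, "CV accuracy:", round(best_score, 3))
```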

arnaudsj commented 7 years ago

I just ran into a new stacking framework/lib that is about to be presented in July at Infiniteconf 2017 in London:

https://github.com/kaz-Anova/StackNet

There's lots of information in the repo on how it works, the models supported, and how hyperparameter optimization works. It might be worth looking at to see how stacking compares to the ensemble selection with replacement that auto-sklearn implements.

rasbt commented 7 years ago

@mfeurer Oh yeah, I totally misread the question/comment. I thought it was about the classic majority-vote ensembling (like we implemented as VotingClassifier in sklearn) vs. stacking. Never mind then :P

mfeurer commented 7 years ago

Thanks @jclevesque and @arnaudsj for the links. I added the Frankensteining paper to my reading list. @kaz-Anova is there a paper describing StackNet?

@rasbt no worries. Thanks for taking the time to answer this question.

ledell commented 6 years ago

> Also, in stacking you typically include the original features alongside the first-level classifier outputs.

@rasbt I don't think that this is typical -- at least among the folks I know who do stacking on a regular basis. I don't recommend adding the original features -- in my experience (and also according to some Kaggle Grandmasters that I've spoken to), this can cause overfitting and rarely leads to better performance. Have you seen better performance by adding in original features?

@mfeurer @rasbt Regarding the question of whether stacking or ensemble selection is better -- I think it largely depends on your selection of base models. In some very limited experiments, I have found that stacking a collection of models generated via a Bayesian hyperparameter optimization process is not as effective as stacking models found via random search across multiple algorithms -- stacking works much better when you have a more diverse set of models (like you'd get in a random search). This is the method I use in H2O AutoML.
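Roughly, the recipe looks like this; the base models below are hand-picked stand-ins for what would really come out of a random search across algorithm families, and the mlxtend classes are only used for illustration:

```python
# Stacking a diverse pool of base models, without passing the raw features
# to the meta-learner.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from mlxtend.classifier import StackingCVClassifier

X, y = load_breast_cancer(return_X_y=True)

# A deliberately diverse pool (different algorithm families, different biases).
base = [RandomForestClassifier(n_estimators=100, random_state=0),
        GradientBoostingClassifier(random_state=0),
        KNeighborsClassifier(n_neighbors=11),
        LogisticRegression(C=0.1, max_iter=1000)]

stack = StackingCVClassifier(classifiers=base,
                             meta_classifier=LogisticRegression(max_iter=1000),
                             use_probas=True,
                             use_features_in_secondary=False,  # no raw features for the stacker
                             cv=5)

print("stacked CV accuracy:", cross_val_score(stack, X, y, cv=3).mean().round(3))
```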

mfeurer commented 6 years ago

We found that stacking overfits on predictions from models found with Bayesian optimization, while ensemble selection does not. That's the only reason why we implemented ensemble selection instead of stacking (although one could see ensemble selection as a form of stacking if one interprets repeatedly adding a model to the ensemble as weighting it). I don't think there's any comparative study on which ensemble method works best on which kind of data.

ledell commented 6 years ago

@mfeurer I did not compare to ensemble selection, but my take-away is consistent with your finding, which is that stacking is less effective on a set of models found with Bayesian optimization (which tend to be somewhat homogeneous). It would be interesting to see a large study on this!

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs for the next 7 days. Thank you for your contributions.