EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

TPOTEnsemble idea #479

Open rhiever opened 7 years ago

rhiever commented 7 years ago

Many people have been asking for a version of TPOT that creates ensembles of pipelines, as that's what often wins Kaggle competitions etc. We've created prototypes of TPOT that ensemble the Pareto front or final population, but those prototypes didn't work so well because TPOT pipelines are optimized to perform well on a dataset by themselves. In other words, there is no pressure from TPOT to create pipelines that work well with other pipelines.

Here's my proposal for allowing TPOT to create ensembles of pipelines: What if we treated the TPOT optimization procedure as a sort of boosting procedure? It could work as follows:

1) Create the initial population (P0) and evaluate it on the dataset as normal.
2) Take the best pipeline from P0 and put it into a VotingClassifier.
3) Generate the next population (P1) using the normal fitness scores.
4) When evaluating the individuals in P1, compute their fitness by evaluating them in the VotingClassifier alongside the best pipeline from P0.
5) Take the best pipeline from P1 and put it into the VotingClassifier with the best pipeline from P0.
6) Generate the next population using these "ensemble fitness scores".
7) Evaluate the pipelines in the new generation by evaluating them in a VotingClassifier with the best individuals from the previous generations.
8) etc.

That way, TPOT is directly optimizing for pipelines that ensemble well with the previously-best pipelines, and the final ensemble is composed of one pipeline from each generation. Is this idea crazy enough to work?
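The per-generation loop described above might be sketched roughly as follows. This is an illustration only, not TPOT's actual implementation: `hall_of_fame`, `population`, and `ensemble_fitness` are hypothetical names, and the toy estimators stand in for evolved TPOT pipelines.

```python
# Hypothetical sketch of the per-generation "ensemble fitness" step.
# `hall_of_fame` holds the best pipeline from each previous generation;
# a candidate is scored by how well it votes alongside them.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=42)

def ensemble_fitness(candidate, hall_of_fame, X, y):
    """Cross-validated score of a VotingClassifier built from the
    hall-of-fame pipelines plus the candidate."""
    members = [("best_%d" % i, est) for i, est in enumerate(hall_of_fame)]
    members.append(("candidate", candidate))
    ensemble = VotingClassifier(estimators=members)
    return cross_val_score(ensemble, X, y, cv=3).mean()

hall_of_fame = [LogisticRegression(max_iter=1000)]  # best of P0 (step 2)
population = [DecisionTreeClassifier(max_depth=d) for d in (1, 3, 5)]  # P1

# Steps 4-5: score P1 inside the ensemble, keep its best member.
scores = [ensemble_fitness(p, hall_of_fame, X, y) for p in population]
best = population[scores.index(max(scores))]
hall_of_fame.append(best)  # carried forward to score P2
```

Note that each fitness evaluation here refits every hall-of-fame member, which is exactly the cost problem discussed below.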

rhiever commented 7 years ago

I made a hacky demo of the TPOTEnsemble idea in this commit.

It seemed to work fine in my tests, although it gets much, much slower as the generations pass because, e.g., by generation 100 every pipeline is being evaluated in a VotingClassifier with 99 other pipelines. The only reasonable solution seems to be to store the predictions of each "best" pipeline from every generation, and manually ensemble those predictions with the new predictions from the pipelines in the current generation.

Of course, there will be no way around storing the entire pipeline list in a VotingClassifier for new predictions in the TPOT predict and score functions. But at least the above solution will save evaluating the same list of pipelines over and over again.
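The prediction-caching idea could be sketched as below, using soft voting over stored out-of-fold class probabilities. The names (`cached_proba`, `oof_proba`) are illustrative, not TPOT internals; only the candidate is refit per evaluation.

```python
# Sketch of caching each generation's best pipeline's out-of-fold
# probabilities so that scoring a new candidate only requires
# fitting the candidate, not re-fitting the whole ensemble.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

def oof_proba(pipeline, X, y):
    """Out-of-fold class probabilities for one pipeline."""
    return cross_val_predict(pipeline, X, y, cv=3, method="predict_proba")

# Computed once per generation and cached thereafter.
cached_proba = [oof_proba(LogisticRegression(max_iter=1000), X, y)]

def cached_ensemble_fitness(candidate, cached_proba, X, y):
    """Soft-vote by averaging the candidate's probabilities with the
    cached ones; only the candidate is fit here."""
    avg = np.mean(cached_proba + [oof_proba(candidate, X, y)], axis=0)
    return np.mean(avg.argmax(axis=1) == y)

score = cached_ensemble_fitness(DecisionTreeClassifier(max_depth=3),
                                cached_proba, X, y)
```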

reiinakano commented 7 years ago

Check this out: https://github.com/scikit-learn/scikit-learn/pull/8960

In the next release, scikit-learn is probably going to get an implementation of stacking classifier, so TPOT might be able to search stacked ensembles the same way it searches pipelines.
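For reference, stacking did eventually ship in scikit-learn (0.22+) as `sklearn.ensemble.StackingClassifier`. A minimal usage sketch, with arbitrary placeholder estimators:

```python
# Minimal StackingClassifier example: base estimators' cross-validated
# predictions become the features of a meta-model (final_estimator).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, random_state=1)

stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=10, random_state=1)),
        ("svc", LinearSVC(random_state=1)),
    ],
    final_estimator=LogisticRegression(),  # meta-model on base predictions
    cv=3,
)
stack.fit(X, y)
acc = stack.score(X, y)
```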

rhiever commented 7 years ago

Awesome. I look forward to the next release, then!

simonzcaiman commented 7 years ago

An ensemble of pipelines would be a great improvement for TPOT! Would it be even better if there were a stacking-model selection as well? For example, if one does not want to use a VotingClassifier as the stacking model, could they use another TPOT pipeline optimization to choose the best stacking model?

rhiever commented 7 years ago

@simonzcaiman, this is certainly something we should discuss now, before we move forward with an actual implementation of TPOTEnsemble. It seems like a good idea to allow different ensemble methods, but I only know of the ones in VotingClassifier from sklearn. Are there other ensemble methods (preferably with a sklearn-like interface) that we should be aware of?

sashml commented 7 years ago

Are there other ensemble methods (preferably with a sklearn-like interface) that we should be aware of?

Not sure if you should, but Sebastian has his own Stacker here: https://rasbt.github.io/mlxtend/user_guide/regressor/StackingRegressor/

rhiever commented 7 years ago

Dropping an idea here while it's on my mind: Maybe the original approach to TPOTEnsemble is not good because it requires too many expensive evaluations every generation. Perhaps a better approach would be similar to what @lacava does in FEW:

1) Take the entire TPOT population and stack the outputs into a feature matrix.
2) Fit a regularized linear model (preferably Lasso) on the feature matrix.
3) Use the linear model coefficients as the fitness of each pipeline.

After the first generation, all pipelines with a 0 coefficient will be removed from the TPOT ensemble.

At generation 1 (and beyond), all pipelines in the new population will be added to the TPOT ensemble along with the surviving pipelines currently in the TPOT ensemble. Stack all of the outputs, fit a regularized linear model, and again use the coefficients as the fitness.
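The Lasso-coefficient fitness could be sketched as follows. This is an illustration of the idea, not TPOT's (or FEW's) implementation; the toy estimators stand in for evolved pipelines, and the `alpha` value is an arbitrary assumption.

```python
# Sketch of Lasso-based fitness: stack each pipeline's out-of-fold
# predictions into one column of a feature matrix, fit a Lasso on it,
# and read each pipeline's fitness off its coefficient magnitude.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Lasso, LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=2)

population = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(max_depth=3),
    GaussianNB(),
]

# One column of out-of-fold predictions per pipeline.
Z = np.column_stack([cross_val_predict(p, X, y, cv=3) for p in population])

lasso = Lasso(alpha=0.01).fit(Z, y)
fitness = np.abs(lasso.coef_)  # per-pipeline fitness

# Pipelines driven to a zero coefficient drop out of the ensemble.
survivors = [p for p, w in zip(population, fitness) if w > 0]
```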

Maybe something we can collaborate on, @lacava?

lacava commented 7 years ago

@rhiever sounds like a good idea. You could use it with any method that admits some kind of feature score, e.g. Lasso, random forests, etc., and perhaps even with stacking, if stacking can be made to score the models it uses in its ensemble.

jonathanng commented 7 years ago

Another strategy would be to use a randomized forest and take its feature importance weights as the fitness.
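That variant might look like the sketch below, swapping the Lasso for a random forest fit on the stacked pipeline predictions and using `feature_importances_` as the fitness. Again, this is illustrative only, with placeholder estimators in place of evolved pipelines.

```python
# Forest-importance fitness: fit a random forest on the stacked
# out-of-fold pipeline predictions; each pipeline's fitness is the
# importance of its prediction column.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, random_state=3)
population = [LogisticRegression(max_iter=1000), GaussianNB()]

# One column of out-of-fold predictions per pipeline.
Z = np.column_stack([cross_val_predict(p, X, y, cv=3) for p in population])

forest = RandomForestClassifier(n_estimators=50, random_state=3).fit(Z, y)
fitness = forest.feature_importances_  # sums to 1 across pipelines
```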