EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

Pipeline Complexity over time - How to approach a solution #783

Open GinoWoz1 opened 5 years ago

GinoWoz1 commented 5 years ago

Since TPOT allows lots of freedom for the solution spaces it searches, sometimes TPOT can explore unnecessarily complex pipelines.

Possible fix

Last week I met with @weixuanfu and Trang to discuss how we could solve this problem by looking at pipeline complexity. Based on our conversation we had the ideas below; I am interested in others' ideas.

Ideas

1) Score pipeline complexity by the number of features (final feature set vs. original) and add a complexity penalty.
2) Score pipeline complexity by the number of parameters in a pipeline and add a respective penalty.
3) Provide a validation or test data set in addition to the train data set. Once the best model is scored using cross validation, that same model can be refit against the validation set. If the performance improves on the train set but drops on the validation set, a complexity penalty could be put on that respective pipeline (see the sketch after this list).
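A minimal sketch of how idea 3 could look outside of TPOT itself; `penalized_score` and the `penalty` weight are hypothetical names for illustration, not part of the TPOT API.

```python
from sklearn.base import clone
from sklearn.model_selection import cross_val_score

def penalized_score(pipeline, X_train, y_train, X_valid, y_valid, penalty=1.0):
    """CV score minus a penalty proportional to the train-vs-validation gap."""
    cv_score = cross_val_score(pipeline, X_train, y_train, cv=5).mean()
    valid_score = clone(pipeline).fit(X_train, y_train).score(X_valid, y_valid)
    # If CV looks much better than the held-out validation set, treat the gap
    # as a proxy for an over-complex / memorizing pipeline and subtract it.
    gap = max(cv_score - valid_score, 0.0)
    return cv_score - penalty * gap
```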

adrose commented 5 years ago

I think one issue with solution 1 is that any time you look for nonlinearity using PolynomialFeatures or something of the like, it would incur a large penalty. This is a common problem with the data I work with.

I think one potentially expensive solution would be to look at the stability of the covariance of the predictors, assuming solution #3 is implemented - but I am not sure of the best way to fully tease this out.

Or looking at rank differences in importance metrics, again building off of #3?

I'm going to spend some more time thinking on this; I have run into similar problems with TPOT building very long, complicated pipelines - these were just two potentially terrible ideas that came to mind.

GinoWoz1 commented 5 years ago

Good point on number 1. On number 2, can you please elaborate? On number 3, are you talking about the CV fit on the training set and test set, and then the second idea you mentioned? Interested in what you are thinking.

GinoWoz1 commented 5 years ago

Pushing this further.

Regarding complexity scoring, how would a complexity score be used for future generations' development? Currently there is the operator count, which comes out through the Pareto scores. Would something akin to an additional penalty on the loss score be a feasible way to limit the generational development of overly complex pipelines, or would this need to be approached another way? Curious to hear any thoughts.

adrose commented 5 years ago

Sorry, I should have elaborated more from the beginning.

My idea of looking at the covariance matrices would best apply to decomposition techniques that may have more than one "best" solution. Something like ICA may yield multiple best solutions. Picking this up by looking for variance across covariance matrices would be a simple way to gauge the stability of the decomposition techniques.

However, this would probably be better caught by the CV accuracy scores, so after spending more time thinking about this, it may not be a very good solution.

For my third idea, I thought looking at the stability of importance metrics, within the CV framework TPOT already performs its model building in, could yield a way to gauge the stability of the model predictors on top of the prediction accuracy. I wouldn't expect many large jumps across internal CV importance metrics, but big jumps could point to some potential problems.
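One rough way to measure this fold-to-fold stability, assuming an estimator that exposes `feature_importances_`; the `importance_stability` helper and the choice of Spearman rank correlation are illustrative, not something TPOT provides.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.base import clone
from sklearn.model_selection import KFold

def importance_stability(estimator, X, y, n_splits=5):
    """Mean pairwise Spearman correlation of feature importances across CV folds."""
    X, y = np.asarray(X), np.asarray(y)
    importances = []
    for train_idx, _ in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        model = clone(estimator).fit(X[train_idx], y[train_idx])
        importances.append(model.feature_importances_)
    pairs = [spearmanr(a, b).correlation
             for i, a in enumerate(importances) for b in importances[i + 1:]]
    # Values near 1.0 mean the importance ranking is stable across folds;
    # a big drop flags the kind of instability described above.
    return float(np.mean(pairs))
```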

I think a complexity score could be used as a probability to weight the chance of a specific pipeline's offspring spawning within the GP framework. As described in the TPOT paper, the top 10% of pipelines from generation 1 are placed into generation 2 in the presented example. We could compute a weighted average, or something of the like, between our "complexity score" and the accuracy score, thus allowing simpler pipelines, which may not perform as well, an opportunity to propagate.
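One way to read the weighted-average idea (my own sketch, not TPOT's selection code): blend a normalized accuracy with a normalized simplicity term into a single weight and sample parents in proportion to it.

```python
import numpy as np

def selection_probabilities(accuracies, complexities, w_acc=0.8):
    """Blend accuracy and simplicity into parent-selection probabilities."""
    acc = np.asarray(accuracies, dtype=float)
    simp = 1.0 / np.asarray(complexities, dtype=float)   # simpler pipelines score higher
    acc = (acc - acc.min()) / (np.ptp(acc) or 1.0)       # min-max normalize each term
    simp = (simp - simp.min()) / (np.ptp(simp) or 1.0)
    weights = w_acc * acc + (1.0 - w_acc) * simp
    return weights / weights.sum()

# A simple-but-slightly-weaker pipeline keeps a real chance of propagating:
probs = selection_probabilities(accuracies=[0.90, 0.88, 0.85], complexities=[6, 3, 1])
```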

I am no expert in genetic programming, just a massive fan, so perhaps an expert like @rhiever or @weixuanfu would like to chime in on the feasibility and utility of implementing something like this.

LMK what is not clear.

weixuanfu commented 5 years ago

@GinoWoz1 @adrose, thank you both for posting your ideas here.

For the 1st idea that @GinoWoz1 mentioned, I understand that PolynomialFeatures would carry a larger penalty in pipeline complexity. In fact, TPOT already hard-codes a limit of at most one PolynomialFeatures step per pipeline, because it can double the number of features and make pipeline evaluation very computationally expensive.

But if there is a feature selection step that reduces the feature count before the PolynomialFeatures step in a pipeline, then using one or more PolynomialFeatures steps in the pipeline would not be a problem.

So we are thinking about using (# features output from each operator) / (# features in the input data) to weight the pipeline complexity. In this scheme, feature selectors would receive less penalty and a PolynomialFeatures step at the very beginning of a pipeline would receive more penalty.
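To make that weighting concrete, here is a back-of-the-envelope sketch; the helper below and the example counts are illustrative only, not TPOT internals.

```python
def feature_ratio_complexity(operator_output_counts, n_input_features):
    """Sum of (# features emitted by each operator) / (# features in the raw input)."""
    return sum(n_out / n_input_features for n_out in operator_output_counts)

# e.g. with 178 raw features: a selector that keeps 20 features contributes 20/178,
# a stacking estimator that adds one prediction contributes 179/178, while a
# degree-2 PolynomialFeatures on all 178 raw columns would contribute 16110/178.
```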

Please let me know if you have more thoughts about this idea or other ones. Any thoughts will be helpful. Thanks again!

GinoWoz1 commented 5 years ago

Thanks Weixuan. So will we need to loop through each of the elements in the stacked pipeline, or just calculate it at the end?

One issue I ran into recently is that feature selection done in my pipeline overfit my training set (it was a variance threshold of 0.005 or something). How do we handle pipeline complexity where we don't necessarily have a proliferation of features, but specific parameters are memorizing our training data set?

weixuanfu commented 5 years ago

The best way is to weight each of the elements in the stacked pipeline without refitting partial pipelines.

VarianceThreshold removes all features with a training-set variance lower than the threshold, and the reason for the overfitting may be that the features' variances (or the sample size) differ in the testing set. I think the cross_val_score used in TPOT's optimization should prevent this with 5-fold CV by default, but you may specify a set of train/test splits via cv for this issue.
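For reference, a small example of pinning the split that TPOT evaluates against via its `cv` argument, shown with `TPOTRegressor`; the synthetic data and the single predefined fold are just for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import PredefinedSplit
from tpot import TPOTRegressor

X, y = make_regression(n_samples=500, n_features=20, random_state=0)

# Mark the last 20% of rows as the single evaluation fold (test_fold == 0);
# rows tagged -1 are only ever used for training.
test_fold = np.full(len(X), -1)
test_fold[int(0.8 * len(X)):] = 0

tpot = TPOTRegressor(generations=5, population_size=20, random_state=42,
                     cv=PredefinedSplit(test_fold), verbosity=2)
tpot.fit(X, y)
```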

GinoWoz1 commented 5 years ago

Thanks. Can you illustrate what you are suggesting with the pipeline below? This is just for my understanding.

In the example below I have 178 features feeding into this pipeline raw.

```python
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=RidgeCV()),  # 1 feat
    StackingEstimator(estimator=GradientBoostingRegressor(alpha=0.95, learning_rate=0.1,
        loss="lad", max_depth=3, max_features=0.15000000000000002, min_samples_leaf=20,
        min_samples_split=15, n_estimators=100, subsample=0.6500000000000001)),  # 2 feat
    StackingEstimator(estimator=ElasticNetCV(l1_ratio=0.8, tol=1e-05)),  # 3 feat
    StackingEstimator(estimator=DecisionTreeRegressor(max_depth=5, min_samples_leaf=2,
        min_samples_split=9)),  # 4 feat
    XGBRegressor(learning_rate=0.1, max_depth=6, min_child_weight=2, n_estimators=100,
        nthread=8, subsample=0.9500000000000001)  # final model
)
```

How are you thinking we should calculate this? Also what are your ideas on weighting the pipelines (or penalizing them) for the next generation? Taking the loss score and multiplying it by the "# feature output from each operator/ # feature in input data" weight?

Example

Operator 1: 179/178
Operator 2: 180/179 (+ prediction from first operator)
Operator 3: 181/180 (+ predictions from first 2 operators)
Operator 4: 182/181 (raw + predictions from first 3 operators)

722/718 or 102% of the original features.

weixuanfu commented 5 years ago

I think the complexity is (179 + 180 + 181 + 182) / 178 = 4.056. For this kind of pipeline (without feature selection), the complexity is similar to the operator count (5) used in the current version of TPOT (v0.9.5), but there is a small extra penalty for those stacking estimators. If there is a feature selector in the pipeline, then the complexity should be lower because it reduces the number of features.

GinoWoz1 commented 5 years ago

Awesome, thanks for the explanation. In terms of the top 10% of pipelines being passed on to the next generation, how would this penalty apply? Would you apply a multiplier to the loss score in this case, or weight it by the complexity so as to weed out those pipelines?

For the # of features, how would this method maintain the depth of the search if we penalize the # of variables? Some of my best pipelines have had a Pareto length of 4 or 5. If we weight the # of predictors too heavily, then TPOT may get stuck exploring only Pareto lengths of 1 and 2.

For the template idea you talked about, this would be a non-issue, since you can set the template to do 1) feature transformation, 2) feature selection, then 3) modeling, and you should, in theory, get the best models with the fewest predictors (see the sketch below).

Interested in hearing your thoughts.
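On the template idea, newer TPOT releases expose a `template` argument that enforces exactly this kind of fixed pipeline structure; the example below assumes a TPOT version that supports it and uses synthetic data for illustration.

```python
from sklearn.datasets import make_regression
from tpot import TPOTRegressor

X, y = make_regression(n_samples=500, n_features=50, random_state=0)

# Constrain every candidate pipeline to: feature transformation ->
# feature selection -> final model, as described above.
tpot = TPOTRegressor(generations=5, population_size=20, random_state=42,
                     template="Transformer-Selector-Regressor", verbosity=2)
tpot.fit(X, y)
```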

weixuanfu commented 5 years ago

TPOT will still use NSGA-II selection for 2-objective optimization (score and complexity) to sort pipelines and then pass the top 10% of pipelines to the next generation. I am not worried that the Pareto length will be too short if we penalize the # of variables. On the contrary, if a simpler pipeline with fewer intermediate features after a feature selection step can reach an accuracy score similar to a pipeline using all features, we are more interested in studying those intermediate features.
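For reference, a condensed sketch of what two-objective NSGA-II selection looks like with DEAP (which TPOT builds on); the fitness tuple and setup here are my own illustration, not TPOT's actual internals.

```python
import random
from deap import base, creator, tools

# Maximize the first objective (accuracy), minimize the second (complexity).
creator.create("FitnessMulti", base.Fitness, weights=(1.0, -1.0))
creator.create("Individual", list, fitness=creator.FitnessMulti)

population = []
for _ in range(100):
    ind = creator.Individual()
    # (accuracy, complexity) stand in for a real evaluated pipeline here.
    ind.fitness.values = (random.random(), random.uniform(1.0, 6.0))
    population.append(ind)

# Non-dominated sorting + crowding distance keeps simple-but-accurate
# pipelines alongside complex-but-more-accurate ones in the top 10%.
survivors = tools.selNSGA2(population, k=10)
```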