EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

Questions about the StackingEstimator #690

Open avdusen opened 6 years ago

avdusen commented 6 years ago

I ran a short regression test with a small data set. Here is the TPOT input: `tpot_optimizer = TPOTRegressor(generations=5, population_size=20, scoring='neg_median_absolute_error', cv=5, random_state=42, verbosity=2)`

Here is the best pipeline output: `Best pipeline: ExtraTreesRegressor(XGBRegressor(LassoLarsCV(PolynomialFeatures(RidgeCV(input_matrix), degree=2, include_bias=False, interaction_only=False), normalize=True), learning_rate=0.1, max_depth=2, min_child_weight=4, n_estimators=100, nthread=1, subsample=0.5), bootstrap=True, max_features=0.45, min_samples_leaf=6, min_samples_split=15, n_estimators=100)`

Here is the relevant part of the exported Python file:

    exported_pipeline = make_pipeline(
        StackingEstimator(estimator=RidgeCV()),
        PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
        StackingEstimator(estimator=LassoLarsCV(normalize=True)),
        StackingEstimator(estimator=XGBRegressor(learning_rate=0.1, max_depth=2, min_child_weight=4, n_estimators=100, nthread=1, subsample=0.5)),
        ExtraTreesRegressor(bootstrap=True, max_features=0.45, min_samples_leaf=6, min_samples_split=15, n_estimators=100)
    )

Question 1: Is the following interpretation of the order of the steps correct?

  1. raw attributes -> RidgeCV -> predictions
  2. raw attributes -> PolynomialFeatures -> LassoLarsCV -> predictions
  3. raw attributes -> PolynomialFeatures (?) -> XGBRegressor -> predictions
  4. prediction 1, prediction 2, prediction 3 -> ExtraTreesRegressor -> final predictions
    • is this using steps 1, 2, and 3 in parallel and then using ExtraTreesRegressor as the meta-learner for stacking?
    • is PolynomialFeatures applied to both LassoLarsCV and XGBRegressor, or only to the former?

Question 2: Is it possible to turn off stacking?

weixuanfu commented 6 years ago

For Question 1, the steps are:

  1. raw attributes -> RidgeCV -> 1st predictions
  2. raw attributes + 1st predictions -> PolynomialFeatures -> 1st transformed attributes -> LassoLarsCV -> 2nd predictions
  3. 1st transformed attributes + 2nd predictions -> XGBRegressor -> 3rd predictions
  4. 1st transformed attributes + 2nd predictions + 3rd predictions -> ExtraTreesRegressor -> final predictions
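
The role of `StackingEstimator` in these steps can be sketched in a few lines: it fits an inner model and appends that model's predictions to the feature matrix, which is why each later step sees the earlier predictions as extra columns. This is a simplified illustration (regression case only), not TPOT's actual implementation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

class SimpleStackingEstimator:
    """Minimal sketch of TPOT's StackingEstimator (regression case):
    fit an inner model, then append its predictions as a new feature
    column so downstream steps see raw attributes + predictions."""

    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        self.estimator.fit(X, y)
        return self

    def transform(self, X):
        preds = self.estimator.predict(X).reshape(-1, 1)
        # augmented matrix: original features plus one prediction column
        return np.hstack([X, preds])

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 2.0, 4.0, 6.0])

stack = SimpleStackingEstimator(LinearRegression()).fit(X, y)
X_aug = stack.transform(X)
print(X_aug.shape)  # one extra column: (4, 2)
```

TPOT's real `StackingEstimator` follows the same scikit-learn transformer API; for classifiers it also appends the class-probability columns, not just the predicted labels.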

For Question 2.

For now, TPOT does not provide this option. But:

One of my dev branches of TPOT, called noCDF_noStacking, has an option named simple_pipeline, which disables both StackingEstimator and CombineDFs when simple_pipeline=True (e.g. `TPOTClassifier(simple_pipeline=True)`). Note that this dev branch is not fully tested yet. If you want to try TPOT without StackingEstimator and FeatureUnion, you can install the branch in a test environment via the command below:

`pip install --upgrade --no-deps --force-reinstall git+https://github.com/weixuanfu/tpot.git@noCDF_noStacking`

Please check #152 for more details. We are working on a more advanced pipeline configuration option.
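
Until then, one partial workaround with the released TPOT is the documented `config_dict` parameter, which replaces the default operator search space with your own. Shrinking the operator pool does not by itself disable StackingEstimator (TPOT can still wrap any estimator in the pool when it lands mid-pipeline), but it does narrow what can appear in the search. A sketch, with hypothetical hyperparameter choices:

```python
# Sketch of TPOT's documented `config_dict` format: a dict mapping
# operator import paths to hyperparameter grids. The operators and
# grids below are illustrative choices, not recommendations.
tpot_config = {
    'sklearn.ensemble.ExtraTreesRegressor': {
        'n_estimators': [100],
        'max_features': [0.45, 1.0],
    },
    'sklearn.preprocessing.PolynomialFeatures': {
        'degree': [2],
        'include_bias': [False],
    },
}

# Passed in as: TPOTRegressor(config_dict=tpot_config, generations=5, ...)
```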

avdusen commented 6 years ago

@weixuanfu, thank you for your prompt answer.

You may want to add this explanation to the documentation. Also, here is something to add to what I am sure is a long "to do" list: use Graphviz to print a tree-structure image of the best pipeline. This would make it easier for users to understand the data flow in the pipeline.
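
In the meantime, a rough version of that Graphviz idea can be hand-rolled: build a DOT string from the exported pipeline's step names and render it with the `dot` tool. The helper below is hypothetical (not part of TPOT), and the step names are taken from the exported pipeline above:

```python
def pipeline_to_dot(steps):
    """Build a Graphviz DOT string for a linear pipeline: each step is
    a box node, with edges following the data flow left to right.
    Render with e.g. `dot -Tpng pipeline.dot -o pipeline.png`."""
    lines = ["digraph pipeline {", "  rankdir=LR;"]
    for i, name in enumerate(steps):
        lines.append(f'  s{i} [label="{name}", shape=box];')
    for i in range(len(steps) - 1):
        lines.append(f"  s{i} -> s{i + 1};")
    lines.append("}")
    return "\n".join(lines)

steps = [
    "StackingEstimator(RidgeCV)",
    "PolynomialFeatures",
    "StackingEstimator(LassoLarsCV)",
    "StackingEstimator(XGBRegressor)",
    "ExtraTreesRegressor",
]
print(pipeline_to_dot(steps))
```

Note this draws only the linear chain of the exported scikit-learn pipeline; it does not show the implicit "raw attributes + predictions" feature augmentation described above.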