EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

TPOT optimization gets stuck #1107

Open hanshupe opened 4 years ago

hanshupe commented 4 years ago

The optimization gets stuck and stops improving after many hours, while a fresh run sometimes reaches a better score already in its initial population (!). The dataset is not very complex (50 variables, regression problem, 5000 rows). I used the default settings and also experimented with the parameters of the genetic optimization.

What I noticed is that especially when I have poor performance, my log gets flooded with the following messages:

_pre_test decorator: _random_mutation_operator: num_test=0 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required by RobustScaler..
_pre_test decorator: _random_mutation_operator: num_test=1 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required by RobustScaler..
_pre_test decorator: _random_mutation_operator: num_test=2 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required by RobustScaler..
_pre_test decorator: _random_mutation_operator: num_test=3 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required by RobustScaler..
_pre_test decorator: _random_mutation_operator: num_test=0 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required by RobustScaler.

Sometimes I get these warnings 80 times for a population of 100 pipelines. What exactly does the message mean, and how can I improve the optimization?

Btw.: Is it possible to pass some initial pipelines to TPOT to start with a better initial population?

weixuanfu commented 4 years ago

TPOT may randomly generate invalid pipelines, for example ones with invalid hyperparameter combinations (e.g. calling Logistic Regression with dual=True and penalty=L1). To avoid this, the _pre_test decorator evaluates each such pipeline on a small test set with a maximum sample size of 50.
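To make the idea concrete, here is a minimal sketch of such a pre-test written against scikit-learn only. It is not TPOT's actual `_pre_test` code: the helper name `pre_test`, its signature, and the synthetic data are assumptions made for illustration. It fits a candidate on at most 50 rows and treats any fit-time exception as "invalid".

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def pre_test(estimator, X, y, max_samples=50):
    """Hypothetical helper (not TPOT's API): fit on a small slice and report failures."""
    X_small, y_small = X[:max_samples], y[:max_samples]
    try:
        estimator.fit(X_small, y_small)
    except Exception as exc:  # any fit-time error marks the candidate as invalid
        return False, str(exc)
    return True, ""


# Synthetic binary classification data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

valid = LogisticRegression()
# The combination mentioned above: dual formulation with an L1 penalty.
invalid = LogisticRegression(penalty="l1", dual=True, solver="liblinear")

ok_valid, _ = pre_test(valid, X, y)
ok_invalid, msg_invalid = pre_test(invalid, X, y)
print(ok_valid, ok_invalid, msg_invalid)
```

Testing on a small slice keeps the check cheap, which matters because it runs for every newly generated candidate.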

For now, you can try different random_state values to get a better initial population; a run should be reproducible with the same random_state.

I think it is a great idea to pass in some good initial pipelines. There is a related issue, #296, but we did not implement it in an ideal way (there was a related PR, #502, but we reverted the changes there). Any contributions toward this new feature are welcome.

hanshupe commented 4 years ago

Is a possible reason for "_pre_test decorator: _random_mutation_operator: num_test=0 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required by RobustScaler." that due to some feature selection hyperparameter 0 features are passed to the next step? I will try to change the minimum threshold, maybe that helps.

So if a pipeline is detected as invalid, is it still kept in the population with a low score? In my case it sometimes looks like 80% of the population consists of invalid pipelines; ideally they should not propagate through so many generations.

weixuanfu commented 4 years ago

Is a possible reason for "_pre_test decorator: _random_mutation_operator: num_test=0 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required by RobustScaler." that due to some feature selection hyperparameter 0 features are passed to the next step? I will try to change the minimum threshold, maybe that helps.

Yes, I think so. Changing the configuration of those feature selection operators may help.
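The failure can be reproduced outside TPOT with plain scikit-learn. The sketch below is an assumption-laden illustration, not what TPOT ran: it uses `SelectPercentile` with `percentile=0` as a stand-in for "a selector so strict that it keeps nothing", followed by `RobustScaler` and a `Ridge` regressor chosen arbitrarily. The helper `try_fit` and the synthetic data are hypothetical.

```python
import warnings

import numpy as np
from sklearn.feature_selection import SelectPercentile, f_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# 50 rows, mirroring the pre-test sample size seen in the log (shape=(50, 0)).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X[:, 0] + rng.normal(scale=0.1, size=50)


def try_fit(percentile):
    """Hypothetical helper: return the error message, or None if the fit succeeds."""
    pipe = make_pipeline(
        SelectPercentile(f_regression, percentile=percentile),
        RobustScaler(),
        Ridge(),
    )
    try:
        with warnings.catch_warnings():
            # The selector itself only warns that no features were selected.
            warnings.simplefilter("ignore")
            pipe.fit(X, y)
    except ValueError as exc:
        return str(exc)
    return None


error_strict = try_fit(0)    # selector keeps no columns, so the scaler fails
error_relaxed = try_fit(50)  # selector keeps half the columns, so the fit succeeds
print(error_strict)
print(error_relaxed)
```

In TPOT, the corresponding fix would be to narrow the hyperparameter ranges of the feature selection operators in a custom configuration passed through the `config_dict` parameter, so that overly strict settings are never sampled.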

So if a pipeline is detected as invalid, is it still kept in the population with a low score? In my case it sometimes looks like 80% of the population consists of invalid pipelines; ideally they should not propagate through so many generations.

Invalid pipelines caught by the _pre_test decorator should not enter the population unless the newly generated pipeline from a single GP alteration (crossover, mutation, or random initial generation) fails _pre_test ten times. So in most cases the population should not contain those invalid pipelines.
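The retry behaviour described here can be sketched in a few lines of plain Python. This is not TPOT's implementation: `generate_with_pre_test`, `invalid_share`, and the assumption that each attempt fails independently with a fixed probability are all hypothetical simplifications. It also suggests why the log can be noisy without the population being invalid: every failed attempt would print a message, even when a later attempt succeeds.

```python
import random

MAX_ATTEMPTS = 10  # per the reply above: a candidate gets through after ten failures


def generate_with_pre_test(make_candidate, is_valid, max_attempts=MAX_ATTEMPTS):
    """Return (candidate, passed); fall back to the last candidate if all attempts fail."""
    candidate = None
    for _ in range(max_attempts):
        candidate = make_candidate()
        if is_valid(candidate):
            return candidate, True
    return candidate, False


def invalid_share(failure_rate, population_size=100, seed=0):
    """Simulated share of the population that failed every pre-test attempt."""
    rng = random.Random(seed)

    def make_candidate():
        return rng.random()  # stand-in for a randomly generated pipeline

    def is_valid(candidate):
        return candidate >= failure_rate  # fails with probability `failure_rate`

    results = [
        generate_with_pre_test(make_candidate, is_valid)
        for _ in range(population_size)
    ]
    return sum(not passed for _, passed in results) / population_size


for rate in (0.5, 0.9):
    print(rate, invalid_share(rate))
```

Under the independence assumption, a candidate slips through with probability `failure_rate ** 10`, which is about 0.1% at a 50% failure rate and about 35% at a 90% failure rate. A large share of invalid pipelines in the population would therefore point to a search space where almost every sampled pipeline fails.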