Open hanshupe opened 4 years ago
TPOT may randomly generate invalid pipelines, for example ones using invalid hyperparameter combinations (e.g. calling LogisticRegression with dual=True and penalty='l1'). To avoid this, the _pre_test decorator evaluates each such pipeline on a small test set with a maximum sample size of 50.
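A minimal sketch of what such a pre-test does (the helper name and the random sample are my own, not TPOT internals): fit the candidate estimator on a tiny synthetic sample and reject it if fitting raises an error. The liblinear solver genuinely cannot combine an L1 penalty with dual=True, so that combination fails the check.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical pre-test: fit the candidate on a small random sample
# (50 rows, mirroring TPOT's maximum pre-test sample size) and report
# whether fitting succeeded.
def pre_test_ok(estimator, n_samples=50, n_features=5, seed=0):
    rng = np.random.RandomState(seed)
    X = rng.rand(n_samples, n_features)
    y = rng.randint(0, 2, n_samples)
    try:
        estimator.fit(X, y)
        return True
    except Exception:
        return False

# Invalid combination: liblinear does not support dual=True with an
# L1 penalty, so fit() raises a ValueError and the pre-test fails.
bad = LogisticRegression(penalty="l1", dual=True, solver="liblinear")
good = LogisticRegression(penalty="l2", solver="liblinear")
print(pre_test_ok(bad), pre_test_ok(good))  # False True
```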
So far, you may try a different random_state for a better initial population; the result should be reproducible with the same random_state.
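To illustrate the reproducibility point with a toy sketch (these operator names are placeholders, not TPOT's real pipeline grammar): a population drawn with a fixed seed is fully deterministic, so rerunning with the same random_state reproduces the same starting point, while a different seed explores a different initial population.

```python
import random

# Toy sketch: draw an "initial population" of operator names with a
# seeded RNG. The same seed always yields the same population.
def initial_population(seed, size=5):
    rng = random.Random(seed)
    operators = ["RobustScaler", "SelectPercentile", "RandomForestRegressor"]
    return [rng.choice(operators) for _ in range(size)]

print(initial_population(42))
print(initial_population(42) == initial_population(42))  # True
```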
I think it is a great idea to pass some good initial pipelines. There is a related issue #296, but we did not implement it in an ideal way (there was a related PR #502, but we reverted the changes there). Contributions are welcome for this new feature.
Is a possible reason for "_pre_test decorator: _random_mutation_operator: num_test=0 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required by RobustScaler." that, due to some feature selection hyperparameter, 0 features are passed to the next step? I will try to change the minimum threshold; maybe that helps.
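This failure mode can be reproduced directly with scikit-learn (the synthetic data and the k=0 choice below are mine, standing in for a hyperparameter that happens to select no features): when the feature-selection step keeps 0 features, the next step receives a (50, 0) array and raises exactly the "minimum of 1 is required" error quoted above.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic regression data matching the pre-test sample size of 50.
rng = np.random.RandomState(0)
X, y = rng.rand(50, 5), rng.rand(50)

# k=0 stands in for a feature-selection hyperparameter that keeps no
# features; the selector then passes a (50, 0) array to RobustScaler.
pipe = make_pipeline(SelectKBest(score_func=f_regression, k=0), RobustScaler())
try:
    pipe.fit(X, y)
    error = None
except ValueError as err:
    error = str(err)
print(error)
```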
So if a pipeline is detected as invalid, is it still kept in the population with a low score? In my case it sometimes looks like 80% of the population consists of invalid pipelines; ideally they should not propagate through so many generations.
> Is a possible reason for "_pre_test decorator: _random_mutation_operator: num_test=0 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required by RobustScaler." that, due to some feature selection hyperparameter, 0 features are passed to the next step? I will try to change the minimum threshold; maybe that helps.
Yes, I think so. Changing the configuration of those feature selection operators may help.
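One way to change that configuration is a custom config dictionary; the fragment below follows TPOT's config_dict convention, but the specific percentile floor of 20 is an illustrative assumption, not a recommended default.

```python
# Hypothetical custom TPOT config fragment: restrict SelectPercentile so
# it can never keep fewer than 20% of the features (TPOT's stock config
# allows percentiles as low as 1). Keys are full import paths, values are
# the hyperparameter search ranges.
custom_config = {
    "sklearn.feature_selection.SelectPercentile": {
        "percentile": range(20, 100, 10),  # assumed floor of 20%
        "score_func": {"sklearn.feature_selection.f_regression": None},
    },
}

# Usage sketch: TPOTRegressor(config_dict=custom_config, ...)
print(list(custom_config["sklearn.feature_selection.SelectPercentile"]["percentile"]))
```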
> So if a pipeline is detected as invalid, is it still kept in the population with a low score? In my case it sometimes looks like 80% of the population consists of invalid pipelines; ideally they should not propagate through so many generations.
The invalid pipelines caught by the _pre_test decorator should not pass into the population unless the newly generated pipeline from one GP alteration (crossover, mutation, or random initial generation) fails the _pre_test ten times. So the population should not contain those invalid pipelines in most cases.
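The retry behaviour described above can be sketched as follows (function names and the toy candidate generator are mine, not TPOT internals): a GP operation is re-attempted until its offspring passes the pre-test, giving up after ten failures.

```python
import random

# Hypothetical sketch of the retry logic: keep regenerating a candidate
# until it passes the pre-test; after max_tries failures, the last
# candidate is kept anyway (the rare case where invalid pipelines can
# still enter the population).
def generate_with_pre_test(make_candidate, passes_pre_test, max_tries=10):
    for _ in range(max_tries):
        candidate = make_candidate()
        if passes_pre_test(candidate):
            return candidate
    return candidate  # all attempts failed the pre-test

rng = random.Random(0)
result = generate_with_pre_test(lambda: rng.randint(0, 9),
                                lambda c: c % 2 == 0)
print(result)  # an even number: it passed the toy pre-test
```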
I find that the optimization gets stuck and stops improving after many hours, while a fresh start sometimes gives a better score already in the initial population (!). The dataset is not very complex (50 variables, 5000 rows, regression problem), and I used default settings but also experimented with the parameters of the genetic optimization.
What I noticed is that especially when I have poor performance, my log gets flooded with the following messages:
Sometimes I get those warnings 80 times for a population of 100 pipelines. What exactly does the message mean, and how can I improve the optimization?
By the way: is it possible to pass some initial pipelines to TPOT to start with a better initial population?