EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

warm_start on new data set, keep population but update scores #881

Open DreHar opened 5 years ago

DreHar commented 5 years ago

Hi all,

I am using TPOT to build pipelines that will generalise across several data sets. I would like to use the 'warm_start' parameter to re-initialise with the same best pipelines in my population but re-score them. Currently, warm_start checks whether a pipeline has already been evaluated and reuses its past score (which relates to the old data set).

I am not sure where would be best to make the change; perhaps a parameter in fit such that, if warm_start is set, we would re-evaluate the current best population and ignore or clear past pipeline scores? If anyone could point me in the direction of where I could make a change, or suggest a better way to do this, that would be awesome.

Thanks!

Drew

weixuanfu commented 5 years ago

Thank you for the idea.

TPOT may need to add a refit option for warm_start that discards the fitness values from the previous run.

For now, there is a hacky way to refit TPOT on another dataset. Please check the demo below:

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
                                                    digits.data, 
                                                    digits.target,
                                                    train_size=0.5, 
                                                    test_size=0.5, 
                                                    random_state=42
                                                    )

tpot = TPOTClassifier(
                        generations=3, 
                        population_size=10, 
                        verbosity=2, 
                        random_state=42, 
                        warm_start=True
                        )

tpot.fit(X_train, y_train)

# remove fitness values from the previous run so warm_start re-evaluates the population
for ind in tpot._pop:
    del ind.fitness.values
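
# optional sanity check (not in the original demo): DEAP marks a fitness as
# invalid once its values are deleted, so every individual should now report
# fitness.valid == False and be re-evaluated on the next fit
assert not any(ind.fitness.valid for ind in tpot._pop)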

# refit tpot to another dataset
tpot.fit(X_test, y_test)

DreHar commented 5 years ago

Thank you so much for the reply; a refit option would be really powerful. But this is great and looks to be just what I need!

Once tpot._pop is cleared, on the refit run I get a lot of the following warning (with verbosity=3, using my own data as well as the digits data in your example). Can it be ignored?

_pre_test decorator: _random_mutation_operator: num_test=0 a must not be a non-empty

When using verbosity=3 it also looks like the pareto front scores are kept between fits. I'm actually not sure whether this is a desirable feature, but it could be surprising because scores across data sets may not be comparable. Then again, resetting the pareto front across calls to fit() isn't completely accurate either. I am not sure how this would best be handled if a refit() were to be implemented.
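
To see what is being kept, the front can be inspected between fits. A minimal sketch, assuming tpot._pareto_front is a DEAP ParetoFront (which exposes parallel items and keys lists):

# print each pareto-front pipeline together with its stored fitness
for pipeline, score in zip(tpot._pareto_front.items, tpot._pareto_front.keys):
    print(score, pipeline)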

weixuanfu commented 5 years ago

The warning _pre_test decorator: _random_mutation_operator: num_test=0 a must not be a non-empty can be ignored, since the _pre_test decorator just tests for invalid pipelines and replaces them with valid ones.

I think resetting the pareto front for a different dataset is necessary, because the scores across datasets should be different. I added some code to the demo for resetting the pareto front:

# remove fitness values from the previous run
for ind in tpot._pop:
    del ind.fitness.values

# reset pareto front
tpot._last_optimized_pareto_front = None
tpot._last_optimized_pareto_front_n_gens = 0
tpot._pareto_front = None

# refit tpot to another dataset
tpot.fit(X_test, y_test)

DreHar commented 5 years ago

It looks like the demo above won't clear out the halloffame, which looks to be a DEAP object; so when I do the refit, I can get stuck on the pipeline that had the best score on any of the past data sets used. For example, take the demo above and swap the train and test sets around (because you can get a better CV score on the test half than on the train half). I also simplified the parameter set so it is a bit easier to see what's going on, and made the population size excessive.

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
                        digits.data,
                        digits.target,
                        train_size=0.5,
                        test_size=0.5,
                        random_state=42
                        )

params = {
    'sklearn.tree.DecisionTreeClassifier': {
        'criterion': ["gini"],
        'max_depth': range(1, 10),
        'min_samples_split': range(2, 10),
        'min_samples_leaf': range(1, 10)
    }
}

tpot = TPOTClassifier(
                generations=3,
                population_size=1000,
                verbosity=2,
                random_state=42,
                warm_start=True,
                config_dict=params
                )

# fit on the test half first, which can yield the higher CV score
tpot.fit(X_test, y_test)

# remove fitness values as before; note that the pareto front / halloffame is not reset
for ind in tpot._pop:
    del ind.fitness.values

# refit on the train half; TPOT can stay stuck on the halloffame pipeline from the first fit
tpot.fit(X_train, y_train)

DreHar commented 5 years ago

Fantastic! Thank you so much. That fixes my above comment as well. If I understand correctly, that is because tpot's pareto_front is the halloffame.

weixuanfu commented 5 years ago

Yes, pareto_front is the halloffame.
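
For reference, the resets from this thread can be collected into one helper. This is a minimal sketch based on the code above; it relies on TPOT's private attributes (_pop, _pareto_front, _last_optimized_pareto_front, _last_optimized_pareto_front_n_gens), which may change between releases:

def reset_tpot_for_refit(tpot):
    """Clear cached scores and the pareto front / halloffame so a
    warm-started TPOT re-evaluates its population on new data."""
    # invalidate cached fitness values so warm_start re-scores every individual
    for ind in tpot._pop:
        del ind.fitness.values
    # reset the pareto front, which doubles as the DEAP halloffame
    tpot._last_optimized_pareto_front = None
    tpot._last_optimized_pareto_front_n_gens = 0
    tpot._pareto_front = None

# usage:
# tpot.fit(X_train, y_train)
# reset_tpot_for_refit(tpot)
# tpot.fit(X_test, y_test)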