EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

Custom Scorer (WMAE) fails TPOT optimization process #1063

Open ngrajales1 opened 4 years ago

ngrajales1 commented 4 years ago

Passing my own scorer, which calculates the weighted mean absolute error for a regression problem, results in an error.

Context of the issue

I followed the instructions on the TPOT documentation page to create my own scorer for a regression problem (I am trying to use TPOT for the Walmart Kaggle competition, https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting). The scorer I created calculates the weighted mean absolute error. I am pasting my code below:

from sklearn.metrics import mean_absolute_error

def mape_score(y_test, y_pred, weights):
    mean_absolute_error(y_test, y_pred, sample_weight= weights)

from sklearn.metrics import make_scorer 
my_custom_scorer = make_scorer(mape_score, greater_is_better=False)

tpot = TPOTRegressor(generations=5, population_size=50, verbosity=2, random_state=42, n_jobs=-1, scoring=my_custom_scorer)

with joblib.parallel_backend("dask"):
    tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

weights, y_test, and y_pred are all of type pandas.core.series.Series.

I am also using a local Dask cluster to distribute my workload. Please let me know whether this is a user error or something that may need to be looked into.

Current result


RuntimeError                              Traceback (most recent call last)
/opt/anaconda3/envs/Nelson_Dask/lib/python3.8/site-packages/tpot/base.py in fit(self, features, target, sample_weight, groups)
    699                 warnings.simplefilter('ignore')
--> 700                 self._pop, _ = eaMuPlusLambda(
    701                     population=self._pop,

/opt/anaconda3/envs/Nelson_Dask/lib/python3.8/site-packages/tpot/gp_deap.py in eaMuPlusLambda(population, toolbox, mu, lambda_, cxpb, mutpb, ngen, pbar, stats, halloffame, verbose, per_generation_function)
    235         if per_generation_function is not None:
--> 236             per_generation_function(gen)
    237         # Vary the population

/opt/anaconda3/envs/Nelson_Dask/lib/python3.8/site-packages/tpot/base.py in _check_periodic_pipeline(self, gen)
   1002         """
-> 1003         self._update_top_pipeline()
   1004         if self.periodic_checkpoint_folder is not None:

/opt/anaconda3/envs/Nelson_Dask/lib/python3.8/site-packages/tpot/base.py in _update_top_pipeline(self)
    792         if not self._optimized_pipeline:
--> 793             raise RuntimeError('There was an error in the TPOT optimization '
    794                                'process. This could be because the data was '

RuntimeError: There was an error in the TPOT optimization process. This could be because the data was not formatted properly, or because data for a regression problem was provided to the TPOTClassifier object. Please make sure you passed the data to TPOT correctly.

During handling of the above exception, another exception occurred:

RuntimeError Traceback (most recent call last)

in
      1 with joblib.parallel_backend("dask"):
----> 2     tpot.fit(X_train, y_train)
      3 print(tpot.score(X_test, y_test))
      4 tpot.export('tpot_Nelson_pipeline.py')

/opt/anaconda3/envs/Nelson_Dask/lib/python3.8/site-packages/tpot/base.py in fit(self, features, target, sample_weight, groups)
    740                 # raise the exception if it's our last attempt
    741                 if attempt == (attempts - 1):
--> 742                     raise e
    743             return self
    744

/opt/anaconda3/envs/Nelson_Dask/lib/python3.8/site-packages/tpot/base.py in fit(self, features, target, sample_weight, groups)
    731                     self._pbar.close()
    732
--> 733                 self._update_top_pipeline()
    734                 self._summary_of_best_pipeline(features, target)
    735             # Delete the temporary cache before exiting

/opt/anaconda3/envs/Nelson_Dask/lib/python3.8/site-packages/tpot/base.py in _update_top_pipeline(self)
    791
    792         if not self._optimized_pipeline:
--> 793             raise RuntimeError('There was an error in the TPOT optimization '
    794                                'process. This could be because the data was '
    795                                'not formatted properly, or because data for '

RuntimeError: There was an error in the TPOT optimization process. This could be because the data was not formatted properly, or because data for a regression problem was provided to the TPOTClassifier object. Please make sure you passed the data to TPOT correctly.

weixuanfu commented 4 years ago

I think the main problem here is that the weights argument of mape_score(y_test, y_pred, weights) cannot be used correctly in K-fold CV: each fold holds out a different subset of samples as y_test, so the weights would have to be subset to match those samples. I think this is related to #1039, and I have a hacky demo that may help you.
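
To make the fold-alignment point concrete, here is a minimal, library-free sketch. It is not the hacky demo mentioned above and does not reproduce TPOT's internals. The helper names weighted_mae, k_fold_indices, and fold_scores are hypothetical, and the data is made up. It only shows that the weights have to be sliced with the same indices as the held-out targets before the metric is computed. Two related notes: scikit-learn's make_scorer calls the metric as score_func(y_true, y_pred, **kwargs), so a third required positional argument such as weights receives no value unless it is passed as a keyword to make_scorer, and the mape_score function as posted has no return statement.

```python
def weighted_mae(y_true, y_pred, weights):
    """Weighted mean absolute error: sum(w * |y - y_hat|) / sum(w)."""
    total_weight = sum(weights)
    weighted_errors = sum(w * abs(t - p) for t, p, w in zip(y_true, y_pred, weights))
    return weighted_errors / total_weight


def k_fold_indices(n_samples, n_splits):
    """Yield the test indices of contiguous, unshuffled K-fold splits."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        yield list(range(start, start + size))
        start += size


def fold_scores(y, y_pred, weights, n_splits):
    """Score each fold using only the weights that belong to that fold."""
    scores = []
    for test_idx in k_fold_indices(len(y), n_splits):
        y_fold = [y[i] for i in test_idx]
        pred_fold = [y_pred[i] for i in test_idx]
        w_fold = [weights[i] for i in test_idx]  # same indices as y_fold
        scores.append(weighted_mae(y_fold, pred_fold, w_fold))
    return scores


# Hypothetical toy data: some samples carry weight 5, the rest weight 1.
y = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
y_pred = [12.0, 18.0, 33.0, 40.0, 45.0, 66.0]
weights = [1, 5, 1, 1, 5, 1]

scores = fold_scores(y, y_pred, weights, n_splits=3)
print(scores)
```

Passing the full six-element weights list together with a two-element fold would either fail or silently pair weights with the wrong samples, depending on the metric implementation. Whether a scorer can recover the matching weights inside TPOT depends on how TPOT passes the fold data to the scorer, which this sketch does not cover.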