EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

Error while fitting #1299

Open mattia-lecci opened 1 year ago

mattia-lecci commented 1 year ago

I used TPOTRegressor on my dataset, adding and removing input features across different tests. When I use all 18 features of my 28 data points together with sample_weight, TPOT fails to fit with a ValueError. The error does not occur when I remove the sample_weight.

It also does not occur on the same dataset when I use, for example, only 10 of those 18 features, or on a different dataset with 8 features and 55 data points.

Process to reproduce the issue

I'm afraid I cannot share the data. This is a mockup of the code I used:

import numpy as np
import pandas as pd
import tpot

# Placeholder data with the same shapes as my (private) dataset:
# train_x: (28, 18), train_y: (28,), train_weight: (28,)
rng = np.random.default_rng(42)
train_x = pd.DataFrame(rng.random((28, 18)))
train_y = pd.Series(rng.random(28))
train_weight = pd.Series(rng.random(28))

model = tpot.TPOTRegressor(generations=50, population_size=20, cv=5, random_state=42, verbosity=2)
model.fit(features=train_x, target=train_y, sample_weight=train_weight)

The same result is obtained when using .values on the pandas variables.

Yields:


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File .\.venv\lib\site-packages\tpot\base.py:816, in TPOTBase.fit(self, features, target, sample_weight, groups)
    815         warnings.simplefilter("ignore")
--> 816         self._pop, _ = eaMuPlusLambda(
    817             population=self._pop,
    818             toolbox=self._toolbox,
    819             mu=self.population_size,
    820             lambda_=self._lambda,
    821             cxpb=self.crossover_rate,
    822             mutpb=self.mutation_rate,
    823             ngen=self.generations,
    824             pbar=self._pbar,
    825             halloffame=self._pareto_front,
    826             verbose=self.verbosity,
    827             per_generation_function=self._check_periodic_pipeline,
    828             log_file=self.log_file_,
    829         )
    831 # Allow for certain exceptions to signal a premature fit() cancellation

File .\.venv\lib\site-packages\tpot\gp_deap.py:228, in eaMuPlusLambda(population, toolbox, mu, lambda_, cxpb, mutpb, ngen, pbar, stats, halloffame, verbose, per_generation_function, log_file)
    226     initialize_stats_dict(ind)
--> 228 population[:] = toolbox.evaluate(population)
    230 record = stats.compile(population) if stats is not None else {}

File .\.venv\lib\site-packages\tpot\base.py:1531, in TPOTBase._evaluate_individuals(self, population, features, target, sample_weight, groups)
   1530 self._stop_by_max_time_mins()
-> 1531 val = partial_wrapped_cross_val_score(
   1532     sklearn_pipeline=sklearn_pipeline
   1533 )
   1534 result_score_list = self._update_val(val, result_score_list)

File .\.venv\lib\site-packages\stopit\utils.py:145, in base_timeoutable.__call__.<locals>.wrapper(*args, **kwargs)
    144     # ``result`` may not be assigned below in case of timeout
--> 145     result = func(*args, **kwargs)
    146 return result

File .\.venv\lib\site-packages\tpot\gp_deap.py:416, in _wrapped_cross_val_score(sklearn_pipeline, features, target, cv, scoring_function, sample_weight, groups, use_dask)
    393 """Fit estimator and compute scores for a given dataset split.
    394 
    395 Parameters
   (...)
    414     Whether to use dask
    415 """
--> 416 sample_weight_dict = set_sample_weight(sklearn_pipeline.steps, sample_weight)
    418 features, target, groups = indexable(features, target, groups)

File .\.venv\lib\site-packages\tpot\operator_utils.py:111, in set_sample_weight(pipeline_steps, sample_weight)
    110 for (pname, obj) in pipeline_steps:
--> 111     if inspect.getargspec(obj.fit).args.count("sample_weight"):
    112         step_sw = pname + "__sample_weight"

File ~\AppData\Local\Programs\Python\Python310\lib\inspect.py:1245, in getargspec(func)
   1244 if kwonlyargs or ann:
-> 1245     raise ValueError("Function has keyword-only parameters or annotations"
   1246                      ", use inspect.signature() API which can support them")
   1247 return ArgSpec(args, varargs, varkw, defaults)

ValueError: Function has keyword-only parameters or annotations, use inspect.signature() API which can support them

During handling of the above exception, another exception occurred:

RuntimeError                              Traceback (most recent call last)
Cell In[58], line 4
      1 import tpot
      3 tmp = tpot.TPOTRegressor(generations=50, population_size=20, cv=5, random_state=seed, verbosity=2)
----> 4 tmp.fit(features=train_x.values, target=train_y.values, sample_weight=train_weight.values)
      5 # tpot_train_y = tmp.predict(train_x)
      6 # tpot_test_y = tmp.predict(test_x)

File .\.venv\lib\site-packages\tpot\base.py:863, in TPOTBase.fit(self, features, target, sample_weight, groups)
    860     except (KeyboardInterrupt, SystemExit, Exception) as e:
    861         # raise the exception if it's our last attempt
    862         if attempt == (attempts - 1):
--> 863             raise e
    864 return self

File .\.venv\lib\site-packages\tpot\base.py:854, in TPOTBase.fit(self, features, target, sample_weight, groups)
    851 if not isinstance(self._pbar, type(None)):
    852     self._pbar.close()
--> 854 self._update_top_pipeline()
    855 self._summary_of_best_pipeline(features, target)
    856 # Delete the temporary cache before exiting

File .\.venv\lib\site-packages\tpot\base.py:961, in TPOTBase._update_top_pipeline(self)
    957             self._last_optimized_pareto_front_n_gens = 0
    958 else:
    959     # If user passes CTRL+C in initial generation, self._pareto_front (halloffame) shoule be not updated yet.
    960     # need raise RuntimeError because no pipeline has been optimized
--> 961     raise RuntimeError(
    962         "A pipeline has not yet been optimized. Please call fit() first."
    963     )

RuntimeError: A pipeline has not yet been optimized. Please call fit() first.
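
For what it's worth, the ValueError at the bottom of the first traceback is raised by inspect.getargspec in tpot/operator_utils.py, which refuses any fit() whose signature contains keyword-only parameters or annotations. My guess is that one of the estimators in TPOT's default search space declares fit() that way (recent xgboost releases, for example, make everything after X and y keyword-only), but I have not confirmed which step triggers it. A minimal, TPOT-independent demonstration (the fit function below is a made-up stand-in, not a real estimator):

import inspect

def fit(X, y, *, sample_weight=None):  # keyword-only parameter, like many newer fit() methods
    pass

print(inspect.signature(fit))  # works: (X, y, *, sample_weight=None)
inspect.getargspec(fit)        # ValueError: Function has keyword-only parameters or annotations, ...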

Expected result

Without using sample_weight:

Generation 1 - Current best internal CV score: -0.10226660695789169

Generation 2 - Current best internal CV score: -0.10226660695789169

Generation 3 - Current best internal CV score: -0.08510081133846376

...

Generation 50 - Current best internal CV score: -0.07952325321214902

Best pipeline: AdaBoostRegressor(Nystroem(ExtraTreesRegressor(PolynomialFeatures(input_matrix, degree=2, include_bias=False, interaction_only=False), bootstrap=False, max_features=0.05, min_samples_leaf=5, min_samples_split=12, n_estimators=100), gamma=0.75, kernel=polynomial, n_components=10), learning_rate=0.01, loss=linear, n_estimators=100)

Environment

OS: Windows 10
Python 3.10.5
TPOT==0.11.7
pandas==1.5.3
numpy==1.24.2
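
Possible workaround

Swapping the getargspec-based check in tpot/operator_utils.py (line 111 in the traceback) for inspect.signature avoids the ValueError. Below is an untested sketch that re-implements set_sample_weight from what the traceback shows (the return contract, a {step}__sample_weight dict or None, is my reconstruction) and patches it in; it assumes tpot.gp_deap imports the function by name, as the bare call at gp_deap.py:416 suggests:

import inspect

import tpot.gp_deap
import tpot.operator_utils

def set_sample_weight(pipeline_steps, sample_weight=None):
    # Sketch of a signature-based replacement: collect '<step>__sample_weight'
    # entries for every pipeline step whose fit() accepts sample_weight.
    sample_weight_dict = {}
    if sample_weight is not None:
        for pname, obj in pipeline_steps:
            # inspect.signature handles keyword-only parameters and annotations,
            # unlike the long-deprecated inspect.getargspec.
            if "sample_weight" in inspect.signature(obj.fit).parameters:
                sample_weight_dict[pname + "__sample_weight"] = sample_weight
    return sample_weight_dict or None

# Patch both modules so the name already imported into gp_deap is replaced too.
tpot.operator_utils.set_sample_weight = set_sample_weight
tpot.gp_deap.set_sample_weight = set_sample_weight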