EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

TPOT underpopulates a class, but manual sklearn does not #1344

Open jecorn opened 9 months ago

jecorn commented 9 months ago

Context of the issue

I have a large and imbalanced binary classification dataset: approximately 2,000,000 negative cases and 5,000 positive cases, with 45 features. I have been running manual sklearn pipelines on this dataset without problems for a while. My manual work includes StratifiedKFold cross-validation on algorithms such as random forests, gradient boosting, MLP, and more. All of the manual work has been fine.

I recently learned of TPOT (what an awesome idea! huge thanks, devs!) and was excited to give it a try. But on the exact same dataset, I'm getting the error "The least populated class in y has only 1 members, which is less than n_splits=5." This happens after about 50-60 TPOT iterations. I'm using stratification in train_test_split, and it's just a binary classification, so I'm not sure how a split could end up underpopulated. It's also strange that this same dataset works fine manually with stratification/splitting that (so far as I understand) is identical to what TPOTClassifier uses.

I saw a few other reports of this error for both sklearn and TPOT, but they were always on multilabel classification. So I'm a bit stumped.
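For reference, the warning text quoted above is emitted by scikit-learn's StratifiedKFold whenever some class in the y it receives has fewer members than n_splits. A minimal sketch on toy data (the arrays below are made up for illustration and are not the dataset from this issue) reproduces it with a binary target:

```python
import warnings

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data (hypothetical): 20 negatives and a single positive.
X = np.zeros((21, 2))
y = np.array([0] * 20 + [1])

cv = StratifiedKFold(n_splits=5)

# Record warnings instead of printing them so they can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    folds = list(cv.split(X, y))  # split() is lazy; list() forces it to run

messages = [str(w.message) for w in caught]
for message in messages:
    print(message)
```

So the warning says something about the y that reached the splitter at that moment, which is not necessarily the y passed to fit().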

TPOT script

#!/usr/bin/env python
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold
import pandas as pd
import argparse

def get_options():
    parser = argparse.ArgumentParser()
    parser.add_argument('-f', "--file", required=True, type=str, help="input feature/label file")
    parser.add_argument("--ncpu", default=1, type=int, help="number of cpus to use")
    args = parser.parse_args()
    return args

def load_dataset(filename):
    # load the dataset as a pandas dataframe
    data = pd.read_csv(filename, sep=",")
    data = data.drop(["Chromosome","Start","End","OTseq","guide","Strand"], axis = 'columns')
    # split into feature and label elements, where the label is named "autodisco_classifier"
    X, y = data.drop(["autodisco", "autodisco_classifier"], axis = 'columns'), data["autodisco_classifier"]
    return X, y

args = get_options()
fname = args.file
X, y = load_dataset(fname)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y) # I also tried no stratification here

cv = StratifiedKFold(n_splits=5)
pipeline_optimizer = TPOTClassifier(generations=5, population_size=50, cv=cv, # I also tried just using cv=5
                                    random_state=42, verbosity=2, n_jobs=args.ncpu) # Right now I'm just using one cpu, n_jobs=1
pipeline_optimizer.fit(X_train, y_train)
print(pipeline_optimizer.score(X_test, y_test))
pipeline_optimizer.export('tpot_exported_pipeline.py')
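One way to rule out the input data before calling fit() is to count class members in the training labels and compare them with the number of folds. The helper below is a hypothetical sketch, not part of TPOT or scikit-learn, and the toy labels are invented; with the figures quoted above, roughly 4,000 positives would be expected in an 80% stratified training split.

```python
from collections import Counter


def check_class_counts(y, n_splits):
    """Return per-class counts; raise if any class has fewer members than n_splits.

    Hypothetical helper for illustration only.
    """
    counts = Counter(y)
    too_small = {label: n for label, n in counts.items() if n < n_splits}
    if too_small:
        raise ValueError(
            f"classes with fewer than {n_splits} members: {too_small}"
        )
    return dict(counts)


# Toy labels (made up): heavily imbalanced, but still 5 positives for 5 folds.
toy_y = [0] * 2000 + [1] * 5
counts = check_class_counts(toy_y, n_splits=5)
print(counts)
```

In the script above this could be called as check_class_counts(y_train, cv.get_n_splits()) just before pipeline_optimizer.fit(X_train, y_train).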

Error

Generation 1 - Current best internal CV score: -inf
/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/sklearn/model_selection/_split.py:737: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5.
  warnings.warn(
/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/sklearn/model_selection/_split.py:737: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5.
  warnings.warn(
/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/sklearn/model_selection/_split.py:737: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5.
  warnings.warn(
/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/sklearn/model_selection/_split.py:737: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5.
  warnings.warn(
/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/sklearn/model_selection/_split.py:737: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5.
  warnings.warn(
/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/sklearn/model_selection/_split.py:737: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5.
  warnings.warn(
/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/sklearn/model_selection/_split.py:737: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5.
  warnings.warn(
/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/sklearn/model_selection/_split.py:737: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5.
  warnings.warn(
/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/sklearn/model_selection/_split.py:737: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5.
  warnings.warn(
/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/sklearn/model_selection/_split.py:737: UserWarning: The least populated class in y has only 1 members, which is less than n_splits=5.
  warnings.warn(
Traceback (most recent call last):
  File "/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/tpot/base.py", line 817, in fit
    self._pop, _ = eaMuPlusLambda(
  File "/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/tpot/gp_deap.py", line 285, in eaMuPlusLambda
    per_generation_function(gen)
  File "/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/tpot/base.py", line 1183, in _check_periodic_pipeline
    self._update_top_pipeline()
  File "/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/tpot/base.py", line 935, in _update_top_pipeline
    raise RuntimeError(
RuntimeError: There was an error in the TPOT optimization process. This could be because the data was not formatted properly, or because data for a regression problem was provided to the TPOTClassifier object. Please make sure you passed the data to TPOT correctly. If you enabled PyTorch estimators, please check the data requirements in the online documentation: https://epistasislab.github.io/tpot/using/

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/data/jcorn/autodisco/ml/testing/plate2/filtered/learn/../../../../scripts/auto_tpot.py", line 34, in <module>
    pipeline_optimizer.fit(X_train, y_train)
  File "/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/tpot/base.py", line 864, in fit
    raise e
  File "/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/tpot/base.py", line 855, in fit
    self._update_top_pipeline()
  File "/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/tpot/base.py", line 935, in _update_top_pipeline
    raise RuntimeError(
RuntimeError: There was an error in the TPOT optimization process. This could be because the data was not formatted properly, or because data for a regression problem was provided to the TPOTClassifier object. Please make sure you passed the data to TPOT correctly. If you enabled PyTorch estimators, please check the data requirements in the online documentation: https://epistasislab.github.io/tpot/using/

perib commented 8 months ago

What version of tpot are you using? I was able to reproduce the issue in version 0.12.0, but not 0.12.2 (yet). I haven't nailed down exactly what the issue was, but it seems to work for me on the latest version.

jecorn commented 8 months ago

This was with tpot 0.12.1. I just grabbed tpot 0.12.2 and will update when I get a chance to try it out.

jecorn commented 8 months ago

tpot 0.12.2 can now get past the error. Thanks!

Unfortunately, now there's a different error that I think might be hard to troubleshoot. Running on a single core, tpot starts going through pipelines without trouble. But when parallelizing, one of the workers throws an exception. It might be during a later pipeline (since it doesn't happen on a single core). I'll try to do some digging.

"""
Traceback (most recent call last):
  File "/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/joblib/externals/loky/process_executor.py", line 463, in _process_worker
    r = call_item()
  File "/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/joblib/externals/loky/process_executor.py", line 291, in __call__
    return self.fn(*self.args, **self.kwargs)
  File "/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/joblib/parallel.py", line 589, in __call__
    return [func(*args, **kwargs)
  File "/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/joblib/parallel.py", line 589, in <listcomp>
    return [func(*args, **kwargs)
  File "/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/stopit/utils.py", line 145, in wrapper
    result = func(*args, **kwargs)
  File "/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/tpot/gp_deap.py", line 424, in _wrapped_cross_val_score
    cv_iter = list(cv.split(features, target, groups))
  File "/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/sklearn/model_selection/_split.py", line 808, in split
    y = check_array(y, input_name="y", ensure_2d=False, dtype=None)
  File "/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/sklearn/utils/validation.py", line 1097, in check_array
    array.flags.writeable = True
ValueError: cannot set WRITEABLE flag to True of this array
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/tpot/base.py", line 817, in fit
    self._pop, _ = eaMuPlusLambda(
  File "/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/tpot/gp_deap.py", line 232, in eaMuPlusLambda
    population[:] = toolbox.evaluate(population)
  File "/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/tpot/base.py", line 1575, in _evaluate_individuals
    tmp_result_scores = parallel(
  File "/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/joblib/parallel.py", line 1952, in __call__
    return output if self.return_generator else list(output)
  File "/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/joblib/parallel.py", line 1595, in _get_outputs
    yield from self._retrieve()
  File "/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/joblib/parallel.py", line 1699, in _retrieve
    self._raise_error_fast()
  File "/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/joblib/parallel.py", line 1734, in _raise_error_fast
    error_job.get_result(self.timeout)
  File "/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/joblib/parallel.py", line 736, in get_result
    return self._return_or_raise()
  File "/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/joblib/parallel.py", line 754, in _return_or_raise
    raise self._result
ValueError: cannot set WRITEABLE flag to True of this array

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/data/jcorn/autodisco/scripts/auto_tpot.py", line 39, in <module>
    pipeline_optimizer.fit(X, y)
  File "/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/tpot/base.py", line 864, in fit
    raise e
  File "/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/tpot/base.py", line 855, in fit
    self._update_top_pipeline()
  File "/home/cornlab/miniconda3/envs/jcorn/lib/python3.10/site-packages/tpot/base.py", line 963, in _update_top_pipeline
    raise RuntimeError(
RuntimeError: A pipeline has not yet been optimized. Please call fit() first.
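The worker traceback above fails at array.flags.writeable = True inside check_array, and only when running in parallel. One plausible explanation, which is an assumption and is not confirmed anywhere in this thread, is that joblib hands large arrays to worker processes as read-only memory maps, and NumPy refuses to flip the flag on an array whose underlying buffer is read-only. The sketch below reproduces only the NumPy behavior, using a bytes-backed array as a stand-in for such a memory map:

```python
import numpy as np

# np.frombuffer over an immutable bytes object yields a read-only array
# that does not own its memory (a stand-in for a read-only memmap).
readonly = np.frombuffer(b"\x00\x01\x02\x03", dtype=np.uint8)
print(readonly.flags.writeable)

try:
    # Same statement as the failing line in the traceback.
    readonly.flags.writeable = True
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# An explicit copy owns its memory and can be modified.
writable = readonly.copy()
writable[0] = 9
print(writable.flags.writeable)
```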

perib commented 7 months ago

I've come across this error at some point too, but it seems to be working on my machine now. IIRC it was due to a package version issue. Try updating your packages; I think my issue was with an outdated version of pandas or numpy.
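When comparing environments for a version-related problem like this, it can help to post the installed versions of the packages that appear in the tracebacks. A minimal sketch using only the standard library follows; the package list is a guess based on the tracebacks in this thread, not an official checklist.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as published on PyPI (assumed relevant here).
PACKAGES = ["tpot", "scikit-learn", "numpy", "pandas", "joblib"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(PACKAGES)
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")
```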