EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

My dataset crashed TPOT-NN #1247

Open CBrauer opened 2 years ago

CBrauer commented 2 years ago

Hey,

I am getting a crash in TPOT-NN. My environment is:

>python tpot-NN-rocket-classify.py
Operating system version.... Windows-10-10.0.22000-SP0
Python version is........... 3.8.13
pandas version is........... 1.4.2
numpy version is............ 1.21.5
tpot version is............. 0.11.7

I have put my code and dataset at: https://github.com/CBrauer/TPOT-NN-bug

The program is as follows:

import warnings
warnings.filterwarnings("ignore")
import platform
import sys
import pandas as pd
import numpy as np
import time
from IPython.display import HTML, display
pd.set_option('display.max_rows', 10)
pd.set_option('display.max_columns', 11)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

import tpot
from tpot import TPOTClassifier

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import accuracy_score
from sklearn.utils import shuffle

class Timer:
    def __init__(self):
        self.start = time.time()

    def restart(self):
        self.start = time.time()

    def get_time(self):
        end = time.time()
        m, s = divmod(end - self.start, 60)
        h, m = divmod(m, 60)
        time_str = "%02d:%02d:%02d" % (h, m, s)
        return time_str

def LoadData():

    df = pd.read_csv('rocket.csv')

    response_column = ['Altitude']
    feature_columns = ['BoxRatio', 'Thrust', 'Acceleration', 'Velocity', 'OnBalRun', 'vwapGain', 'Expect', 'Trin']
    header = feature_columns + response_column

    df_describe = df[feature_columns].describe(include='all')
    display(df_describe)

    X = df[feature_columns].values
    y = df[response_column].values.ravel()

    X_train, X_test, y_train, y_test = train_test_split(X,
                                                        y,
                                                        test_size = 0.2,
                                                        random_state = 7)
    print('Size of dataset:')
    print(' train shape... ', X_train.shape, y_train.shape)
    print(' test shape.... ', X_test.shape, y_test.shape)

    return X_train, y_train, X_test, y_test

def Main(g, p):
    X_train, y_train, X_test, y_test = LoadData()

    clf = TPOTClassifier(config_dict='TPOT NN',
                         template='Selector-Transformer-PytorchLRClassifier',
                         verbosity=2,
                         generations=g,
                         population_size=p,
                         random_state=7)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))
    clf.export('tpot_nn_demo_pipeline.py')

if __name__ == "__main__":

    print('Operating system version....', platform.platform())
    print("Python version is........... %s.%s.%s" % sys.version_info[:3])
    print('pandas version is...........', pd.__version__)
    print('numpy version is............', np.__version__)
    print('tpot version is.............', tpot.__version__)

    my_timer = Timer()

    Main(10, 10)

    elapsed = my_timer.get_time()
    print("\nTotal compute time was: %s" % elapsed)

After running a while, I get the following stack trace

Generation 1 - Current best internal CV score: -inf
Optimization Progress:   2%|█▌                                                                           | 200/10100 [05
Traceback (most recent call last):
  File "C:\anaconda3\lib\site-packages\tpot\base.py", line 816, in fit
    self._pop, _ = eaMuPlusLambda(
  File "C:\anaconda3\lib\site-packages\tpot\gp_deap.py", line 281, in eaMuPlusLambda
    per_generation_function(gen)
  File "C:\anaconda3\lib\site-packages\tpot\base.py", line 1176, in _check_periodic_pipeline
    self._update_top_pipeline()
  File "C:\anaconda3\lib\site-packages\tpot\base.py", line 924, in _update_top_pipeline
    cv_scores = cross_val_score(
  File "C:\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 509, in cross_val_score
    cv_results = cross_validate(
  File "C:\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 267, in cross_validate
    results = parallel(
  File "C:\anaconda3\lib\site-packages\joblib\parallel.py", line 1043, in __call__
    if self.dispatch_one_batch(iterator):
  File "C:\anaconda3\lib\site-packages\joblib\parallel.py", line 861, in dispatch_one_batch
    self._dispatch(tasks)
  File "C:\anaconda3\lib\site-packages\joblib\parallel.py", line 779, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "C:\anaconda3\lib\site-packages\joblib\_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "C:\anaconda3\lib\site-packages\joblib\_parallel_backends.py", line 572, in __init__
    self.results = batch()
  File "C:\anaconda3\lib\site-packages\joblib\parallel.py", line 262, in __call__
    return [func(*args, **kwargs)
  File "C:\anaconda3\lib\site-packages\joblib\parallel.py", line 262, in <listcomp>
    return [func(*args, **kwargs)
  File "C:\anaconda3\lib\site-packages\sklearn\utils\fixes.py", line 216, in __call__
    return self.function(*args, **kwargs)
  File "C:\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\anaconda3\lib\site-packages\sklearn\pipeline.py", line 390, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "C:\anaconda3\lib\site-packages\sklearn\pipeline.py", line 348, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "C:\anaconda3\lib\site-packages\joblib\memory.py", line 349, in __call__
    return self.func(*args, **kwargs)
  File "C:\anaconda3\lib\site-packages\sklearn\pipeline.py", line 893, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "C:\anaconda3\lib\site-packages\sklearn\base.py", line 855, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "C:\anaconda3\lib\site-packages\sklearn\preprocessing\_data.py", line 806, in fit
    return self.partial_fit(X, y, sample_weight)
  File "C:\anaconda3\lib\site-packages\sklearn\preprocessing\_data.py", line 841, in partial_fit
    X = self._validate_data(
  File "C:\anaconda3\lib\site-packages\sklearn\base.py", line 566, in _validate_data
    X = check_array(X, **check_params)
  File "C:\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 814, in check_array
    raise ValueError(
ValueError: Found array with 0 feature(s) (shape=(40, 0)) while a minimum of 1 is required by StandardScaler.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tpot-NN-rocket-classify.py", line 83, in <module>
    Main(100, 100)
  File "tpot-NN-rocket-classify.py", line 69, in Main
    clf.fit(X_train, y_train)
  File "C:\anaconda3\lib\site-packages\tpot\base.py", line 863, in fit
    raise e
  File "C:\anaconda3\lib\site-packages\tpot\base.py", line 854, in fit
    self._update_top_pipeline()
  File "C:\anaconda3\lib\site-packages\tpot\base.py", line 924, in _update_top_pipeline
    cv_scores = cross_val_score(
  File "C:\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 509, in cross_val_score
    cv_results = cross_validate(
  File "C:\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 267, in cross_validate
    results = parallel(
  File "C:\anaconda3\lib\site-packages\joblib\parallel.py", line 1043, in __call__
    if self.dispatch_one_batch(iterator):
  File "C:\anaconda3\lib\site-packages\joblib\parallel.py", line 861, in dispatch_one_batch
    self._dispatch(tasks)
  File "C:\anaconda3\lib\site-packages\joblib\parallel.py", line 779, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "C:\anaconda3\lib\site-packages\joblib\_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "C:\anaconda3\lib\site-packages\joblib\_parallel_backends.py", line 572, in __init__
    self.results = batch()
  File "C:\anaconda3\lib\site-packages\joblib\parallel.py", line 262, in __call__
    return [func(*args, **kwargs)
  File "C:\anaconda3\lib\site-packages\joblib\parallel.py", line 262, in <listcomp>
    return [func(*args, **kwargs)
  File "C:\anaconda3\lib\site-packages\sklearn\utils\fixes.py", line 216, in __call__
    return self.function(*args, **kwargs)
  File "C:\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\anaconda3\lib\site-packages\sklearn\pipeline.py", line 390, in fit
    Xt = self._fit(X, y, **fit_params_steps)
  File "C:\anaconda3\lib\site-packages\sklearn\pipeline.py", line 348, in _fit
    X, fitted_transformer = fit_transform_one_cached(
  File "C:\anaconda3\lib\site-packages\joblib\memory.py", line 349, in __call__
    return self.func(*args, **kwargs)
  File "C:\anaconda3\lib\site-packages\sklearn\pipeline.py", line 893, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
  File "C:\anaconda3\lib\site-packages\sklearn\base.py", line 855, in fit_transform
    return self.fit(X, y, **fit_params).transform(X)
  File "C:\anaconda3\lib\site-packages\sklearn\preprocessing\_data.py", line 806, in fit
    return self.partial_fit(X, y, sample_weight)
  File "C:\anaconda3\lib\site-packages\sklearn\preprocessing\_data.py", line 841, in partial_fit
    X = self._validate_data(
  File "C:\anaconda3\lib\site-packages\sklearn\base.py", line 566, in _validate_data
    X = check_array(X, **check_params)
  File "C:\anaconda3\lib\site-packages\sklearn\utils\validation.py", line 814, in check_array
    raise ValueError(
ValueError: Found array with 0 feature(s) (shape=(40, 0)) while a minimum of 1 is required by StandardScaler.

H:\HedgeTools\ML_Model_Generation\TPOT>pause
Press any key to continue . . .

I hope you guys can help me with this problem.

Charles

JDRomano2 commented 2 years ago

Do you see the same issue with non-NN TPOT? E.g., if you omit config_dict='TPOT NN'?

CBrauer commented 2 years ago

OK, is this what you wanted?

def Main(g, p):
    X_train, y_train, X_test, y_test = LoadData()

    # clf = TPOTClassifier(config_dict='TPOT NN',
    clf = TPOTClassifier(template='Selector-Transformer-PytorchLRClassifier',
                         verbosity=2,
                         generations=g,
                         population_size=p,
                         random_state=7)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))
    clf.export('tpot_nn_demo_pipeline.py')

Now I get:

H:\HedgeTools\ML_Model_Generation\TPOT-NN>python tpot-NN-rocket-classify.py
Operating system version.... Windows-10-10.0.22000-SP0
Python version is........... 3.8.13
pandas version is........... 1.4.2
numpy version is............ 1.21.5
tpot version is............. 0.11.7
           BoxRatio        Thrust  Acceleration      Velocity      OnBalRun      vwapGain        Expect          Trin
count  60000.000000  60000.000000  60000.000000  60000.000000  60000.000000  60000.000000  60000.000000  60000.000000
mean       2.061707      1.677448      1.935544      0.635225      2.412940      0.984372     -3.026383      0.834455
std        4.491026      3.056146      1.956287      0.658155      1.602910      0.932878     10.023122      0.284409
min        0.034120      0.000383      0.000112      0.000839      0.048550      0.100003    -50.341116      0.280000
25%        0.344533      0.228764      0.566531      0.155102      1.463102      0.379476     -6.661925      0.600000
50%        0.693704      0.713193      1.606062      0.460673      2.086361      0.730599     -2.334339      0.800000
75%        1.619198      1.790019      2.705824      0.903497      2.905308      1.275189      1.273494      1.040000
max       74.699990     40.539430     27.995832      7.809622     22.693728     11.762206     51.561442      4.540000
Size of dataset:
 train shape...  (48000, 8) (48000,)
 test shape....  (12000, 8) (12000,)
Traceback (most recent call last):
  File "C:\anaconda3\lib\site-packages\tpot\base.py", line 496, in _add_operators
    operator = next(
StopIteration

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "tpot-NN-rocket-classify.py", line 84, in <module>
    Main(10, 10)
  File "tpot-NN-rocket-classify.py", line 70, in Main
    clf.fit(X_train, y_train)
  File "C:\anaconda3\lib\site-packages\tpot\base.py", line 725, in fit
    self._fit_init()
  File "C:\anaconda3\lib\site-packages\tpot\base.py", line 618, in _fit_init
    self._setup_pset()
  File "C:\anaconda3\lib\site-packages\tpot\base.py", line 437, in _setup_pset
    self._add_operators()
  File "C:\anaconda3\lib\site-packages\tpot\base.py", line 500, in _add_operators
    raise ValueError(
ValueError: An error occured while attempting to read the specified template. Please check a step named PytorchLRClassifier

H:\HedgeTools\ML_Model_Generation\TPOT-NN>pause
Press any key to continue . . .
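
The `StopIteration` followed by a `ValueError` is what one would expect if the template names a step that the active operator set does not contain: with `config_dict='TPOT NN'` commented out, the default configuration presumably has no `PytorchLRClassifier`. Below is a minimal sketch of that failure mode, assuming the step is looked up by name with `next()`. The function `resolve_template`, the operator lists, and the generic step names are hypothetical stand-ins for illustration, not TPOT's actual code.

```python
# Simplified stand-in for a template step lookup; NOT TPOT's implementation.
# The operator names in these lists are illustrative assumptions.
DEFAULT_OPERATORS = ["SelectPercentile", "StandardScaler", "KNeighborsClassifier"]
NN_OPERATORS = DEFAULT_OPERATORS + ["PytorchLRClassifier"]
GENERIC_STEPS = {"Selector", "Transformer", "Classifier"}


def resolve_template(template, operators):
    """Return the explicitly named steps of a template, validating each one."""
    resolved = []
    for step in template.split("-"):
        if step in GENERIC_STEPS:
            continue  # generic slots are left for the search to fill
        try:
            # next() raises StopIteration when no operator matches the name
            operator = next(op for op in operators if op == step)
        except StopIteration:
            # Raising here chains the two exceptions, as in the traceback above
            raise ValueError(
                "An error occured while attempting to read the specified "
                "template. Please check a step named {}".format(step)
            )
        resolved.append(operator)
    return resolved


template = "Selector-Transformer-PytorchLRClassifier"

# The step is found when the NN operators are available ...
print(resolve_template(template, NN_OPERATORS))

# ... and the lookup fails when only the default operators are.
try:
    resolve_template(template, DEFAULT_OPERATORS)
except ValueError as exc:
    print("ValueError:", exc)
```

If this reading is right, a template that names `PytorchLRClassifier` only makes sense together with `config_dict='TPOT NN'`, so dropping the config dictionary means dropping the template as well.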

CBrauer commented 2 years ago

I suppose you meant to delete the first two lines.
If I run:


def Main(g, p):
    X_train, y_train, X_test, y_test = LoadData()

    # clf = TPOTClassifier(config_dict='TPOT NN',
    #                      template='Selector-Transformer-PytorchLRClassifier',

    clf = TPOTClassifier(verbosity=2,
                         generations=g,
                         population_size=p,
                         random_state=7)
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))
    clf.export('tpot_nn_demo_pipeline.py')

I get the following results:

H:\HedgeTools\ML_Model_Generation\TPOT-NN>python tpot-NN-rocket-classify.py
Operating system version.... Windows-10-10.0.22000-SP0
Python version is........... 3.8.13
pandas version is........... 1.4.2
numpy version is............ 1.21.5
tpot version is............. 0.11.7
           BoxRatio        Thrust  Acceleration      Velocity      OnBalRun      vwapGain        Expect          Trin
count  60000.000000  60000.000000  60000.000000  60000.000000  60000.000000  60000.000000  60000.000000  60000.000000
mean       2.061707      1.677448      1.935544      0.635225      2.412940      0.984372     -3.026383      0.834455
std        4.491026      3.056146      1.956287      0.658155      1.602910      0.932878     10.023122      0.284409
min        0.034120      0.000383      0.000112      0.000839      0.048550      0.100003    -50.341116      0.280000
25%        0.344533      0.228764      0.566531      0.155102      1.463102      0.379476     -6.661925      0.600000
50%        0.693704      0.713193      1.606062      0.460673      2.086361      0.730599     -2.334339      0.800000
75%        1.619198      1.790019      2.705824      0.903497      2.905308      1.275189      1.273494      1.040000
max       74.699990     40.539430     27.995832      7.809622     22.693728     11.762206     51.561442      4.540000
Size of dataset:
 train shape...  (48000, 8) (48000,)
 test shape....  (12000, 8) (12000,)

Generation 1 - Current best internal CV score: 0.9716458333333333

Generation 2 - Current best internal CV score: 0.9843125

Generation 3 - Current best internal CV score: 0.9847291666666667

Generation 4 - Current best internal CV score: 0.9859375

Generation 5 - Current best internal CV score: 0.9871041666666667

Generation 6 - Current best internal CV score: 0.9876875

Generation 7 - Current best internal CV score: 0.9876875

Generation 8 - Current best internal CV score: 0.9892291666666667

Generation 9 - Current best internal CV score: 0.9909791666666667

Generation 10 - Current best internal CV score: 0.9909791666666667

Best pipeline: KNeighborsClassifier(DecisionTreeClassifier(RandomForestClassifier(RFE(CombineDFs(input_matrix, input_matrix), criterion=gini, max_features=0.6500000000000001, n_estimators=100, step=0.1), bootstrap=False, criterion=gini, max_features=0.1, min_samples_leaf=3, min_samples_split=20, n_estimators=100), criterion=gini, max_depth=2, min_samples_leaf=9, min_samples_split=9), n_neighbors=6, p=2, weights=distance)
0.9926666666666667

Total compute time was: 01:23:05

I've never had good results with neural networks anyway. And yes, I've tried TabNet. TPOT beats TabNet every time.

Charles

JDRomano2 commented 2 years ago

It seems to be an issue when templates are used in conjunction with config_dict='TPOT NN'. When I run your code without a template it runs fine, and the error persists when I swap out your data for a different dataset.
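
The `ValueError` at the bottom of the first traceback can be reproduced outside TPOT, which may help separate the symptom from the cause. This is a minimal sketch, assuming the selector step kept no columns, so the scaler receives a `(40, 0)` array. The thread does not say which selector was chosen; `SelectFwe` with a very strict `alpha` and the random data are illustrative assumptions.

```python
import warnings

import numpy as np
from sklearn.feature_selection import SelectFwe, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 40 rows of pure noise, balanced labels.
rng = np.random.default_rng(7)
X = rng.normal(size=(40, 8))
y = np.array([0, 1] * 20)

# A threshold so strict that no noise feature can pass the test.
strict_alpha = 1e-12

with warnings.catch_warnings():
    # The selector warns that no features were selected; silence it here.
    warnings.simplefilter("ignore")
    selector = SelectFwe(score_func=f_classif, alpha=strict_alpha)
    X_selected = selector.fit(X, y).transform(X)

print("Shape after selection:", X_selected.shape)

# Selector followed by a scaler, mirroring the first two template steps.
pipeline = make_pipeline(
    SelectFwe(score_func=f_classif, alpha=strict_alpha),
    StandardScaler(),
)

error_message = None
try:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        pipeline.fit(X, y)
except ValueError as exc:
    # StandardScaler rejects an input with zero columns.
    error_message = str(exc)

print(error_message)
```

This only shows how the message can arise. It does not explain why the search ends up evaluating such a pipeline when a template is combined with `config_dict='TPOT NN'`.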

I'll need to do some digging to figure out exactly what is going on, but there seem to be 2 possible contributing factors:

CBrauer commented 2 years ago

Hey,

Thanks for the update.

Charles

