EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

Error: got an unexpected keyword argument 'max_depth' #458

Closed chivalry123 closed 7 years ago

chivalry123 commented 7 years ago

I am just trying to run the following code, where x_pca_all_imputed and y_train_log are pandas DataFrames:

tpot = TPOTRegressor(generations=5, population_size=200, verbosity=4,n_jobs=16,)
tpot.fit(x_pca_all_imputed[:len_training], y_train_log)

However, I got the errors/warnings below. The training is still running, though...

29 operators have been imported by TPOT.
_pre_test decorator: _generate: num_test=0 k should be >=0, <= n_features; got 74.Use k='all' to return all features.
_pre_test decorator: _generate: num_test=1 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required.
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=1 k should be >=0, <= n_features; got 64.Use k='all' to return all features.
_pre_test decorator: _generate: num_test=0 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required.
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 k should be >=0, <= n_features; got 74.Use k='all' to return all features.
_pre_test decorator: _generate: num_test=1 [21:52:44] src/tree/updater_colmaker.cc:161: Check failed: (n) > (0) colsample_bytree=1 is too small that no feature can be included
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 k should be >=0, <= n_features; got 86.Use k='all' to return all features.
_pre_test decorator: _generate: num_test=0 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required.
_pre_test decorator: _generate: num_test=0 Unsupported set of arguments: The combination of penalty='l2' and loss='epsilon_insensitive' are not supported when dual=False, Parameters: penalty='l2', loss='epsilon_insensitive', dual=False
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=1 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required.
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 k should be >=0, <= n_features; got 19.Use k='all' to return all features.
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 k should be >=0, <= n_features; got 61.Use k='all' to return all features.
_pre_test decorator: _generate: num_test=0 k should be >=0, <= n_features; got 92.Use k='all' to return all features.
_pre_test decorator: _generate: num_test=0 Unsupported set of arguments: The combination of penalty='l2' and loss='epsilon_insensitive' are not supported when dual=False, Parameters: penalty='l2', loss='epsilon_insensitive', dual=False
_pre_test decorator: _generate: num_test=1 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=1 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required.
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 [21:52:51] src/tree/updater_colmaker.cc:161: Check failed: (n) > (0) colsample_bytree=1 is too small that no feature can be included
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=1 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=1 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required.
_pre_test decorator: _generate: num_test=0 Expected n_neighbors <= n_samples,  but n_samples = 50, n_neighbors = 100
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
rhiever commented 7 years ago

Try changing your code to

tpot = TPOTRegressor(generations=5, population_size=200, verbosity=3,n_jobs=16,)
tpot.fit(x_pca_all_imputed[:len_training].values, y_train_log.values)

Importantly, adding the .values should convert the pandas DataFrames to NumPy matrices. You can learn more about the data representation in scikit-learn (and TPOT) here.
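As a minimal illustration (not from the original thread) of what .values returns:

import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})
print(type(df))         # <class 'pandas.core.frame.DataFrame'>
print(type(df.values))  # <class 'numpy.ndarray'> -- the representation TPOT expects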

rhiever commented 7 years ago

Hi @chivalry123, did that change work for you?

mglowacki100 commented 7 years ago

I have a very similar issue regardless of whether the data is pandas or numpy (the same messages with verbosity=3). For the Boston regression example training runs, but I have a dataset where it hangs at the first pipeline...

/usr/local/lib/python3.5/dist-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
28 operators have been imported by TPOT.
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=1 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=2 Unsupported set of arguments: The combination of penalty='l2' and loss='epsilon_insensitive' are not supported when dual=False, Parameters: penalty='l2', loss='epsilon_insensitive', dual=False
_pre_test decorator: _generate: num_test=0 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required.

Optimization Progress:   0%|          | 0/20 [00:00<?, ?pipeline/s]
import numpy as np
import pandas as pd
from tpot import TPOTRegressor

train = pd.read_csv('train_A.csv')
test = pd.read_csv('test_A.csv')
y_train = train['y'].values
X_train = train.drop('y', axis=1)
X_train = X_train.values
X_test = test.values
tpot = TPOTRegressor(generations=5, population_size=20, n_jobs=-1, verbosity=3, 
                     max_time_mins=2, max_eval_time_mins=1, random_state=42)

tpot.fit(X_train, y_train)
y_predict = tpot.predict(X_test)
y_predict.to_csv('tpot1_prediction.csv')
#print(tpot.score(X_test, y_test))
print(tpot.score(X_train, y_train))
tpot.export('tpot_merc_pipeline.py')

my_dataset.tar.gz


rhiever commented 7 years ago

Does it freeze when you set n_jobs=1?

mglowacki100 commented 7 years ago

The result is slightly different. The CPU is still occupied as before, but a few pipelines were optimized before the 'freeze'. Note that the max_time_mins=2 and max_eval_time_mins=1 limits are ignored or enforced with a large margin.

Version 0.8.2 of tpot is outdated. Version 0.8.3 was released 12 minutes ago.
/usr/local/lib/python3.5/dist-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
28 operators have been imported by TPOT.
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=1 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=2 Unsupported set of arguments: The combination of penalty='l2' and loss='epsilon_insensitive' are not supported when dual=False, Parameters: penalty='l2', loss='epsilon_insensitive', dual=False
_pre_test decorator: _generate: num_test=0 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required.

Optimization Progress:   0%|          | 0/20 [00:00<?, ?pipeline/s]
Optimization Progress:   5%|▌         | 1/20 [04:23<1:23:27, 263.56s/pipeline]

Skipped pipeline #4 due to time out. Continuing to the next pipeline.

Optimization Progress:  20%|██        | 4/20 [04:23<1:10:17, 263.56s/pipeline]
Optimization Progress:  25%|██▌       | 5/20 [05:37<47:30, 190.03s/pipeline]  

Skipped pipeline #8 due to time out. Continuing to the next pipeline.

Optimization Progress:  40%|████      | 8/20 [05:37<38:00, 190.03s/pipeline]
Optimization Progress:  45%|████▌     | 9/20 [06:07<24:47, 135.25s/pipeline]
rhiever commented 7 years ago

Sometimes threads freeze and won't even respond to being interrupted. Do you have xgboost installed by chance? That's a common culprit.

mglowacki100 commented 7 years ago

Yes, I have xgboost installed. I updated xgboost today; I'll check whether this solves/mitigates the problem.

mglowacki100 commented 7 years ago

With the newest xgboost, I got the error below. I increased max_time_mins=20 and max_eval_time_mins=3, but the error appears after 1-2 minutes, so this is probably a problem with xgboost. Subsequently I updated tpot to 0.8.3, but in that case it hangs after a few pipelines.

runfile('/home/mglowacki/Desktop/Mercedes/tpot_regression_merc.py', wdir='/home/mglowacki/Desktop/Mercedes')
Reloaded modules: xgboost.rabit, xgboost.plotting, xgboost.training, xgboost.callback, xgboost.libpath, xgboost, xgboost.core, xgboost.sklearn, xgboost.compat
Version 0.8.2 of tpot is outdated. Version 0.8.3 was released 1 day ago.
28 operators have been imported by TPOT.
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=1 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=2 Unsupported set of arguments: The combination of penalty='l2' and loss='epsilon_insensitive' are not supported when dual=False, Parameters: penalty='l2', loss='epsilon_insensitive', dual=False
_pre_test decorator: _generate: num_test=0 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required.

Optimization Progress:   0%|          | 0/20 [00:00<?, ?pipeline/s]

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 699, in runfile
    execfile(filename, namespace)
  File "/usr/lib/python3/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 88, in execfile
    exec(compile(open(filename, 'rb').read(), filename, 'exec'), namespace)
  File "/home/mglowacki/Desktop/Mercedes/tpot_regression_merc.py", line 24, in <module>
    y_predict = tpot.predict(X_test)
  File "/usr/local/lib/python3.5/dist-packages/tpot/base.py", line 616, in predict
    raise RuntimeError('A pipeline has not yet been optimized. Please call fit() first.')
RuntimeError: A pipeline has not yet been optimized. Please call fit() first.
>>> 
rhiever commented 7 years ago

What's your OS? Also, what's your xgboost version?

import xgboost
print(xgboost.__version__)

I assume this is still with n_jobs=1?

Have you tried running without xgboost installed? That could help narrow down whether it's xgboost causing the issue or not.
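As a sketch, you could also exclude xgboost through a custom config_dict instead of uninstalling it. This assumes the default regressor configuration is importable as tpot.config.regressor_config_dict, which holds in recent TPOT versions but may differ in 0.8.x:

from tpot import TPOTRegressor
from tpot.config import regressor_config_dict  # assumed import path; may vary by version

# Copy the default config and drop the xgboost operator, if present.
no_xgb_config = dict(regressor_config_dict)
no_xgb_config.pop('xgboost.XGBRegressor', None)

tpot = TPOTRegressor(generations=5, population_size=20, n_jobs=1,
                     verbosity=3, config_dict=no_xgb_config)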

mglowacki100 commented 7 years ago

With xgboost removed, one CPU sits at 100% for more than 20 minutes without either erroring out or optimizing a pipeline:

runfile('/home/mglowacki/Desktop/Mercedes/tpot_regression_merc.py', wdir='/home/mglowacki/Desktop/Mercedes')
Reloaded modules: __mp_main__
0.8.3
Warning: xgboost.XGBRegressor is not available and will not be used by TPOT.
27 operators have been imported by TPOT.
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 Unsupported set of arguments: The combination of penalty='l2' and loss='epsilon_insensitive' are not supported when dual=False, Parameters: penalty='l2', loss='epsilon_insensitive', dual=False
_pre_test decorator: _generate: num_test=0 Unsupported set of arguments: The combination of penalty='l2' and loss='epsilon_insensitive' are not supported when dual=False, Parameters: penalty='l2', loss='epsilon_insensitive', dual=False

Optimization Progress:   0%|          | 0/20 [00:00<?, ?pipeline/s]
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from tpot import TPOTRegressor
import tpot
print(tpot.__version__)

train = pd.read_csv('train_A.csv')
test = pd.read_csv('test_A.csv')
y_train = train['y'].values
X_train = train.drop('y', axis=1)
X_train = X_train.values
X_test = test.values

tpot = TPOTRegressor(generations=5, population_size=20, n_jobs=1, verbosity=3, 
                     max_time_mins=20, max_eval_time_mins=3, random_state=42)

tpot.fit(X_train, y_train)
y_predict = tpot.predict(X_test)
y_predict.to_csv('tpot1_prediction.csv')
#print(tpot.score(X_test, y_test))

print(tpot.score(X_train, y_train))
tpot.export('tpot_merc_pipeline.py')

My data set: my_dataset.tar.gz

rhiever commented 7 years ago

Thank you for posting your code and data. I'm able to reproduce your issue on my end as well. Can you please confirm that the following code, with TPOT using the TPOT light configuration, works on your end?

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from tpot import TPOTRegressor
import tpot
print(tpot.__version__)

train = pd.read_csv('train_A.csv')
test = pd.read_csv('test_A.csv')
y_train = train['y'].values
X_train = train.drop('y', axis=1)
X_train = X_train.values
X_test = test.values

tpot = TPOTRegressor(generations=5, population_size=20, n_jobs=1, verbosity=3, 
                     max_time_mins=20, max_eval_time_mins=3, random_state=42,
                     config_dict='TPOT light')

tpot.fit(X_train, y_train)
y_predict = tpot.predict(X_test)
# this line should raise an error because y_predict doesn't have a to_csv function
y_predict.to_csv('tpot1_prediction.csv') 
#print(tpot.score(X_test, y_test))

print(tpot.score(X_train, y_train))
tpot.export('tpot_merc_pipeline.py')
mglowacki100 commented 7 years ago

Yes, it works with the 'TPOT light' config. I've noticed a mistake on my side, not relevant to this issue: y_predict.to_csv('tpot1_prediction.csv') should be np.savetxt("tpot1_prediction.csv", y_predict, delimiter=",").
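For completeness, a minimal sketch of the corrected save step (tpot and X_test come from the script above):

import numpy as np

# tpot.predict returns a NumPy array, not a DataFrame, so save it with numpy.
y_predict = tpot.predict(X_test)
np.savetxt("tpot1_prediction.csv", y_predict, delimiter=",")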

rhiever commented 7 years ago

OK. So it seems that certain (or just one) sklearn operators freeze on your dataset, and they must be operators that aren't in the TPOT light configuration. You can check which operators are in the default config. The first culprits I suspect are:

Do all of those work on your dataset?

mglowacki100 commented 7 years ago

I've checked them and they all seem to work (a few pipelines optimized) with preprocessors disabled. I suspect that polynomial features are the cause of the problem (the initial dataset has about 300 features, so with polynomials we get roughly an additional 4500 features; btw, we have only 4209 data points, so this could also be a culprit). With polynomials commented out, even n_jobs=-1 doesn't seem to be a problem.

    'sklearn.preprocessing.PolynomialFeatures': {
        'degree': [2],
        'include_bias': [False],
        'interaction_only': [False]
    },
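A quick illustrative check (a sketch, not from the original report) of the blow-up these settings cause on a ~300-feature input:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.random.rand(10, 300)  # 10 rows, 300 features, roughly as described above
poly = PolynomialFeatures(degree=2, include_bias=False, interaction_only=False)
# 300 linear terms + 45150 quadratic terms (300 squares + 44850 pairwise products)
print(poly.fit_transform(X).shape)  # (10, 45450)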

Btw, the warnings about max_depth are generated by AdaBoost.
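A minimal reproduction of where that warning likely comes from (an assumption based on the note above; max_depth belongs on the base estimator, e.g. a DecisionTreeRegressor, not on AdaBoostRegressor itself):

from sklearn.ensemble import AdaBoostRegressor

try:
    AdaBoostRegressor(max_depth=3)  # max_depth is not an AdaBoostRegressor argument
except TypeError as e:
    print(e)  # __init__() got an unexpected keyword argument 'max_depth'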

rhiever commented 7 years ago

That makes sense. When you fit PolynomialFeatures, does it use up all of your system's memory?

mglowacki100 commented 7 years ago

No, it uses about 15 GB of 64. I've decreased the number of features to 20 and it runs smoothly. Most of my features have a lot of 0s and 1s; it would be nice to have a factorization machine as an operator.

rhiever commented 7 years ago

Closing this issue for now. Please feel free to re-open if you have any more questions or comments.

jonathanng commented 7 years ago

I'm running xgboost version 0.6, and I'm getting a bunch of error messages.

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    train_df.ix[:, train_df.columns != 'logerror'],
    train_df['logerror'],
    train_size  = 0.75,
    test_size   = 0.25
)

pipeline_optimizer = tpot.TPOTRegressor(
    n_jobs        = -1,
    max_time_mins = 60 * 1,
    warm_start    = True,
    verbosity     = 100
)

pipeline_optimizer.fit(X_train.values, y_train.values)
pipeline_optimizer.export('tpot_exported_pipeline.py')
print(pipeline_optimizer.score(X_test, y_test))
28 operators have been imported by TPOT.
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 Unsupported set of arguments: The combination of penalty='l2' and loss='epsilon_insensitive' are not supported when dual=False, Parameters: penalty='l2', loss='epsilon_insensitive', dual=False
_pre_test decorator: _generate: num_test=0 l2 was provided as affinity. Ward can only work with euclidean distances.
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 Unsupported set of arguments: The combination of penalty='l2' and loss='epsilon_insensitive' are not supported when dual=False, Parameters: penalty='l2', loss='epsilon_insensitive', dual=False
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=1 Unsupported set of arguments: The combination of penalty='l2' and loss='epsilon_insensitive' are not supported when dual=False, Parameters: penalty='l2', loss='epsilon_insensitive', dual=False
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=1 Automatic alpha grid generation is not supported for l1_ratio=0. Please supply a grid by providing your estimator with the appropriate `alphas=` argument.
_pre_test decorator: _generate: num_test=2 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=3 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 Expected n_neighbors <= n_samples,  but n_samples = 50, n_neighbors = 98
_pre_test decorator: _generate: num_test=0 X contains negative values.
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required.
_pre_test decorator: _generate: num_test=1 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required.
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=1 manhattan was provided as affinity. Ward can only work with euclidean distances.
_pre_test decorator: _generate: num_test=2 Expected n_neighbors <= n_samples,  but n_samples = 50, n_neighbors = 89
_pre_test decorator: _generate: num_test=0 Unsupported set of arguments: The combination of penalty='l2' and loss='epsilon_insensitive' are not supported when dual=False, Parameters: penalty='l2', loss='epsilon_insensitive', dual=False
_pre_test decorator: _generate: num_test=0 precomputed was provided as affinity. Ward can only work with euclidean distances.

Should I be alarmed?

rhiever commented 7 years ago

No, you shouldn't be concerned about that output. That's TPOT pre-testing pipelines and throwing out bad pipelines before fully evaluating them. I generally recommend using verbosity=2, as verbosity=3 is going to have a ton of extra output that probably won't be useful for you (except for the Pareto front scores, maybe).