Closed chivalry123 closed 7 years ago
Try changing your code to
tpot = TPOTRegressor(generations=5, population_size=200, verbosity=3, n_jobs=16)
tpot.fit(x_pca_all_imputed[:len_training].values, y_train_log.values)
Importantly, adding .values should convert the pandas DataFrames to NumPy arrays. You can learn more about the data representation in scikit-learn (and TPOT) here.
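As a small illustration of that conversion (a sketch with a made-up two-column frame, not the data from this issue), .values returns the underlying NumPy array. The names features and target below are hypothetical stand-ins for x_pca_all_imputed and y_train_log.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for x_pca_all_imputed and y_train_log.
features = pd.DataFrame({"pc1": [0.1, 0.2, 0.3], "pc2": [1.0, 2.0, 3.0]})
target = pd.Series([10.0, 20.0, 30.0], name="y")

X = features.values  # 2-D ndarray, shape (n_samples, n_features)
y = target.values    # 1-D ndarray, shape (n_samples,)

print(type(X).__name__, X.shape)
print(type(y).__name__, y.shape)
```

Slicing the DataFrame first and calling .values afterwards, as in the snippet above, keeps the row selection in pandas and hands only the plain array to fit().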
Hi @chivalry123, did that change work for you?
I have a very similar issue regardless of whether the data is pandas or NumPy (the same messages with verbosity=3). Training runs for the Boston regression example, but I have a dataset where it hangs at the first pipeline...
/usr/local/lib/python3.5/dist-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
28 operators have been imported by TPOT.
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=1 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=2 Unsupported set of arguments: The combination of penalty='l2' and loss='epsilon_insensitive' are not supported when dual=False, Parameters: penalty='l2', loss='epsilon_insensitive', dual=False
_pre_test decorator: _generate: num_test=0 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required.
Optimization Progress: 0%| | 0/20 [00:00<?, ?pipeline/s]
import numpy as np
import pandas as pd
from tpot import TPOTRegressor
train = pd.read_csv('train_A.csv')
test = pd.read_csv('test_A.csv')
y_train = train['y'].values
X_train = train.drop('y', axis=1)
X_train = X_train.values
X_test = test.values
tpot = TPOTRegressor(generations=5, population_size=20, n_jobs=-1, verbosity=3,
max_time_mins=2, max_eval_time_mins=1, random_state=42)
tpot.fit(X_train, y_train)
y_predict = tpot.predict(X_test)
y_predict.to_csv('tpot1_prediction.csv')
#print(tpot.score(X_test, y_test))
print(tpot.score(X_train, y_train))
tpot.export('tpot_merc_pipeline.py')
Does it freeze when you set n_jobs=1?
The result is slightly different. The CPU is still occupied as before, but a few pipelines were optimized before the 'freeze'.
Note that the limits max_time_mins=2 and max_eval_time_mins=1 are ignored or enforced only with a large margin.
Version 0.8.2 of tpot is outdated. Version 0.8.3 was released 12 minutes ago.
/usr/local/lib/python3.5/dist-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
"This module will be removed in 0.20.", DeprecationWarning)
28 operators have been imported by TPOT.
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=1 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=2 Unsupported set of arguments: The combination of penalty='l2' and loss='epsilon_insensitive' are not supported when dual=False, Parameters: penalty='l2', loss='epsilon_insensitive', dual=False
_pre_test decorator: _generate: num_test=0 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required.
Optimization Progress: 0%| | 0/20 [00:00<?, ?pipeline/s]
Optimization Progress: 5%|▌ | 1/20 [04:23<1:23:27, 263.56s/pipeline]
Skipped pipeline #4 due to time out. Continuing to the next pipeline.
Optimization Progress: 20%|██ | 4/20 [04:23<1:10:17, 263.56s/pipeline]
Optimization Progress: 25%|██▌ | 5/20 [05:37<47:30, 190.03s/pipeline]
Skipped pipeline #8 due to time out. Continuing to the next pipeline.
Optimization Progress: 40%|████ | 8/20 [05:37<38:00, 190.03s/pipeline]
Optimization Progress: 45%|████▌ | 9/20 [06:07<24:47, 135.25s/pipeline]
Sometimes threads freeze and won't even respond to being interrupted. Do you have xgboost installed by chance? That's a common culprit.
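One possible way to keep a frozen run from blocking the session is to launch the script in a child process with a hard wall-clock limit. The following is only a sketch using the standard library; the helper name run_with_hard_timeout is made up for illustration, and the sleeping child stands in for a hung script. Worker processes spawned by the child (for example with n_jobs > 1) may need separate cleanup.

```python
import subprocess
import sys

def run_with_hard_timeout(argv, seconds):
    """Run a command in a child process; return True if it finished in time.

    subprocess.run kills the child when the timeout expires, so a hung fit
    cannot keep the parent session blocked.
    """
    try:
        subprocess.run(argv, timeout=seconds, check=False)
    except subprocess.TimeoutExpired:
        return False
    return True

if __name__ == "__main__":
    # Stand-in for running the real script: a child that sleeps too long.
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_hard_timeout(hung, seconds=1))
```

For the real case, the argv list would be something like [sys.executable, "tpot_regression_merc.py"] with a limit somewhat above max_time_mins.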
Yes, I have xgboost installed. I updated xgboost today; I'll check whether this solves or mitigates the problem.
With the newest xgboost, I got the error below. I increased the limits to max_time_mins=20, max_eval_time_mins=3, but the error appears after 1-2 minutes, so this is probably a problem with xgboost. Subsequently I updated tpot to 0.8.3, but in that case it hangs after a few pipelines.
runfile('/home/mglowacki/Desktop/Mercedes/tpot_regression_merc.py', wdir='/home/mglowacki/Desktop/Mercedes')
Reloaded modules: xgboost.rabit, xgboost.plotting, xgboost.training, xgboost.callback, xgboost.libpath, xgboost, xgboost.core, xgboost.sklearn, xgboost.compat
Version 0.8.2 of tpot is outdated. Version 0.8.3 was released 1 day ago.
28 operators have been imported by TPOT.
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=1 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=2 Unsupported set of arguments: The combination of penalty='l2' and loss='epsilon_insensitive' are not supported when dual=False, Parameters: penalty='l2', loss='epsilon_insensitive', dual=False
_pre_test decorator: _generate: num_test=0 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required.
Optimization Progress: 0%| | 0/20 [00:00<?, ?pipeline/s]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 699, in runfile
execfile(filename, namespace)
File "/usr/lib/python3/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 88, in execfile
exec(compile(open(filename, 'rb').read(), filename, 'exec'), namespace)
File "/home/mglowacki/Desktop/Mercedes/tpot_regression_merc.py", line 24, in <module>
y_predict = tpot.predict(X_test)
File "/usr/local/lib/python3.5/dist-packages/tpot/base.py", line 616, in predict
raise RuntimeError('A pipeline has not yet been optimized. Please call fit() first.')
RuntimeError: A pipeline has not yet been optimized. Please call fit() first.
>>>
What's your OS? Also, what's your xgboost version?
import xgboost
print(xgboost.__version__)
I assume this is still with n_jobs=1?
Have you tried running without xgboost installed? That could help narrow down whether it's xgboost causing the issue or not.
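When comparing runs with and without xgboost, it can help to print the installed versions without importing the packages themselves. A minimal sketch, assuming Python 3.8+ for importlib.metadata; the helper name installed_version is made up for illustration.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# Distribution names as published on PyPI.
for name in ("tpot", "xgboost", "scikit-learn"):
    print(name, installed_version(name))
```

A None for xgboost confirms that TPOT is running without it, which matches the "is not available and will not be used" warning TPOT prints in that situation.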
With xgboost removed, one CPU sits at 100% for more than 20 minutes without either stopping or finishing a single pipeline:
runfile('/home/mglowacki/Desktop/Mercedes/tpot_regression_merc.py', wdir='/home/mglowacki/Desktop/Mercedes')
Reloaded modules: __mp_main__
0.8.3
Warning: xgboost.XGBRegressor is not available and will not be used by TPOT.
27 operators have been imported by TPOT.
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 Unsupported set of arguments: The combination of penalty='l2' and loss='epsilon_insensitive' are not supported when dual=False, Parameters: penalty='l2', loss='epsilon_insensitive', dual=False
_pre_test decorator: _generate: num_test=0 Unsupported set of arguments: The combination of penalty='l2' and loss='epsilon_insensitive' are not supported when dual=False, Parameters: penalty='l2', loss='epsilon_insensitive', dual=False
Optimization Progress: 0%| | 0/20 [00:00<?, ?pipeline/s]
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from tpot import TPOTRegressor
import tpot
print(tpot.__version__)
train = pd.read_csv('train_A.csv')
test = pd.read_csv('test_A.csv')
y_train = train['y'].values
X_train = train.drop('y', axis=1)
X_train = X_train.values
X_test = test.values
tpot = TPOTRegressor(generations=5, population_size=20, n_jobs=1, verbosity=3,
max_time_mins=20, max_eval_time_mins=3, random_state=42)
tpot.fit(X_train, y_train)
y_predict = tpot.predict(X_test)
y_predict.to_csv('tpot1_prediction.csv')
#print(tpot.score(X_test, y_test))
print(tpot.score(X_train, y_train))
tpot.export('tpot_merc_pipeline.py')
My data set: my_dataset.tar.gz
Thank you for posting your code and data. I'm able to reproduce your issue on my end as well. Can you please confirm that the following code, with TPOT using the TPOT light configuration, works on your end?
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from tpot import TPOTRegressor
import tpot
print(tpot.__version__)
train = pd.read_csv('train_A.csv')
test = pd.read_csv('test_A.csv')
y_train = train['y'].values
X_train = train.drop('y', axis=1)
X_train = X_train.values
X_test = test.values
tpot = TPOTRegressor(generations=5, population_size=20, n_jobs=1, verbosity=3,
max_time_mins=20, max_eval_time_mins=3, random_state=42,
config_dict='TPOT light')
tpot.fit(X_train, y_train)
y_predict = tpot.predict(X_test)
# this line should raise an error because y_predict doesn't have a to_csv function
y_predict.to_csv('tpot1_prediction.csv')
#print(tpot.score(X_test, y_test))
print(tpot.score(X_train, y_train))
tpot.export('tpot_merc_pipeline.py')
Yes, it works with the 'TPOT light' config.
I noticed a mistake on my side, not relevant to this issue: y_predict.to_csv('tpot1_prediction.csv') should be np.savetxt("tpot1_prediction.csv", y_predict, delimiter=",").
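For reference, a short sketch of that np.savetxt fix. The y_predict array here is a made-up stand-in for what tpot.predict(X_test) returns, and the file goes to a temporary directory so the example leaves nothing behind in the working directory.

```python
import os
import tempfile

import numpy as np

# Stand-in for the array returned by tpot.predict(X_test).
y_predict = np.array([101.5, 98.25, 110.0])

out_path = os.path.join(tempfile.mkdtemp(), "tpot1_prediction.csv")
np.savetxt(out_path, y_predict, delimiter=",")

# Reading the file back recovers the same values.
restored = np.loadtxt(out_path, delimiter=",")
print(np.allclose(restored, y_predict))
```

predict() returns a plain NumPy array, which has no to_csv method; wrapping it in pd.Series(y_predict) before calling to_csv would be another option.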
OK. So it seems that certain operators (or just one) freeze on your dataset, and they must be operators that aren't in the TPOT light configuration. We need to see which operators are in the default config. The first culprits I suspect are:
Do all of those work on your dataset?
I've checked them and they all seem to work (a few pipelines were optimized) with the preprocessors disabled. I suspect that polynomial features are the cause of the problem: the initial dataset has about 300 features, so with degree-2 polynomials we get roughly 45,000 additional features, and we have only 4209 data points, so that could also be a culprit. With the polynomials commented out, even n_jobs=-1 doesn't seem to be a problem.
'sklearn.preprocessing.PolynomialFeatures': {
'degree': [2],
'include_bias': [False],
'interaction_only': [False]
},
By the way, the warnings about max_depth are generated by AdaBoost.
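Instead of commenting the entry out in TPOT's source, the same effect can probably be had by passing a trimmed dictionary through the config_dict parameter. The sketch below only shows the dictionary filtering; the helper name without_operators and the three-entry example_config are hypothetical, and a real configuration lists many more operators with their hyperparameter grids.

```python
def without_operators(config, excluded):
    """Return a copy of a TPOT-style config dict minus the named operators."""
    return {name: params for name, params in config.items() if name not in excluded}

# Hypothetical, heavily shortened config for illustration only.
example_config = {
    "sklearn.linear_model.RidgeCV": {},
    "sklearn.preprocessing.StandardScaler": {},
    "sklearn.preprocessing.PolynomialFeatures": {
        "degree": [2],
        "include_bias": [False],
        "interaction_only": [False],
    },
}

trimmed = without_operators(example_config, {"sklearn.preprocessing.PolynomialFeatures"})
print(sorted(trimmed))

# The result could then be passed as TPOTRegressor(config_dict=trimmed, ...).
```

Because the helper builds a new dictionary, the original configuration stays intact and can be reused to re-enable the operator later.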
That makes sense. When you fit PolynomialFeatures, does it use up all of your system's memory?
No, it uses about 15 GB of 64 GB. I decreased the number of features to 20 and it runs smoothly. Most of my features contain mostly 0s and 1s, so it would be nice to have a factorization machine as an operator.
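To get a feel for the sizes involved, the number of columns produced by a polynomial expansion can be worked out with a little combinatorics. This is a back-of-the-envelope sketch, not TPOT or scikit-learn code; the function name poly_output_features is made up, and the memory figure assumes one dense float64 copy of the expanded matrix.

```python
from math import comb

def poly_output_features(n_features, degree=2, include_bias=False, interaction_only=False):
    """Count the columns produced by a full polynomial expansion."""
    if interaction_only:
        # Products of distinct features only: choose k of n for k = 1..degree.
        count = sum(comb(n_features, k) for k in range(1, degree + 1))
    else:
        # All monomials of total degree 1..degree.
        count = comb(n_features + degree, degree) - 1
    return count + (1 if include_bias else 0)

n_rows = 4209
for n in (300, 20):
    cols = poly_output_features(n)
    dense_gb = n_rows * cols * 8 / 1e9  # one dense float64 copy
    print(n, cols, round(dense_gb, 3))
```

With the settings from the config entry above, 300 input columns expand to 45,450 columns (about 1.5 GB per dense copy at 4209 rows), while 20 input columns expand to only 230. Several copies made during cross-validation could plausibly account for memory use in the range reported here.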
Closing this issue for now. Please feel free to re-open if you have any more questions or comments.
I'm running xgboost version 0.6, and I'm getting a bunch of error messages.
import sklearn.model_selection
import tpot
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    # .ix was removed in pandas 1.0; .loc selects every column except the target
    train_df.loc[:, train_df.columns != 'logerror'],
    train_df['logerror'],
    train_size = 0.75,
    test_size = 0.25
)
pipeline_optimizer = tpot.TPOTRegressor(
    n_jobs = -1,
    max_time_mins = 60 * 1,
    warm_start = True,
    verbosity = 100
)
pipeline_optimizer.fit(X_train.values, y_train.values)
pipeline_optimizer.export('tpot_exported_pipeline.py')
# score on NumPy arrays as well, matching what was passed to fit()
print(pipeline_optimizer.score(X_test.values, y_test.values))
28 operators have been imported by TPOT.
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 Unsupported set of arguments: The combination of penalty='l2' and loss='epsilon_insensitive' are not supported when dual=False, Parameters: penalty='l2', loss='epsilon_insensitive', dual=False
_pre_test decorator: _generate: num_test=0 l2 was provided as affinity. Ward can only work with euclidean distances.
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 Unsupported set of arguments: The combination of penalty='l2' and loss='epsilon_insensitive' are not supported when dual=False, Parameters: penalty='l2', loss='epsilon_insensitive', dual=False
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=1 Unsupported set of arguments: The combination of penalty='l2' and loss='epsilon_insensitive' are not supported when dual=False, Parameters: penalty='l2', loss='epsilon_insensitive', dual=False
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=1 Automatic alpha grid generation is not supported for l1_ratio=0. Please supply a grid by providing your estimator with the appropriate `alphas=` argument.
_pre_test decorator: _generate: num_test=2 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=3 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 Expected n_neighbors <= n_samples, but n_samples = 50, n_neighbors = 98
_pre_test decorator: _generate: num_test=0 X contains negative values.
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required.
_pre_test decorator: _generate: num_test=1 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required.
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'
_pre_test decorator: _generate: num_test=1 manhattan was provided as affinity. Ward can only work with euclidean distances.
_pre_test decorator: _generate: num_test=2 Expected n_neighbors <= n_samples, but n_samples = 50, n_neighbors = 89
_pre_test decorator: _generate: num_test=0 Unsupported set of arguments: The combination of penalty='l2' and loss='epsilon_insensitive' are not supported when dual=False, Parameters: penalty='l2', loss='epsilon_insensitive', dual=False
_pre_test decorator: _generate: num_test=0 precomputed was provided as affinity. Ward can only work with euclidean distances.
Should I be alarmed?
No, you shouldn't be concerned about that output. That's TPOT pre-testing pipelines and throwing out bad pipelines before fully evaluating them. I generally recommend using verbosity=2, as verbosity=3 is going to have a ton of extra output that probably won't be useful for you (except for the Pareto front scores, maybe).
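If the verbosity=3 output has already been captured, a quick tally of the pre-test rejection reasons can make it easier to see that the same few messages repeat. This is a sketch that assumes the line format shown in the logs in this thread; the helper name tally_pre_test_reasons is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as "_pre_test decorator: _generate: num_test=0 <reason>".
PRE_TEST = re.compile(r"^_pre_test decorator: \w+: num_test=\d+ (?P<reason>.*)$")

def tally_pre_test_reasons(lines):
    """Count how often each pre-test rejection reason appears in TPOT output."""
    counts = Counter()
    for line in lines:
        match = PRE_TEST.match(line.strip())
        if match:
            counts[match.group("reason")] += 1
    return counts

sample = [
    "28 operators have been imported by TPOT.",
    "_pre_test decorator: _generate: num_test=0 __init__() got an unexpected keyword argument 'max_depth'",
    "_pre_test decorator: _generate: num_test=1 __init__() got an unexpected keyword argument 'max_depth'",
    "_pre_test decorator: _generate: num_test=0 X contains negative values.",
]

for reason, count in tally_pre_test_reasons(sample).most_common():
    print(count, reason)
```

Lines that do not match the pattern, such as the operator count or the progress bar, are simply skipped.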
I am just trying to run the following code, where x_pca_all_imputed and y_train_log are pandas DataFrames.
However, I got this error/warning. The training is running, though...