Please check issue #513; I think the issue is that there is no random seed in the exported pipeline. During the 5 generations of training & validation (testing), TPOT was reporting the average of cross-validation scores from cross_val_score, and the scoring function in cross_val_score is roc_auc, matching the scoring setting in TPOTClassifier.
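For context, a minimal sketch of what that internal CV score corresponds to: the mean of cross-validated roc_auc scores for a candidate pipeline. The data here are hypothetical random arrays, not the actual dataset:
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB

# Hypothetical binary data, for illustration only.
rng = np.random.RandomState(42)
X = rng.randint(0, 2, size=(100, 20))
y = rng.randint(0, 2, size=100)

# TPOT's "Current best internal CV score" is the mean of per-fold
# roc_auc scores like these for its best candidate pipeline.
cv_scores = cross_val_score(BernoulliNB(), X, y, cv=5, scoring='roc_auc')
print(cv_scores.mean())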
@weixuanfu If CV score: 0.9894097222222221 is an average of cross-validation scores from cross_val_score for the roc_auc scoring function set in the TPOTClassifier... it is even more impressive. I will run tpot.export('tpot_dorothea_auc.py') with random_state=42. Let's see... fingers crossed 😄
@shirishr ignore the last comment. If you cannot reproduce the results, could you please let me know more details about the dataset and code? I think another possible reason is the use of train_test_split in the exported code:
training_features, testing_features, training_target, testing_target = \
train_test_split(features, tpot_data['class'], random_state=42)
If you used all the data for training TPOTClassifier, the exported code cannot reproduce the results because it trains on only a subset of the dataset.
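A quick way to see the point (a sketch with a hypothetical array): train_test_split returns random subsets, so two scripts only train on the same rows if they use the same split, or skip splitting entirely:
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)

# Same random_state, same split; a different (or absent) split means the
# exported pipeline is fitted on different rows than the TPOT run used.
train_a, _ = train_test_split(X, random_state=42)
train_b, _ = train_test_split(X, random_state=42)
print((train_a == train_b).all())  # True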
@weixuanfu Here is the challenge: my optimizing code contains (random_state=42):
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, scoring='roc_auc', random_state=42, config_dict='TPOT light')
but tpot.export('tpot_dorothea_auc.py') produced a pipeline consisting of KNeighborsClassifier and BernoulliNB, neither of which accepts a random_state parameter. Attached are two data files, X.csv & Y.csv (the first 800 rows are meant for training and the last 350 rows for testing, but for train_test_split we could use all 1150 rows). My code script is For_Weixuan.py and the exported script is tpot_dorothea_auc.py. All are zipped together in together.zip.
See what you can do. Thanks a million!!
@weixuanfu My apologies sir, please add a header row at the top of Y.csv with the column label "class". Once again... thanks a lot.
Below is my code for reproducing the results:
import numpy as np
import pandas as pd
from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
df_X = pd.read_csv('X.csv', index_col=False)
df_Y = pd.read_csv('Y.csv', index_col=False)
X_train, X_test, y_train, y_test = train_test_split(np.array(df_X.values), np.array(df_Y.values).ravel(), train_size=800/1150, test_size=350/1150, random_state=42)
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, scoring='roc_auc', random_state=42, config_dict='TPOT light')
tpot.fit(X_train, y_train)
tpot.export('tpot_dorothea_auc_test.py')
print(tpot.score(X_train, y_train))
- Best pipeline: BernoulliNB(input_matrix, BernoulliNB__alpha=0.1, BernoulliNB__fit_prior=True)
- Code for reproducing the result:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import roc_auc_score
from sklearn.metrics import SCORERS
from sklearn.model_selection import train_test_split

exported_pipeline = BernoulliNB(alpha=0.1, fit_prior=True)

df_X = pd.read_csv('X.csv', index_col=False)
df_Y = pd.read_csv('Y.csv', index_col=False)

X_train, X_test, y_train, y_test = train_test_split(np.array(df_X.values), np.array(df_Y.values).ravel(), train_size=800/1150, test_size=350/1150, random_state=42)

exported_pipeline.fit(X_train, y_train)
y_scores = exported_pipeline.predict_proba(X_train)
print('Method 1', roc_auc_score(y_train, y_scores[:, 1]))

score_export = SCORERS['roc_auc'](exported_pipeline, X_train, y_train)
print('Method 2', score_export)
Two key points for reproducing the result:
1. [`train_test_split`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) splits arrays or matrices into **random** train and test subsets
2. [roc_auc_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score) only takes target scores (`y_score`) instead of target predictions (`y_pred`); see the example below
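To illustrate point 2 with hypothetical data (not the actual dataset): hard 0/1 predictions discard the ranking information that the AUC is computed from, so the two calls below generally disagree:
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import BernoulliNB

# Hypothetical binary data, for illustration only.
rng = np.random.RandomState(42)
X = rng.randint(0, 2, size=(200, 10))
y = rng.randint(0, 2, size=200)

clf = BernoulliNB().fit(X, y)

# Correct: the positive-class probability is a target score.
print(roc_auc_score(y, clf.predict_proba(X)[:, 1]))

# Misleading: hard predictions collapse the scores to 0/1.
print(roc_auc_score(y, clf.predict(X)))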
Awesome... Thank you very much. Both methods worked even without using train_test_split, splitting instead as:
X_train = np.array(df_X.iloc[0:800].values)
X_test = np.array(df_X.iloc[800:1150].values)
y_train = np.array(df_Y.iloc[0:800].values).ravel()
y_test = np.array(df_Y.iloc[800:1150].values).ravel()
Thank you so much. The results are really promising.
You're welcome. Good to know that you got good results.
I am getting a similar problem. I have train and test data in separate files, and I pass the training data to the TPOT fit function. The TPOT results show a roc_auc of 0.79; on running the exported code I get only 0.6.
SEED = 42
np.random.seed(SEED)
train_df = load_data('train.csv')
test_df = load_data('test.csv')
# some preprocessing code
X_train = train_df.drop('target', axis=1)
Y_train = train_df['target']
X_test = test_df.drop('target', axis=1)
Y_test = test_df['target']
tpot = TPOTClassifier(verbosity=2,
scoring="roc_auc",
random_state=SEED,
periodic_checkpoint_folder="tpot_ckd_ckpt",
memory='auto',
early_stop=3,
n_jobs=-2,
generations=70,
population_size=100)
tpot.fit(X_train,Y_train)
tpot.export('tpot_ckd_pipeline.py')
score = tpot.score(X_test, Y_test)
print('Score:', score)
SEED = 42
np.random.seed(SEED)
train_df = load_data('train.csv')
test_df = load_data('test.csv')
# same preprocessing code as above
X_train = train_df.drop('target', axis=1)
Y_train = train_df['target']
X_test = test_df.drop('target', axis=1)
Y_test = test_df['target']
exported_pipeline = RandomForestClassifier(bootstrap=True, criterion="gini", max_features=0.05, min_samples_leaf=11, min_samples_split=4, n_estimators=100, random_state=SEED)
exported_pipeline.fit(X_train, Y_train)
results = exported_pipeline.predict(X_test)
from sklearn.metrics import roc_auc_score
roc_auc_score(Y_test, results)
This gives train AUC 0.67, test AUC 0.6.
Is there something missing? Why do I observe this behavior? I've run TPOT multiple times and the behavior is consistent.
The issue is that the random_state in tpot.fitted_pipeline_ was not set to 42 by default, similar to issue #513. Adding the code below to the TPOT training file should reproduce the results.
tpot.fit(X_train, Y_train)
tpot.export('tpot_ckd_pipeline.py')
tpot._set_param_recursive(tpot.fitted_pipeline_.steps, 'random_state', 42)
# refit with random_state=42
tpot.fitted_pipeline_.fit(X_train, Y_train)
score = tpot.score(X_test, Y_test)
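Conceptually, _set_param_recursive walks the steps of the fitted pipeline and sets random_state on every estimator that exposes it. A simplified sketch of the idea (not TPOT's actual implementation, which also recurses into nested estimators):
from sklearn.base import BaseEstimator

def set_random_state_on_steps(steps, seed):
    # steps is a list of (name, estimator) pairs, as in Pipeline.steps.
    for _, est in steps:
        if isinstance(est, BaseEstimator) and 'random_state' in est.get_params():
            est.set_params(random_state=seed)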
But I’ve set the random state to 42 in my exported pipeline. Shouldn’t it work either way? The issue is only in the exported Python file; the score in the TPOT training file works fine.
I thought that score = tpot.score(X_test, Y_test) in the "TPOT training file" was different from roc_auc_score(Y_test, results) in the "training script on exported pipeline". Which line in the exported code gives the "train auc 0.67"? If you mean that you got a lower test score from roc_auc_score(Y_test, results) but a higher training score from exported_pipeline.score(X_train, Y_train), that would indicate an overfitting issue in the exported model.
Sorry, that line is missing. The training AUC was 0.67 and the test AUC was 0.6 in the exported pipeline script.
About "roc_auc of 0.79" from "Tpot training file", do you mean the score in the log during tpot.fit(X_train,Y_train)
? If so, TPOT was reporting average of cross-validation scores from cross_val_score
.
Shouldn’t a good average CV score give me a decent score after the export? The drop in score is too large; doesn’t k-fold show how well the model generalises?
Sometimes k-fold CV fails to produce a generalized model. For example, k-fold may perform badly when applied to time series data. You could try other CV iterators.
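For example, a sketch of swapping the CV strategy, assuming the data has a temporal ordering: TPOTClassifier's cv parameter accepts a scikit-learn CV splitter in place of the default k-fold:
from sklearn.model_selection import TimeSeriesSplit
from tpot import TPOTClassifier

# Hypothetical settings; the relevant change is the cv splitter.
tpot = TPOTClassifier(generations=5, population_size=50,
                      scoring='roc_auc', random_state=42,
                      cv=TimeSeriesSplit(n_splits=5))
# tpot.fit(X_train, Y_train) would then score candidate pipelines with
# forward-chaining splits instead of shuffled k-fold splits.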
Also, there is another open related issue #804.
General introduction
I am working with a dataset called Dorothea hosted at UCI. I tried TPOTClassifier with scoring='roc_auc' and random_state=42 and got these results:
Generation 1 - Current best internal CV score: 0.9860243055555555
Generation 2 - Current best internal CV score: 0.9860243055555555
Generation 3 - Current best internal CV score: 0.988107638888889
Generation 4 - Current best internal CV score: 0.988107638888889
Generation 5 - Current best internal CV score: 0.9894097222222221
Best pipeline: BernoulliNB(KNeighborsClassifier(input_matrix, KNeighborsClassifier__n_neighbors=3, KNeighborsClassifier__p=2, KNeighborsClassifier__weights=distance), BernoulliNB__alpha=DEFAULT, BernoulliNB__fit_prior=True)
0.989190251572
Context of the issue
These were exciting results, so I used the tpot.export file named 'tpot_dorothea_auc.py'. Now, everything being the same, I expected the ROC AUC to be close to 0.98919. Instead, when I print roc_auc_score I get a value of 0.807241250931.
Issue and relevancy
Why the discrepancy? During the 5 generations of training & validation (testing), is TPOT reporting roc_auc or the accuracy score? I know that in 'tpot_dorothea_auc.py' I am explicitly calling for roc_auc_score to be printed.
Any clarification is welcome.