EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

Can not reproduce results #563

Closed shirishr closed 7 years ago

shirishr commented 7 years ago

General introduction

I am working with a dataset called Dorothea hosted at UCI. I tried TPOTClassifier with scoring='roc_auc' and random_state=42 and got these results:

```
Generation 1 - Current best internal CV score: 0.9860243055555555
Generation 2 - Current best internal CV score: 0.9860243055555555
Generation 3 - Current best internal CV score: 0.988107638888889
Generation 4 - Current best internal CV score: 0.988107638888889
Generation 5 - Current best internal CV score: 0.9894097222222221

Best pipeline: BernoulliNB(KNeighborsClassifier(input_matrix, KNeighborsClassifier__n_neighbors=3, KNeighborsClassifier__p=2, KNeighborsClassifier__weights=distance), BernoulliNB__alpha=DEFAULT, BernoulliNB__fit_prior=True)
0.989190251572
```

Context of the issue

These were exciting results, so I used the file exported by tpot.export, named 'tpot_dorothea_auc.py'. With everything else being the same, I expected the ROC AUC to be close to 0.98919. Instead, when I print roc_auc_score I get 0.807241250931.

Issue and relevancy

Why the discrepancy? During the 5 generations of training & validation (testing), is TPOT reporting roc_auc or the accuracy score? I know that in 'tpot_dorothea_auc.py' I am explicitly calling for roc_auc_score to be printed.

Any clarification is welcome

weixuanfu commented 7 years ago

Please check issue #513; I think the problem is that no random seed is set in the exported pipeline.

During the 5 generations of training & validation (testing), TPOT was reporting the average of the cross-validation scores from cross_val_score, and the scoring function in cross_val_score was roc_auc, matching the scoring setting in TPOTClassifier.
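As an illustration, here is a minimal, self-contained sketch of what that reported number corresponds to; the synthetic dataset and the BernoulliNB stand-in are assumptions for the example, not TPOT's actual source:

```python
# The "Current best internal CV score" is the mean of cross_val_score for a
# candidate pipeline, computed with the scoring passed to TPOTClassifier.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB

X, y = make_classification(n_samples=200, random_state=42)
candidate = BernoulliNB()  # stand-in for one candidate pipeline

cv_scores = cross_val_score(candidate, X, y, cv=5, scoring='roc_auc')
print(cv_scores.mean())  # this mean is what TPOT prints per generation
```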

shirishr commented 7 years ago

@weixuanfu If CV score: 0.9894097222222221 is an average of cross-validation scores from cross_val_score for the scoring function roc_auc as set in the TPOTClassifier... it is even more impressive. I will run the exported script ('tpot_dorothea_auc.py') with random_state=42. Let's see... fingers crossed 😄

weixuanfu commented 7 years ago

@shirishr Ignore the last comment. If you cannot reproduce the results, could you please give me more details about the dataset and code? I think another possible cause is the use of train_test_split in the exported code:

```python
training_features, testing_features, training_target, testing_target = \
    train_test_split(features, tpot_data['class'], random_state=42)
```

If you used all of the data when training TPOTClassifier, the exported code cannot reproduce the results, because it trains on only a subset of the dataset.
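Note that the split itself is deterministic once the seed is fixed, so any mismatch comes from *what* gets split, not from the splitting. A small sketch with toy arrays (not the Dorothea data) showing this:

```python
# With a fixed random_state, train_test_split returns the identical split on
# every call, so two scripts using random_state=42 see the same subsets.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # toy feature matrix
y = np.arange(10)                 # toy targets

split_a = train_test_split(X, y, random_state=42)
split_b = train_test_split(X, y, random_state=42)
assert all(np.array_equal(p, q) for p, q in zip(split_a, split_b))
```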

shirishr commented 7 years ago

@weixuanfu Here is the challenge: my optimizing code contains (random_state=42):

```python
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2,
                      scoring='roc_auc', random_state=42, config_dict='TPOT light')
```

but the exported script ('tpot_dorothea_auc.py') has a pipeline consisting of KNeighborsClassifier and BernoulliNB, neither of which accepts a random_state parameter. Attached are two data files, X.csv & Y.csv (the first 800 rows are meant for training and the last 350 rows for testing, but for train_test_split we could use all 1150 rows). My script is For_Weixuan.py and the exported script is tpot_dorothea_auc.py. All are zipped in together.zip.

together.zip

See what best you can do. Thanks a million!!

shirishr commented 7 years ago

@weixuanfu My apologies sir, please add a row at the top of Y.csv with the column label "class". Once again... thanks a lot.

weixuanfu commented 7 years ago

Below is my code for reproducing the results:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

df_X = pd.read_csv('X.csv', index_col=False)
df_Y = pd.read_csv('Y.csv', index_col=False)

X_train, X_test, y_train, y_test = train_test_split(np.array(df_X.values),
                                                    np.array(df_Y.values).ravel(),
                                                    train_size=800/1150,
                                                    test_size=350/1150,
                                                    random_state=42)

tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2,
                      scoring='roc_auc', random_state=42, config_dict='TPOT light')
tpot.fit(X_train, y_train)
tpot.export('tpot_dorothea_auc_test.py')

print(tpot.score(X_train, y_train))
```

- Best pipeline: BernoulliNB(input_matrix, BernoulliNB__alpha=0.1, BernoulliNB__fit_prior=True)

- Code for reproducing the result:

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import roc_auc_score
from sklearn.metrics import SCORERS
from sklearn.model_selection import train_test_split

exported_pipeline = BernoulliNB(alpha=0.1, fit_prior=True)

df_X = pd.read_csv('X.csv', index_col=False)
df_Y = pd.read_csv('Y.csv', index_col=False)

# train_test_split can generate the same train/test split with the same random state (42)
X_train, X_test, y_train, y_test = train_test_split(np.array(df_X.values),
                                                    np.array(df_Y.values).ravel(),
                                                    train_size=800/1150,
                                                    test_size=350/1150,
                                                    random_state=42)

# method 1
exported_pipeline.fit(X_train, y_train)
y_scores = exported_pipeline.predict_proba(X_train)
print('Method 1', roc_auc_score(y_train, y_scores[:, 1]))

# method 2: a scorer from SCORERS is called as scorer(estimator, X, y)
score_export = SCORERS['roc_auc'](exported_pipeline, X_train, y_train)
print('Method 2', score_export)
```



Two key points for reproducing the result:
1. [`train_test_split`](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) splits arrays or matrices into **random** train and test subsets

2. [roc_auc_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html#sklearn.metrics.roc_auc_score) only takes target scores (`y_score`), not hard class predictions (`y_pred`); see the sketch after this list.
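A minimal illustration of point 2, using a synthetic dataset and a stand-in classifier (both are assumptions for the example, not the thread's data):

```python
# roc_auc_score needs continuous scores that rank the samples; hard 0/1
# predictions collapse that ranking and typically deflate the reported AUC.
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import BernoulliNB

X, y = make_classification(n_samples=500, random_state=42)
clf = BernoulliNB().fit(X, y)

auc_from_scores = roc_auc_score(y, clf.predict_proba(X)[:, 1])  # correct usage
auc_from_labels = roc_auc_score(y, clf.predict(X))              # y_pred misuse
print(auc_from_scores, auc_from_labels)  # the label-based value is usually lower
```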
shirishr commented 7 years ago

Awesome... thank you very much. Both methods worked even without using train_test_split, replacing it with:

```python
X_train, X_test, y_train, y_test = (np.array(df_X.iloc[0:800].values),
                                    np.array(df_X.iloc[800:1150].values),
                                    np.array((df_Y.iloc[0:800].values).ravel()),
                                    np.array((df_Y.iloc[800:1150].values).ravel()))
```

Thank you so much. The results are really promising

weixuanfu commented 7 years ago

You're welcome. Good to know that you got good results.

SuryaThiru commented 5 years ago

I am getting a similar problem. I have train and test data in separate files, and I pass the training data to the tpot fit function. TPOT's results show an roc_auc of 0.79; on running the exported code I get only 0.6.

Tpot training file

```python
SEED = 42
np.random.seed(SEED)

train_df = load_data('train.csv')
test_df = load_data('test.csv')

# some preprocessing code

X_train = train_df.drop('target', axis=1)
Y_train = train_df['target']

X_test = test_df.drop('target', axis=1)
Y_test = test_df['target']

tpot = TPOTClassifier(verbosity=2,
                      scoring="roc_auc",
                      random_state=SEED,
                      periodic_checkpoint_folder="tpot_ckd_ckpt",
                      memory='auto',
                      early_stop=3,
                      n_jobs=-2,
                      generations=70,
                      population_size=100)

tpot.fit(X_train, Y_train)
tpot.export('tpot_ckd_pipeline.py')

score = tpot.score(X_test, Y_test)
print('Score:', score)
```

Training script on exported pipeline

```python
SEED = 42
np.random.seed(SEED)

train_df = load_data('train.csv')
test_df = load_data('test.csv')

# same preprocessing code as above

X_train = train_df.drop('target', axis=1)
Y_train = train_df['target']

X_test = test_df.drop('target', axis=1)
Y_test = test_df['target']

exported_pipeline = RandomForestClassifier(bootstrap=True, criterion="gini",
                                           max_features=0.05, min_samples_leaf=11,
                                           min_samples_split=4, n_estimators=100,
                                           random_state=SEED)

exported_pipeline.fit(X_train, Y_train)

results = exported_pipeline.predict(X_test)

from sklearn.metrics import roc_auc_score
roc_auc_score(Y_test, results)
```

This gives train auc 0.67 and test auc 0.6.

Is there something missing? Why do I observe this behavior? I've run TPOT multiple times and the behavior is consistent.

weixuanfu commented 5 years ago

The issue is that the random_state in tpot.fitted_pipeline_ was not set to 42 by default, similar to issue #513. Adding the code below to the TPOT training file should reproduce the results.

```python
tpot.fit(X_train, Y_train)
tpot.export('tpot_ckd_pipeline.py')

# set random_state on every step of the best pipeline, then refit
tpot._set_param_recursive(tpot.fitted_pipeline_.steps, 'random_state', 42)
# refit with random_state=42
tpot.fitted_pipeline_.fit(X_train, Y_train)

score = tpot.score(X_test, Y_test)
```
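(For context: `_set_param_recursive` walks the steps of the fitted pipeline and sets `random_state` on every estimator that exposes it, making the refit deterministic so the score can be reproduced.)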
SuryaThiru commented 5 years ago

But I've set the random state to 42 in my exported pipeline. Shouldn't it work either way? The issue is only in the exported Python file; the score in the TPOT training file works fine.

weixuanfu commented 5 years ago

I thought that score = tpot.score(X_test, Y_test) in the "Tpot training file" was different from roc_auc_score(Y_test, results) in the "Training script on exported pipeline".

Which line in the exported code gives the "train auc 0.67"?

If you mean that you got a lower test score from roc_auc_score(Y_test, results) but a higher training score from exported_pipeline.score(X_train, Y_train), that would indicate an overfitting issue in the exported model.

SuryaThiru commented 5 years ago

Sorry, that line is missing. The training auc was 0.67 and the test auc was 0.6 in the exported pipeline script.

weixuanfu commented 5 years ago

About "roc_auc of 0.79" from "Tpot training file", do you mean the score in the log during tpot.fit(X_train,Y_train)? If so, TPOT was reporting average of cross-validation scores from cross_val_score.

SuryaThiru commented 5 years ago

Shouldn't a good average CV score give me a decent score after the export? The drop in score seems too large; doesn't k-fold show how well the model generalizes?

weixuanfu commented 5 years ago

Sometimes k-fold CV fails to yield a model that generalizes. For example, k-fold can perform badly when applied to time-series data. You could try other CV iterators, along the lines of the sketch below.
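A minimal sketch of swapping in a different CV iterator through TPOT's `cv` parameter; `TimeSeriesSplit` is just one example choice, and the other settings mirror the ones used above:

```python
# Pass a scikit-learn CV iterator instead of the default k-fold.
# TimeSeriesSplit preserves temporal order, which plain k-fold does not.
from sklearn.model_selection import TimeSeriesSplit
from tpot import TPOTClassifier

tpot = TPOTClassifier(generations=70,
                      population_size=100,
                      scoring='roc_auc',
                      cv=TimeSeriesSplit(n_splits=5),  # any CV iterator works here
                      random_state=42,
                      verbosity=2)
# then tpot.fit(X_train, Y_train) as before
```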

Also, there is another open related issue #804.