EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0

Perfect score. Bad prediction. #804

Open jarlva opened 5 years ago

jarlva commented 5 years ago

In a binary classification problem, TPOT is able to get a perfect score! Both the confusion matrix and the classification report show all 1's. However, the prediction score on the test set is around 0.6. I tried the following scoring options; none helped.

```python
from sklearn.model_selection import StratifiedKFold
from tpot import TPOTClassifier

Score = ['balanced_accuracy', 'f1', 'f1_macro', 'f1_weighted', 'average_precision']

# xx is set to each entry of Score in turn; cpus is the number of CPU cores
tpot = TPOTClassifier(generations=3, population_size=20, verbosity=2,
                      random_state=42, early_stop=3, scoring=xx,
                      max_time_mins=10, n_jobs=cpus,
                      cv=StratifiedKFold(n_splits=15))
```

Any idea what's the cause/how to fix?

weixuanfu commented 5 years ago

Hmm, it seems like overfitting. The `subsample` parameter in the TPOT API may help.
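
A minimal sketch of that suggestion (the 0.75 value is just an illustrative choice, not a documented recommendation):

```python
from tpot import TPOTClassifier

# subsample < 1.0 fits each candidate pipeline on a random fraction of the
# training rows during the search, which can reduce overfitting.
tpot = TPOTClassifier(subsample=0.75, random_state=42, verbosity=2)
```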

GinoWoz1 commented 5 years ago

Hey @jheffez, also, what is your CV method? That could have an impact.

In addition, I have been using Pareto-front-optimized pipelines and find that in most cases they overfit. Usually I test pipelines that are 2 or 3 levels deep (as compared to 5-6+ levels) and can get better-performing pipelines that don't overfit.

@weixuanfu what would you suggest for a subsample ratio?

weixuanfu commented 5 years ago

Hmm, we didn't detect many overfitting cases in our tests before. Could you please provide a demo for reproducing this issue?

Usually I use `subsample=0.75` or specify train/test splits via the `cv` parameter to deal with overfitting issues.
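
For example, a sketch of the second option, assuming `X` and `y` are already defined and that the `cv` parameter follows scikit-learn's convention of also accepting an iterable of (train, valid) index pairs:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

# Pin a single fixed validation split instead of k-fold CV.
idx = np.arange(len(y))
train_idx, valid_idx = train_test_split(idx, test_size=0.25,
                                        stratify=y, random_state=42)

tpot = TPOTClassifier(generations=3, population_size=20,
                      subsample=0.75,               # evolve on 75% of the rows
                      cv=[(train_idx, valid_idx)],  # one fixed train/valid split
                      random_state=42, verbosity=2)
```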

rhiever commented 5 years ago

Tell us more about your data. How big is your training dataset and testing dataset (rows, columns)?

jarlva commented 5 years ago

My training dataset has 102 samples of class 1 and 96 samples of class 0. As for subsample, since the 0's and 1's are about even, I don't think imbalance is the issue. I'll try to increase the number of samples and play with the CV split parameters. Is there a way to set TPOT's scoring based on the prediction score?

rhiever commented 5 years ago

What about the features? How many / what kind of features?

15 CV folds is likely too many as well. That means each test fold has only ~13 samples. With a dataset that size, I'd say go down to 5 folds in CV.
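
A sketch of that change against the original snippet (other settings omitted for brevity):

```python
from sklearn.model_selection import StratifiedKFold
from tpot import TPOTClassifier

# With ~200 rows, 5 stratified folds leave ~40 samples in each test fold.
tpot = TPOTClassifier(generations=3, population_size=20, verbosity=2,
                      random_state=42, early_stop=3,
                      cv=StratifiedKFold(n_splits=5))
```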

jarlva commented 5 years ago

I reduced CV to 5 and am still able to get a perfect score (all 1's). Yet, the prediction is not good.

Is there a way to score/quantify using the prediction score?

rhiever commented 5 years ago

Is the data being shuffled? Are all the 0's and 1's grouped together in the dataset?

This smells like a data issue to me.

jarlva commented 5 years ago

Yes, I randomize and stratify the train/test data to get even proportions of 0's and 1's. I wish someone could answer the question: is there a way to score/quantify using the prediction score?

rhiever commented 5 years ago

Can you please clarify what you mean by 'score/quantify using the prediction score'? Do you mean use the testing score as the metric for optimization?

jarlva commented 5 years ago

Yep, since the training score in this case seems to overfit. It would be great to be able to choose which score to use: the training score or the prediction (test) score.

rhiever commented 5 years ago

You shouldn't use the testing score as the optimization criterion, as the algorithm will simply overfit to the testing data. That is why the responses in this issue have focused on addressing the underlying issues that could cause the algorithm to overfit on the training data, such as the CV scheme and how the data is organized.

What happens if you fit a RandomForestClassifier to your training data? What's the 5-fold CV score, and what's the score on the testing set?
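
Something like this sketch, assuming `X_train`, `y_train`, `X_test`, `y_test` are already split out:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score

clf = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold CV score on the training data...
cv_scores = cross_val_score(clf, X_train, y_train,
                            cv=StratifiedKFold(n_splits=5),
                            scoring='balanced_accuracy')
print("5-fold CV (train):", cv_scores.mean())

# ...versus the score on the held-out test set.
clf.fit(X_train, y_train)
print("Test set:", balanced_accuracy_score(y_test, clf.predict(X_test)))
```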