jarlva opened this issue 5 years ago
Hmm, it seems like overfitting. The `subsample` parameter in the TPOT API may help.
Hey @jheffez, also what is your CV method? That could have an impact.
In addition, I have been using Pareto-front-optimized pipelines and find that in most cases they overfit. Usually I test pipelines that are 2 or 3 levels deep (as compared to 5-6+ levels) and can get better-performing pipelines that don't overfit.
@weixuanfu what would you suggest for a `subsample` ratio?
Hmm, we didn't detect many overfitting cases in our tests before. Could you please provide a demo for reproducing this issue?
Usually I use `subsample=0.75`, or specify train/test splits via the `cv` parameter, to deal with overfitting issues.
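For example, something like this (a rough sketch; `X_train`, `y_train`, `X_test`, `y_test` stand in for your own splits):

```python
from sklearn.model_selection import StratifiedKFold
from tpot import TPOTClassifier

# Use only 75% of the training samples per generation (subsample) and an
# explicit stratified CV scheme to make overfitting less likely.
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2, random_state=42,
                      subsample=0.75,
                      cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42))
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
```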
Tell us more about your data. How big are your training and testing datasets (rows, columns)?
My training dataset has 102 samples of class 1 and 96 samples of class 0.
As far as `subsample` goes, since the 0/1 samples are about even, I don't think it's an issue.
I'll try increasing the number of samples and playing with the cv split parameters.
Is there a way to set the TPOT scoring/quantifying based on the prediction (test) score?
What about the features? How many / what kind of features?
15 CV folds is likely too many as well: with ~198 training samples, each test fold has only ~13 samples. With a dataset that size, I'd go down to 5 folds in CV.
I reduced CV to 5 and am able to get a perfect score (all 1's). Yet, the predictions are still not good.
Is there a way to score/quantify using the prediction score?
Is the data being shuffled? Are all the 0's and 1's grouped together in the dataset?
This smells like a data issue to me.
Yes, I randomize and stratify the train/test data to get even proportions of 0's and 1's. I wish someone could answer the question: is there a way to score/quantify using the prediction score?
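For reference, the split looks roughly like this (a sketch; `X` and `y` stand in for my data):

```python
from sklearn.model_selection import train_test_split

# Shuffled, stratified split keeps the 0/1 proportions roughly equal
# in both the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, shuffle=True, random_state=42)
```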
Can you please clarify what you mean by 'score/quantify using the prediction score'? Do you mean use the testing score as the metric for optimization?
Yep, since the train score in this case seems to overfit. It would be great to be able to choose which score to use: the train score or the prediction (test) score.
You shouldn't use the testing score as the optimization criteria, as the algorithm will just overfit to the testing data. That is why the responses in this issue have focused on addressing the underlying issues that would cause the algorithm to overfit on the training data, such as CV scheme and data organization.
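If you want a sanity check of out-of-sample performance during development without touching the final test set, one option (plain scikit-learn, not a TPOT feature) is to carve a validation set out of the training data, roughly like this:

```python
from sklearn.model_selection import train_test_split

# Hold back part of the training data as a validation set; the test set is
# only scored once, at the very end.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=42)

tpot.fit(X_tr, y_tr)                            # optimize on the reduced training set
print("validation:", tpot.score(X_val, y_val))  # check generalization here
print("test:", tpot.score(X_test, y_test))      # report this only once
```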
What happens if you fit a `RandomForestClassifier` to your training data? What's the 5-fold CV score, and what's the score on the testing set?
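Something along these lines (a quick baseline sketch; `X_train`/`y_train`/`X_test`/`y_test` are your existing splits):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rf = RandomForestClassifier(n_estimators=100, random_state=42)

# 5-fold stratified CV accuracy on the training data
cv_scores = cross_val_score(rf, X_train, y_train,
                            cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42))
print("5-fold CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

# Accuracy on the held-out test set
rf.fit(X_train, y_train)
print("Test accuracy: %.3f" % rf.score(X_test, y_test))
```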
In a binary classification problem, TPOT is able to get a perfect training score! Both the confusion matrix and the classification report show all 1's. However, the score on the prediction (test) set is around 0.6. I tried each of the following scoring options; none helped.
```python
scores = ['balanced_accuracy', 'f1', 'f1_macro', 'f1_weighted', 'average_precision']
for scoring in scores:  # each scoring option was tried in turn
    tpot = TPOTClassifier(generations=3, population_size=20, verbosity=2, random_state=42,
                          early_stop=3, scoring=scoring, max_time_mins=10, n_jobs=cpus,
                          cv=StratifiedKFold(n_splits=15))
```
Any idea what the cause is and how to fix it?