EpistasisLab / tpot

A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
http://epistasislab.github.io/tpot/
GNU Lesser General Public License v3.0
9.68k stars 1.57k forks

Reproducibility of the export pipeline #1270

Open Iris7788 opened 2 years ago

Iris7788 commented 2 years ago

Context of the issue

I used TPOT to fit my dataset, but I got a different exported pipeline on each run.

Process to reproduce the issue

These are the steps I used to generate the exported pipeline. The shape of my dataset was (45, 478).

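import sklearn.model_selection
from tpot import TPOTRegressor

# X, y: the dataset described above, shape (45, 478); hold out 15% for testing
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1, test_size=0.15)
# random_state is fixed, but n_jobs=-1 evaluates pipelines in parallel
M1 = TPOTRegressor(generations=10, population_size=40, verbosity=2, random_state=42, n_jobs=-1, cv=5)
M1.fit(X_train, y_train)
M1.export('M1_pipeline.py')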

Current result

1. On the first run, the exported pipeline was a DecisionTreeRegressor:
    
    Generation 1 - Current best internal CV score: -0.6631261058133652
    Generation 2 - Current best internal CV score: -0.6631261058133652
    Generation 3 - Current best internal CV score: -0.6442071896861652
    Generation 4 - Current best internal CV score: -0.5726875496699182
    Generation 5 - Current best internal CV score: -0.5726875496699182
    Generation 6 - Current best internal CV score: -0.528473933017039
    Generation 7 - Current best internal CV score: -0.528473933017039
    Generation 8 - Current best internal CV score: -0.528473933017039
    Generation 9 - Current best internal CV score: -0.528473933017039
    Generation 10 - Current best internal CV score: -0.528473933017039

Best pipeline: DecisionTreeRegressor(Normalizer(input_matrix, norm=max), max_depth=3, min_samples_leaf=10, min_samples_split=9)

2. On the second run, the exported pipeline was an ExtraTreesRegressor:

    Generation 1 - Current best internal CV score: -0.6631261058133652
    Generation 2 - Current best internal CV score: -0.6631261058133652
    Generation 3 - Current best internal CV score: -0.6593793694494272
    Generation 4 - Current best internal CV score: -0.6524528603774085
    Generation 5 - Current best internal CV score: -0.636417747633282
    Generation 6 - Current best internal CV score: -0.633586381252542
    Generation 7 - Current best internal CV score: -0.633586381252542
    Generation 8 - Current best internal CV score: -0.633586381252542
    Generation 9 - Current best internal CV score: -0.633586381252542
    Generation 10 - Current best internal CV score: -0.633586381252542

Best pipeline: ExtraTreesRegressor(LinearSVR(input_matrix, C=1.0, dual=True, epsilon=0.01, loss=epsilon_insensitive, tol=1e-05), bootstrap=False, max_features=0.3, min_samples_leaf=6, min_samples_split=13, n_estimators=100)



Expected result

I would like the exported pipeline to be repeatable and stable across runs. My environment is Python 3.7.12 with TPOT 0.11.7.

Thank you very much for the development and maintenance of TPOT.

perib commented 2 years ago

If you set n_jobs to 1, reproducibility is more likely. With parallel processes, exact reproducibility becomes challenging because the order of execution has some randomness that cannot be controlled. It is something we are thinking about.
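
To illustrate why execution order can matter, here is a minimal sketch using only the Python standard library. It is not TPOT's actual implementation: it assumes a hypothetical evaluation step in which every task draws from one shared, seeded random generator, and the names `evaluate_shared`, `evaluate_per_task`, and the task orders are invented for the example. Under that assumption, the same seed gives different per-task results as soon as the tasks run in a different order, while deriving a separate generator per task removes the order dependence.

```python
import random


def evaluate_shared(order, seed=42):
    """Hypothetical: every task draws from ONE shared seeded generator,
    in whatever order the tasks happen to run."""
    rng = random.Random(seed)
    scores = {}
    for task in order:
        scores[task] = rng.random()
    return scores


def evaluate_per_task(order, seed=42):
    """Hypothetical alternative: each task gets its own generator derived
    from (seed, task id), so the run order no longer matters."""
    scores = {}
    for task in order:
        rng = random.Random(seed * 1000 + task)
        scores[task] = rng.random()
    return scores


serial_order = [0, 1, 2, 3]    # order with a single worker
parallel_order = [2, 0, 3, 1]  # one possible order with several workers (assumed)

shared_same = evaluate_shared(serial_order) == evaluate_shared(parallel_order)
per_task_same = evaluate_per_task(serial_order) == evaluate_per_task(parallel_order)

print("shared generator, same results across orders:", shared_same)
print("per-task generators, same results across orders:", per_task_same)
```

In this sketch the shared-generator comparison comes out `False` and the per-task comparison `True`. With a single worker the order is fixed, which is consistent with the suggestion to set n_jobs to 1. Whether TPOT's parallel backend behaves this way internally is not confirmed here.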

Iris7788 commented 2 years ago

I have received your email and will read it as soon as possible.