Reproducibility of the export pipeline

Iris7788 commented 2 years ago

## Context of the issue

I used TPOT to fit my dataset, but I got a different exported pipeline on each run.

## Process to reproduce the issue

These are the steps I used to generate the exported pipeline. The shape of my dataset is (45, 478).

import sklearn.model_selection
from tpot import TPOTRegressor

# X and y are my dataset (features and target), loaded beforehand
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1, test_size=0.15)
M1 = TPOTRegressor(generations=10, population_size=40, verbosity=2, random_state=42, n_jobs=-1, cv=5)
M1.fit(X_train, y_train)
M1.export('M1_pipeline.py')

## Current result

1. On the first run, the exported pipeline was a DecisionTreeRegressor


Generation 1 - Current best internal CV score: -0.6631261058133652
Generation 2 - Current best internal CV score: -0.6631261058133652
Generation 3 - Current best internal CV score: -0.6442071896861652
Generation 4 - Current best internal CV score: -0.5726875496699182
Generation 5 - Current best internal CV score: -0.5726875496699182
Generation 6 - Current best internal CV score: -0.528473933017039
Generation 7 - Current best internal CV score: -0.528473933017039
Generation 8 - Current best internal CV score: -0.528473933017039
Generation 9 - Current best internal CV score: -0.528473933017039
Generation 10 - Current best internal CV score: -0.528473933017039

Best pipeline: DecisionTreeRegressor(Normalizer(input_matrix, norm=max), max_depth=3, min_samples_leaf=10, min_samples_split=9)

2. On the second run, the exported pipeline was an ExtraTreesRegressor

Generation 1 - Current best internal CV score: -0.6631261058133652
Generation 2 - Current best internal CV score: -0.6631261058133652
Generation 3 - Current best internal CV score: -0.6593793694494272
Generation 4 - Current best internal CV score: -0.6524528603774085
Generation 5 - Current best internal CV score: -0.636417747633282
Generation 6 - Current best internal CV score: -0.633586381252542
Generation 7 - Current best internal CV score: -0.633586381252542
Generation 8 - Current best internal CV score: -0.633586381252542
Generation 9 - Current best internal CV score: -0.633586381252542
Generation 10 - Current best internal CV score: -0.633586381252542

Best pipeline: ExtraTreesRegressor(LinearSVR(input_matrix, C=1.0, dual=True, epsilon=0.01, loss=epsilon_insensitive, tol=1e-05), bootstrap=False, max_features=0.3, min_samples_leaf=6, min_samples_split=13, n_estimators=100)



## Expected result
I would like the exported pipeline to be repeatable and stable across runs. My environment is Python 3.7.12 with TPOT 0.11.7.

Thank you very much for the development and maintenance of TPOT.

perib commented 2 years ago

If you set n_jobs to 1, reproducibility is more likely. With parallel processes, exact reproducibility gets challenging because the order of execution has some randomness that we cannot control. It is something we are thinking about.
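
As a rough sketch of how one might check this, the snippet below reuses the settings from the original report but runs in a single process, and compares the exported files from two runs. The helper names `fit_and_export` and `same_export` and the output file names are made up for illustration and are not part of TPOT. Identical exports are not guaranteed, only more likely with `n_jobs=1`.

```python
import filecmp


def fit_and_export(X_train, y_train, out_path, n_jobs=1):
    """Fit TPOT with a fixed seed and export the best pipeline to out_path.

    Hypothetical helper: the TPOTRegressor arguments mirror the original report.
    """
    # Imported here so the comparison helper below also works without TPOT installed.
    from tpot import TPOTRegressor

    model = TPOTRegressor(
        generations=10,
        population_size=40,
        verbosity=2,
        random_state=42,  # same seed on every run
        n_jobs=n_jobs,    # 1 = single process, as suggested above
        cv=5,
    )
    model.fit(X_train, y_train)
    model.export(out_path)
    return out_path


def same_export(path_a, path_b):
    """Return True if two exported pipeline files have identical contents."""
    # shallow=False forces a byte-by-byte comparison instead of comparing file metadata.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

To use it, call `fit_and_export(X_train, y_train, 'run1.py')` and then `fit_and_export(X_train, y_train, 'run2.py')` in separate runs, and check `same_export('run1.py', 'run2.py')`. If the two files still differ with `n_jobs=1`, that would be worth reporting here along with the library versions.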

Iris7788 commented 2 years ago

I have received your email and will check it as soon as possible~
