dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Retrain same data results in different accuracy #4986

Closed Balu2 closed 4 years ago

Balu2 commented 4 years ago

System information

Issue: When I retrain a model with ML.NET Model Builder using the same data and the same training parameters, the Micro- and Macro-Accuracy differ between runs.

Source code / logs

First training:

-----------------------------------------------------------------------------------------------
|                                   Top 5 models explored                                     |
-----------------------------------------------------------------------------------------------
|     Trainer                               MicroAccuracy  MacroAccuracy  Duration  #Iteration|
|1    FastTreeOva                                  0.8575         0.7943      23.3           1|
|2    LightGbmMulti                                0.8568         0.8002       3.8           2|
|3    FastTreeOva                                  0.8538         0.7843      27.0           3|
|4    FastForestOva                                0.8513         0.7889      26.5           4|
|5    LightGbmMulti                                0.8511         0.7808       6.4           5|
-----------------------------------------------------------------------------------------------

Second training:

-----------------------------------------------------------------------------------------------
|                                   Top 5 models explored                                     |
-----------------------------------------------------------------------------------------------
|     Trainer                               MicroAccuracy  MacroAccuracy  Duration  #Iteration|
|1    FastTreeOva                                  0.9016         0.7241      44.8           1|
|2    FastForestOva                                0.8847         0.6465      42.9           2|
|3    AveragedPerceptronOva                        0.8575         0.6175      13.3           3|
|4    SymbolicSgdLogisticRegressionOva             0.8387         0.4301       7.1           4|
|5    SdcaMaximumEntropyMulti                      0.8331         0.3212       5.1           5|
-----------------------------------------------------------------------------------------------

justinormont commented 4 years ago

This is expected. You'll get different models from run to run.

You can get slightly more deterministic runs by using the AutoML API directly and setting the MLContext seed. This makes the train/validate dataset split deterministic, though non-deterministic elements remain, such as random-number draws in multi-threaded trainers. The non-determinism is then amplified as the model sweep proceeds, because a slightly different accuracy is fed into the SMAC sweeper, a Bayesian-style hyperparameter optimizer.
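For illustration, a minimal sketch of such a seeded run via the AutoML API. The "data.csv" file, the "Label" column name, and the multiclass task are hypothetical placeholders, not details from the original report:

```csharp
// A minimal sketch of a seeded AutoML run, not the exact Model Builder
// internals. "data.csv" and "Label" are hypothetical placeholders.
using System;
using Microsoft.ML;
using Microsoft.ML.AutoML;

class SeededAutoMLSketch
{
    static void Main()
    {
        // A fixed seed makes the train/validate split deterministic;
        // multi-threaded trainers can still vary slightly from run to run.
        var mlContext = new MLContext(seed: 0);

        // Infer column types from the file, then load it.
        var inference = mlContext.Auto().InferColumns("data.csv", labelColumnName: "Label");
        IDataView data = mlContext.Data
            .CreateTextLoader(inference.TextLoaderOptions)
            .Load("data.csv");

        // Sweep models for up to 10 minutes and report the best run.
        var experiment = mlContext.Auto()
            .CreateMulticlassClassificationExperiment(maxExperimentTimeInSeconds: 600);
        var result = experiment.Execute(data, labelColumnName: "Label");

        Console.WriteLine($"Best trainer: {result.BestRun.TrainerName}");
        Console.WriteLine($"MicroAccuracy: {result.BestRun.ValidationMetrics.MicroAccuracy:F4}");
    }
}
```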

justinormont commented 4 years ago

Side note for Model Builder devs: it seems the iteration count is not displaying the model's iteration number, but instead repeating the rank order shown on the left. See #Iteration on the right side of the output above.

It should be displayed as follows, to indicate the order in which each pipeline was tried:

------------------------------------------------------------------------------------------------------------------
|     Trainer                              Accuracy      AUC    AUPRC  F1-score  Duration #Iteration             |
|1    SdcaLogisticRegressionBinary           0.8842   0.9690   0.9714    0.8817       2.9          2             |
|2    SdcaLogisticRegressionBinary           0.8817   0.9493   0.9627    0.8932      15.0         11             |
|3    SgdCalibratedBinary                    0.8810   0.9582   0.9724    0.8889       2.5          9             |
|4    SgdCalibratedBinary                    0.8810   0.9577   0.9720    0.8889       2.4         15             |
|5    AveragedPerceptronBinary               0.8780   0.9540   0.9679    0.8819       3.2          1             |
------------------------------------------------------------------------------------------------------------------

/cc @LittleLittleCloud, @JakeRadMSFT

Balu2 commented 4 years ago

Thanks Justin for your quick answer. When using the AutoML API with a deterministic setting, would the accuracy also be better than via Model Builder?

justinormont commented 4 years ago

@Balu2: When your data is IID, the accuracy should be the same. But if your dataset needs separate train/validate/test datasets, or specific grouping of rows, the API or CLI will give you a better model than Model Builder.

This is most often the case for time-dependent data, where you'd want newer examples in the test dataset than in the training dataset. It also applies when you'd like a split that keeps certain groups only in the test set (e.g. train on data from 100 grocery stores and test on data from another 10 different stores, which tells you how well the model generalizes to data from new/unseen stores).
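As a sketch of the time-dependent case: assuming a hypothetical numeric "Timestamp" column and cutoff value, you can carve out the newer rows yourself with FilterRowsByColumn, which keeps rows whose column value lies in [lowerBound, upperBound):

```csharp
// A minimal sketch of a time-based hold-out split. "data.csv", its schema,
// and the cutoff value are all hypothetical placeholders.
using Microsoft.ML;
using Microsoft.ML.Data;

class TimeSplitSketch
{
    // Hypothetical schema: a numeric timestamp, a label, and one feature.
    class Row
    {
        [LoadColumn(0)] public float Timestamp;
        [LoadColumn(1)] public string Label;
        [LoadColumn(2)] public float Feature1;
    }

    static void Main()
    {
        var mlContext = new MLContext(seed: 0);
        IDataView data = mlContext.Data.LoadFromTextFile<Row>(
            "data.csv", hasHeader: true, separatorChar: ',');

        const double cutoff = 20200101; // e.g. a date encoded as yyyyMMdd

        // Train on older rows, test on newer rows, so the test set measures
        // how well the model generalizes forward in time.
        IDataView trainSet = mlContext.Data.FilterRowsByColumn(
            data, "Timestamp", upperBound: cutoff);
        IDataView testSet = mlContext.Data.FilterRowsByColumn(
            data, "Timestamp", lowerBound: cutoff);
    }
}
```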

With the API or CLI, you can hand it separate datasets for train/valid/test. And with the API you can also hand it a samplingKeyColumn, which ensures that all rows containing the same value in the samplingKeyColumn are kept together in the same split of the dataset (either in train/validate splits or in cross-validation).
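A minimal sketch of the grouping idea, assuming a hypothetical "StoreId" column and the same placeholder dataset as above:

```csharp
// A minimal sketch of group-aware splitting. "StoreId" is a hypothetical
// grouping column: all rows sharing a StoreId value stay in the same split.
using System;
using Microsoft.ML;
using Microsoft.ML.AutoML;

class GroupedSplitSketch
{
    static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        var inference = mlContext.Auto().InferColumns("data.csv", labelColumnName: "Label");
        IDataView data = mlContext.Data
            .CreateTextLoader(inference.TextLoaderOptions)
            .Load("data.csv");

        // Hold out whole stores for the test set: no StoreId appears in both.
        var split = mlContext.Data.TrainTestSplit(
            data, testFraction: 0.2, samplingKeyColumnName: "StoreId");

        // Tell AutoML about the grouping column so its internal
        // train/validate splits (or CV folds) also keep stores together.
        var columnInfo = new ColumnInformation
        {
            LabelColumnName = "Label",
            SamplingKeyColumnName = "StoreId"
        };

        var experiment = mlContext.Auto()
            .CreateMulticlassClassificationExperiment(maxExperimentTimeInSeconds: 600);
        var result = experiment.Execute(split.TrainSet, columnInfo);

        Console.WriteLine($"Best trainer: {result.BestRun.TrainerName}");
    }
}
```

With this setup, metrics computed on split.TestSet estimate performance on entirely unseen stores rather than on unseen rows from already-seen stores.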

More info on leakage: https://en.wikipedia.org/wiki/Leakage_(machine_learning)#Training_example_leakage

Balu2 commented 4 years ago

@justinormont: the data is not IID. Via a pivot view, data from different rows is grouped together into one row. I also created 28 training models, because the defined row has 28 columns that may or may not contain data (null). The complete dataset consists of about 130,000 of these grouped rows, where 128,000 rows are used as the training set and 2,000 rows as the test set.

I will try the API to see what results it gives for the different models and let you know.

antoniovs1029 commented 4 years ago

Hi. Since your original problem of getting different accuracies from Model Builder has already been addressed by @justinormont's responses, I will close this issue. Please feel free to reopen it if you're still having problems in that specific regard.