dotnet / machinelearning-modelbuilder

Simple UI tool to build custom machine learning models.

Training performance is worse with latest version of mlnet CLI/AutoML #154

Closed · tbombach closed 5 years ago

tbombach commented 5 years ago

Describe the bug
Before the latest update of the mlnet CLI, I was able to consistently get 5 trained models in 10 seconds, with the best one at ~93% accuracy for the wiki-detox-250-line-data.tsv dataset. On the latest mlnet CLI (with the latest AutoML), I only get 2 models in the same amount of time, with about 74% accuracy.

To Reproduce
Steps to reproduce the behavior:

  1. Download the wiki-detox-250-line-data dataset
  2. Use the sentiment analysis scenario with 10 seconds as the train time (a sample CLI invocation is sketched below)
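For concreteness, a repro along these lines should work; the `auto-train` verb and flags below are from the preview-era mlnet CLI, and the label column name is an assumption about the dataset's header, so adjust both to your installed version and file:

```
mlnet auto-train --task binary-classification --dataset "wiki-detox-250-line-data.tsv" --label-column-name "Sentiment" --max-exploration-time 10
```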

Expected behavior
At least 5 models should be explored, with a top accuracy of about 93%.

Actual behavior
2 models explored, with a top accuracy of about 74%.

JakeRadMSFT commented 5 years ago

The slowdown on small datasets is due to cross-validation.

I believe @justinormont said it's expected that training might be up to 10x slower, but the metric values mean more.

@justinormont am I remembering correctly? I'll let you add more explanation :)

justinormont commented 5 years ago

Yes, the slower training on small datasets is completely expected with the most recent release.

Previously it was splitting your 250-line training set into training + validation sets and training one model. In your dataset, 25 lines (10%) would be used for the validation set. For small datasets this causes us to mainly measure noise; the outcome, on small datasets, is that the winning pipeline is determined mainly by the noise instead of actual signal.
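As a rough illustration of that previous behavior, here is a minimal ML.NET sketch of a single 90/10 split on this dataset. The column layout and the SDCA text-classification pipeline are assumptions for the example, not what AutoML actually selected:

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

public class SentimentRow
{
    [LoadColumn(0)] public bool Label { get; set; }   // assumed column layout
    [LoadColumn(1)] public string Text { get; set; }
}

public static class SingleSplitDemo
{
    public static void Main()
    {
        var mlContext = new MLContext(seed: 0);
        IDataView data = mlContext.Data.LoadFromTextFile<SentimentRow>(
            "wiki-detox-250-line-data.tsv", hasHeader: true, separatorChar: '\t');

        // Old behavior: one model, validated on a single 10% holdout (~25 rows),
        // so the measured accuracy is dominated by sampling noise.
        var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.1);

        var pipeline = mlContext.Transforms.Text
            .FeaturizeText("Features", nameof(SentimentRow.Text))
            .Append(mlContext.BinaryClassification.Trainers.SdcaLogisticRegression());

        var model = pipeline.Fit(split.TrainSet);
        var metrics = mlContext.BinaryClassification.Evaluate(model.Transform(split.TestSet));
        Console.WriteLine($"Accuracy on ~25 validation rows: {metrics.Accuracy:P1}");
    }
}
```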

In the most recent release, we added automatic cross-validation for small datasets. The cross-validation creates 10 models and we use the average of these 10 models' metrics. The averaging of the cross-validation allows us to measure the pipeline's actual metric with much less noise. The winner, on small datasets, is now chosen with greater certainty.
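Continuing the sketch above, the new behavior corresponds to something like the following: ten folds, one model per fold, and the average metric used for pipeline selection. This is a hedged illustration of the idea, not the actual AutoML code:

```csharp
using System.Linq;  // add to the imports above

// New behavior: 10-fold cross-validation; each fold trains one model, and the
// pipeline's score is the average across folds, which is far less noisy.
var cvResults = mlContext.BinaryClassification.CrossValidate(
    data, pipeline, numberOfFolds: 10);
double meanAccuracy = cvResults.Average(r => r.Metrics.Accuracy);
Console.WriteLine($"Mean accuracy across 10 folds: {meanAccuracy:P1}");
```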

The thresholds of 10k rows and 10 cross-validation folds were chosen from benchmarking across ~70 datasets.

Our current threshold is 10k rows: above that we use train/test; below it we use CV. In the longer term, we should use more than one threshold to set the number of CV folds. In general, more CV folds are needed for high label skew, small datasets, scores near 1.0/0.0, and noisy datasets; fewer folds (or none) for larger, more tranquil datasets.
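To make that decision rule concrete, here is a hypothetical sketch of the single-threshold heuristic described above; the helper name and return shape are mine, and the actual AutoML implementation (or the multi-threshold version) may differ:

```csharp
// Hypothetical helper illustrating the validation strategy described above.
static (bool UseCrossValidation, int Folds) ChooseValidationStrategy(long rowCount)
{
    const long Threshold = 10_000;
    if (rowCount > Threshold)
        return (false, 0);   // large dataset: single train/test split
    return (true, 10);       // small dataset: 10-fold cross-validation
}
```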

You'll also want to increase your training time to explore beyond 5 models. At minimum, enough for 10 models to get through the defaults on each trainer, and I'd aim for enough time to finish around 150 models (perhaps 10 minutes for this dataset).
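With the same preview-era flags assumed in the repro sketch earlier, that would mean raising the exploration time from 10 seconds to roughly 10 minutes:

```
mlnet auto-train --task binary-classification --dataset "wiki-detox-250-line-data.tsv" --label-column-name "Sentiment" --max-exploration-time 600
```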