dotnet / machinelearning-modelbuilder

Simple UI tool to build custom machine learning models.

remove one of `--validation-dataset` and `--test-dataset` in CLI #723

Closed LittleLittleCloud closed 3 years ago

LittleLittleCloud commented 4 years ago

It's really confusing to have both options when only `--validation-dataset` is used during training and `--test-dataset` is only used for codegen. We should keep only one of them.

related issue: #687

justinormont commented 4 years ago

The way to handle this is to print the test-dataset metrics for the one chosen pipeline's model. I'd recommend printing them at the CLI's sweep summary screen.

The validation dataset and test dataset have very specific uses and cannot be combined in AutoML. Three datasets are required for proper use of AutoML (and in ML in general).

Dataset usage:

- Training dataset: used to fit each candidate model during the sweep.
- Validation dataset: used to score the candidates and choose the best pipeline/hyperparameters.
- Test dataset: held out from the sweep entirely; used once to report the metrics of the chosen model.

See: https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets

Additional background: https://github.com/dotnet/machinelearning/issues/5070#issuecomment-621389067
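To make that concrete, here's a minimal sketch of what the summary step could look like with the Microsoft.ML.AutoML experiment API; the `ModelInput` schema, file names, and column names are hypothetical, and the time budget is arbitrary:

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

// Hypothetical schema, purely for illustration.
public class ModelInput
{
    [LoadColumn(0)] public float Feature1 { get; set; }
    [LoadColumn(1)] public float Label { get; set; }
}

public static class SweepSummary
{
    public static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        // Train/validation drive the sweep; the test set is only touched once, at the end.
        IDataView trainData = mlContext.Data.LoadFromTextFile<ModelInput>("train.csv", hasHeader: true, separatorChar: ',');
        IDataView validData = mlContext.Data.LoadFromTextFile<ModelInput>("valid.csv", hasHeader: true, separatorChar: ',');
        IDataView testData  = mlContext.Data.LoadFromTextFile<ModelInput>("test.csv",  hasHeader: true, separatorChar: ',');

        var experiment = mlContext.Auto().CreateRegressionExperiment(maxExperimentTimeInSeconds: 60);
        var result = experiment.Execute(trainData, validData, labelColumnName: "Label");

        // Validation metrics picked the winner; report test metrics for that one model only.
        IDataView testPredictions = result.BestRun.Model.Transform(testData);
        var testMetrics = mlContext.Regression.Evaluate(testPredictions, labelColumnName: "Label");
        Console.WriteLine($"Best pipeline: {result.BestRun.TrainerName}");
        Console.WriteLine($"Test RMSE: {testMetrics.RootMeanSquaredError:F4}");
    }
}
```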

JakeRadMSFT commented 4 years ago

@justinormont @LittleLittleCloud

Can we always create the validation or test datasets from the train dataset if they're not provided ... and then use them as Justin described?

justinormont commented 4 years ago

If the user doesn't provide the datasets, we can assume the data is IID and safe to split. AutoML already does this to create a validation dataset when one is missing (using either CV or a train/validation split).

If a user has a pre-split train/test or train/valid/test, we should use them. Note how many datasets, e.g. on Kaggle, are provided as train/valid/test (or train/test).
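A rough sketch of that fallback, assuming IID data (the 10%/10% fractions are illustrative, not what the CLI necessarily uses, and `ModelInput` is the hypothetical schema from the sketch above):

```csharp
using Microsoft.ML;

var mlContext = new MLContext(seed: 0);
IDataView allData = mlContext.Data.LoadFromTextFile<ModelInput>("data.csv", hasHeader: true, separatorChar: ',');

// Carve off a test set first, then split the remainder into train/validation.
var trainTest  = mlContext.Data.TrainTestSplit(allData, testFraction: 0.1);
var trainValid = mlContext.Data.TrainTestSplit(trainTest.TrainSet, testFraction: 0.1);

IDataView trainData = trainValid.TrainSet;
IDataView validData = trainValid.TestSet;
IDataView testData  = trainTest.TestSet;
```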

-- excessive background continues --

Why does one need train/test or train/valid/test: when creating a dataset, if it is time dependent, you'll want to split it into train/valid/test by keeping the oldest data in the training set, newer data in the validation set, and the newest data in the test set. Otherwise, you are training on future data to predict the past, overestimating how well the model would do in production. As a specific example, when predicting the GitHub tags for dotnet/CoreFx issues, the use of different tags varies over time, and new ones are created. If you randomly split the dataset, the training data already knows the future tags and which files will receive them. When split on time, the model's metrics are representative of how well the model will do when launched in production.
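For illustration, a sketch of a time-ordered split in ML.NET, assuming a hypothetical `IssueRow` schema with a numeric `YearMonth` column; the cut-off values are arbitrary:

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

public class IssueRow
{
    [LoadColumn(0)] public string Title { get; set; }
    [LoadColumn(1)] public string Tag { get; set; }
    [LoadColumn(2)] public float YearMonth { get; set; }  // e.g. 201906 for June 2019
}

public static class TimeSplit
{
    public static void Main()
    {
        var mlContext = new MLContext();
        IDataView issues = mlContext.Data.LoadFromTextFile<IssueRow>("issues.csv", hasHeader: true, separatorChar: ',');

        // Oldest data trains, newer data validates, newest data tests: never train on the future.
        IDataView trainData = mlContext.Data.FilterRowsByColumn(issues, "YearMonth", upperBound: 201906);
        IDataView validData = mlContext.Data.FilterRowsByColumn(issues, "YearMonth", lowerBound: 201906, upperBound: 201912);
        IDataView testData  = mlContext.Data.FilterRowsByColumn(issues, "YearMonth", lowerBound: 201912);
    }
}
```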

Another style of leakage is group leakage, where you need all parts of a grouping to stay within the same split of the dataset. For instance, Andrew Ng's group was classifying chest x-ray images. The paper used random splitting instead of ensuring that all images of a patient were in the same split. The random split put some images of the same patient in the training set and others in the test set. Hence the model partially memorized patients' bone structure, instead of learning to recognize pneumonia in chest x-rays.
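To avoid that kind of group leakage, `TrainTestSplit`'s `samplingKeyColumnName` argument keeps every row that shares a key on the same side of the split. A sketch with a hypothetical `PatientId` column:

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

public class XRayRow
{
    [LoadColumn(0)] public string ImagePath { get; set; }
    [LoadColumn(1)] public string PatientId { get; set; }
    [LoadColumn(2)] public bool HasPneumonia { get; set; }
}

public static class GroupSplit
{
    public static void Main()
    {
        var mlContext = new MLContext(seed: 0);
        IDataView xrays = mlContext.Data.LoadFromTextFile<XRayRow>("xrays.csv", hasHeader: true, separatorChar: ',');

        // All images of a given patient land on the same side of the split,
        // so the model can't memorize a patient during training and be tested on them.
        var split = mlContext.Data.TrainTestSplit(xrays, testFraction: 0.2, samplingKeyColumnName: "PatientId");
        IDataView trainData = split.TrainSet;
        IDataView testData  = split.TestSet;
    }
}
```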

Additional background: https://en.wikipedia.org/wiki/Leakage_(machine_learning) https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets

Side topic/work-item: AutoML.NET should refit the best-found model using all available data before returning it to the user. Model metrics are reported from the splits, but the final returned model uses all of the data. This is done in Azure AutoML.

Related issue: https://github.com/dotnet/machinelearning-automl/issues/361#issuecomment-481834084 (apologies for the private repo link)
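A sketch of what that refit could look like, reusing `result` and the data views from the sketches above (this is a proposal, not current AutoML.NET behavior):

```csharp
using Microsoft.ML;

// `result` is the ExperimentResult from the AutoML sweep sketch above;
// `allData` is an IDataView containing every available row (train + validation + test).
ITransformer finalModel = result.BestRun.Estimator.Fit(allData);

// Metrics reported to the user still come from the held-out splits;
// only the model handed back is refit on all of the data.
mlContext.Model.Save(finalModel, allData.Schema, "model.zip");
```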

rquintino commented 4 years ago

After running a few mlnet AutoML scenarios today on some smaller datasets, where it's hard to get a proper test set, I'm wondering if you could explore the test approach I so far only know as the Dataiku approach :) (I don't know any other origin).

Just do CV again on the final estimator, but with new CV splits. Probably not completely unbiased, but I would say better than just reporting the ranked search/validation results, which really are optimistic.

More info:

"Test: The dataset is split again into K_test random parts, independently from the previous randomization, and with no stratification with respect to the target. The model with the best hyperparameter combination of Step 1 is trained and evaluated on the new test folds in a similar way as previously. The reported performance metrics are averaged across folds."

https://community.dataiku.com/t5/Using-Dataiku-DSS/Nested-cross-validation-and-chosen-best-parameter-from-grid/m-p/2356

thx RQ

PS: a slight clarification, for example in Dataiku this is done per "algorithm" type, trying to reduce the multiple-test/comparison optimism issue (more hyperparameters = more tests = more optimism).
I would also say this applies to IID datasets; non-IID data, as explained above, requires custom validation/test set work.
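In ML.NET terms, that could look roughly like cross-validating only the winning pipeline again on fresh folds with a different seed; this reuses `result` and `allData` from the sketches above and is an approximation of the Dataiku procedure, not something mlnet does today:

```csharp
using System;
using System.Linq;
using Microsoft.ML;

// Re-run CV on the single chosen estimator with fresh folds (different seed than the sweep used).
var cvContext = new MLContext(seed: 123);
var cvResults = cvContext.Regression.CrossValidate(
    allData,                       // the full (small) dataset
    result.BestRun.Estimator,      // only the winning pipeline, not the whole search space
    numberOfFolds: 5,
    labelColumnName: "Label");

double avgRmse = cvResults.Average(fold => fold.Metrics.RootMeanSquaredError);
Console.WriteLine($"Re-validated RMSE (fresh folds): {avgRmse:F4}");
```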

beccamc commented 3 years ago

The `--validation-dataset` option has been removed from the CLI.