dotnet / docs

This repository contains .NET Documentation.
https://learn.microsoft.com/dotnet
Creative Commons Attribution 4.0 International

mlnet auto-train gets to 100% accuracy #17721

Closed martin-plessy closed 4 years ago

martin-plessy commented 4 years ago

After trying out this tutorial using ML.NET CLI, it appears that my model gets to 100% accuracy after 1 iteration:

# MacOS 10.14.6

$ mlnet --version
0.15.28007.4 @BuiltBy: dlab14-DDVSOWINAGE054 @Branch: features/automl @SrcCode: https://github.com/dotnet/machinelearning/tree/dc9a9b7ffcaf636541fe997c59f3bfdda57501e5+dc9a9b7ffcaf636541fe997c59f3bfdda57501e5

$ mlnet auto-train \
    --task "multiclass-classification" \
    --dataset "./Datasets/RestaurantScores.tsv" \
    --label-column-name "RiskCategory" \
    --max-exploration-time 180 \
    --output-path "./Out"

Contents of debug_log.txt:

Inferring Columns ...
Creating Data loader ...
Loading data ...
Exploring multiple ML algorithms and settings to find you the best model for ML task: multiclass-classification
For further learning check: https://aka.ms/mlnet-cli
|     Trainer                              MicroAccuracy  MacroAccuracy  Duration #Iteration                     |
[Source=AutoML, Kind=Trace] Channel started
[Source=AutoML, Kind=Trace] Evaluating pipeline xf=ValueToKeyMapping{ col=RiskCategory:RiskCategory} xf=OneHotEncoding{ col=InspectionType:InspectionType col=ViolationDescription:ViolationDescription} xf=ColumnConcatenating{ col=Features:InspectionType,ViolationDescription} xf=Normalizing{ col=Features:Features} tr=AveragedPerceptronOva{} xf=KeyToValueMapping{ col=PredictedLabel:PredictedLabel} cache=+
[Source=AutoML, Kind=Trace] 1   1   00:00:02.0077380    xf=ValueToKeyMapping{ col=RiskCategory:RiskCategory} xf=OneHotEncoding{ col=InspectionType:InspectionType col=ViolationDescription:ViolationDescription} xf=ColumnConcatenating{ col=Features:InspectionType,ViolationDescription} xf=Normalizing{ col=Features:Features} tr=AveragedPerceptronOva{} xf=KeyToValueMapping{ col=PredictedLabel:PredictedLabel} cache=+
|1    AveragedPerceptronOva                       1.0000         1.0000       2.0          0                     |
Retrieving best pipeline ...

===============================================Experiment Results=================================================
------------------------------------------------------------------------------------------------------------------
|                                                     Summary                                                    |
------------------------------------------------------------------------------------------------------------------
|ML Task: multiclass-classification                                                                              |
|Dataset: RestaurantScores.tsv                                                                                   |
|Label : RiskCategory                                                                                            |
|Total experiment time : 180.76 Secs                                                                             |
|Total number of models explored: 1                                                                              |
------------------------------------------------------------------------------------------------------------------
|                                              Top 1 models explored                                             |
------------------------------------------------------------------------------------------------------------------
|     Trainer                              MicroAccuracy  MacroAccuracy  Duration #Iteration                     |
|1    AveragedPerceptronOva                       1.0000         1.0000       2.0          1                     |
------------------------------------------------------------------------------------------------------------------
Generated trained model for consumption: /.../Out/SampleMulticlassClassification/SampleMulticlassClassification.Model/MLModel.zip
Generated C# code for model consumption: /.../Out/SampleMulticlassClassification/SampleMulticlassClassification.ConsoleApp
Check out log file for more information: /.../Out/SampleMulticlassClassification/logs/debug_log.txt

Results: not really 100%, I'm afraid.

Hi => High Risk [0.3133379, 0.3180585, 0.3686036]
Nothing to do here => High Risk [0.3133379, 0.3180585, 0.3686036]
Rats in the cold chambers ! => High Risk [0.3133379, 0.3180585, 0.3686036]
Dead ferret in the fridge => High Risk [0.3133379, 0.3180585, 0.3686036]
Customer sneezing on the waiter => High Risk [0.3133379, 0.3180585, 0.3686036]
Unapprosed equipment => High Risk [0.3133379, 0.3180585, 0.3686036]
Unapproved equipment => High Risk [0.3133379, 0.3180585, 0.3686036]

Did I miss something ?



luisquintanilla commented 4 years ago

Hi @martin-plessy

Thank you for your question. While it's certainly possible to reach 100% accuracy on the first try, and various factors may contribute to this, it does not mean the model will be correct 100% of the time. In fact, the 100% figure may be misleading.

@justinormont thoughts?

justinormont commented 4 years ago

Generally, if you're getting 100% accuracy, you're either leaking label information into the features, have too small a dataset to measure the metrics usefully, or the task is trivial.

In this case the dataset itself is leaking information or is trivial, depending on how you want to look at it. It has duplicate rows, and the violation_description column perfectly predicts the label, risk_category, because the label is simply the risk category of the given violation.

There are 53,974 rows but only 363 are unique:

$ wc -l RestaurantScores.tsv
   53974
$ sort RestaurantScores.tsv | uniq | wc -l
     363

See more information about leakage: https://en.wikipedia.org/wiki/Leakage_(machine_learning)
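
Both checks above (duplicate rows, and a feature column that fully determines the label) can be sketched in a few lines of Python. This is only an illustration: `SAMPLE_TSV` is a made-up miniature that borrows the column names from the pipeline in the log, not the real file, and the helper names `load_rows`, `count_unique`, and `perfectly_predicts` are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Made-up miniature of the tutorial file (NOT the real data). Column names
# follow the pipeline shown in debug_log.txt.
SAMPLE_TSV = (
    "InspectionType\tViolationDescription\tRiskCategory\n"
    "Routine - Unscheduled\tModerate risk vermin infestation\tModerate Risk\n"
    "Routine - Unscheduled\tModerate risk vermin infestation\tModerate Risk\n"
    "Routine - Unscheduled\tInadequate ventilation or lighting\tLow Risk\n"
    "Routine - Unscheduled\tUnclean hands or improper use of gloves\tHigh Risk\n"
    "Routine - Unscheduled\tInadequate ventilation or lighting\tLow Risk\n"
)


def load_rows(text):
    """Parse tab-separated text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def count_unique(rows):
    """Return (total rows, distinct rows), comparing every column value."""
    distinct = {tuple(sorted(row.items())) for row in rows}
    return len(rows), len(distinct)


def perfectly_predicts(rows, feature, label):
    """True when each value of `feature` co-occurs with exactly one label."""
    labels_seen = defaultdict(set)
    for row in rows:
        labels_seen[row[feature]].add(row[label])
    return all(len(labels) == 1 for labels in labels_seen.values())


rows = load_rows(SAMPLE_TSV)
total, distinct = count_unique(rows)
print(f"{total} rows, {distinct} unique")
for column in ("ViolationDescription", "InspectionType"):
    verdict = perfectly_predicts(rows, column, "RiskCategory")
    print(f"{column} determines RiskCategory: {verdict}")
```

If a single column determines the label on the full dataset, a trainer only needs to memorize that mapping, which is one way to end up with a 1.0000 accuracy that says little about unseen inputs.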

Next steps

I would recommend replacing the dataset. Perhaps with the original dataset (with all columns).

The dataset seems to be SF Restaurant Scores. In that case, I would pose the problem as either regression or multi-class classification.

Problem styles:

For all tasks, I would split the dataset into train/validate/test based on date (inspection_date column), with the oldest data in the train split, newer data in validate, and the most recent in test.

Why split on time? This avoids leaking data between the dataset splits. The dataset is time-dependent: newer rows predict other newer rows better than older rows do. For example, the inspection_score of a restaurant in June is likely closer to its July score than its January score is.
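
A minimal sketch of such a date-based split, in Python. The cutoff dates (2018-01-01 and 2019-01-01) and the function name `split_by_date` are assumptions chosen for illustration, and the date format is inferred from the sample rows in this thread (for example `07/16/2019 12:00:00 AM`). Splitting at fixed cutoffs, rather than at row-count percentages, keeps all rows from the same day in the same split.

```python
from datetime import datetime

# Format inferred from the sample rows, e.g. "07/16/2019 12:00:00 AM".
DATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def split_by_date(rows, validate_start, test_start, date_key="inspection_date"):
    """Oldest rows go to train, newer to validate, most recent to test."""
    train, validate, test = [], [], []
    for row in rows:
        when = datetime.strptime(row[date_key], DATE_FORMAT)
        if when < validate_start:
            train.append(row)
        elif when < test_start:
            validate.append(row)
        else:
            test.append(row)
    return train, validate, test


# A few inspection ids and dates taken from the sample rows in this thread.
sample = [
    {"inspection_id": "101677_20190716", "inspection_date": "07/16/2019 12:00:00 AM"},
    {"inspection_id": "75606_20161129", "inspection_date": "11/29/2016 12:00:00 AM"},
    {"inspection_id": "587_20181203", "inspection_date": "12/03/2018 12:00:00 AM"},
    {"inspection_id": "615_20170608", "inspection_date": "06/08/2017 12:00:00 AM"},
    {"inspection_id": "69708_20190306", "inspection_date": "03/06/2019 12:00:00 AM"},
    {"inspection_id": "2580_20170921", "inspection_date": "09/21/2017 12:00:00 AM"},
]

# Hypothetical cutoffs: train before 2018, validate during 2018, test from 2019.
train, validate, test = split_by_date(
    sample,
    validate_start=datetime(2018, 1, 1),
    test_start=datetime(2019, 1, 1),
)
print(len(train), len(validate), len(test))
```

Each split could then be written to its own file and passed to the trainer separately, so the metrics are computed on rows strictly newer than anything the model saw during training.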

Sample of proposed dataset (with all the columns kept):

Columns: business_id, business_name, business_address, business_city, business_state, business_postal_code, business_latitude, business_longitude, business_location, business_phone_number, inspection_id, inspection_date, inspection_score, inspection_type, violation_id, violation_description, risk_category, Neighborhoods (old), Police Districts, Supervisor Districts, Fire Prevention Districts, Zip Codes, Analysis Neighborhoods

101677 TOASTY 2760 OCTAVIA ST San Francisco CA 94123 +14155527781 101677_20190716 07/16/2019 12:00:00 AM New Ownership
90281 Blue Bottle Coffee 1 Ferry Building #7 San Francisco CA 94111 90281_20170206 02/06/2017 12:00:00 AM New Ownership 90281_20170206_103144 Unapproved or unmaintained equipment or utensils Low Risk
69708 BHUK Burger 11 Phelan Ave San Francisco CA 69708_20190306 03/06/2019 12:00:00 AM Non-inspection site visit
587 NEW TSING TAO RESTAURANT 811 ULLOA St San Francisco CA 94127 37.740654 -122.465389 POINT (-122.465389 37.740654) +14155569559 587_20181203 12/03/2018 12:00:00 AM 70 Routine - Unscheduled 587_20181203_103131 Moderate risk vermin infestation Moderate Risk 40 8 4 1 59 41
615 J & A RESTAURANT 5712 MISSION St San Francisco CA 94112 37.709857 -122.449709 POINT (-122.449709 37.709857) +14155336688 615_20170608 06/08/2017 12:00:00 AM 86 Routine - Unscheduled 615_20170608_103147 Inadequate ventilation or lighting Low Risk 25 7 6 9 28861 28
75606 RS94109 835 Larkin St San Francisco CA 94109 75606_20161129 11/29/2016 12:00:00 AM New Ownership
2580 CAFE MARS 798 BRANNAN St San Francisco CA 94103 37.773174 -122.40311 POINT (-122.40311 37.773174) 2580_20170921 09/21/2017 12:00:00 AM 88 Routine - Unscheduled 2580_20170921_103131 Moderate risk vermin infestation Moderate Risk 34 2 9 14 28853 34
87015 Morty's Delicatessen 280 Golden Gate Ave San Francisco CA 94102 87015_20170630 06/30/2017 12:00:00 AM 77 Routine - Unscheduled 87015_20170630_103119 Inadequate and inaccessible handwashing facilities Moderate Risk
83196 TJ Cups Inc 2437 Noriega St San Francisco CA 94122 +14155682877 83196_20161110 11/10/2016 12:00:00 AM 93 Routine - Unscheduled 83196_20161110_103102 Unclean hands or improper use of gloves High Risk
4171 SOUTH BEACH YACHT CLUB PIER 40 San Francisco CA 94107 37.78162 -122.387677 POINT (-122.387677 37.78162) 4171_20190506 05/06/2019 12:00:00 AM 88 Routine - Unscheduled 4171_20190506_103144 Unapproved or unmaintained equipment or utensils Low Risk 20 2 9 28856 4
64007 Oakes Children's Center 1550 Treat Ave. San Francisco CA 94110 37.745947 -122.412515 POINT (-122.412515 37.745947) +14155648000 64007_20171026 10/26/2017 12:00:00 AM Routine - Unscheduled 64007_20171026_103154 Unclean or degraded floors walls or ceilings Low Risk 2 7 7 2 28859 2
98788 333 Truck Off The Grid San Francisco CA 98788_20190508 05/08/2019 12:00:00 AM Structural Inspection
luisquintanilla commented 4 years ago

Thanks for that explanation @justinormont

@martin-plessy I hope this answers your question