martin-plessy commented 4 years ago

After trying out this tutorial using ML.NET CLI, it appears that my model gets to 100% accuracy after 1 iteration:

# MacOS 10.14.6

$ mlnet --version
0.15.28007.4 @BuiltBy: dlab14-DDVSOWINAGE054 @Branch: features/automl @SrcCode: https://github.com/dotnet/machinelearning/tree/dc9a9b7ffcaf636541fe997c59f3bfdda57501e5+dc9a9b7ffcaf636541fe997c59f3bfdda57501e5

$ mlnet auto-train \
    --task "multiclass-classification" \
    --dataset "./Datasets/RestaurantScores.tsv" \
    --label-column-name "RiskCategory" \
    --max-exploration-time 180 \
    --output-path "./Out"

Contents of debug_log.txt:

Inferring Columns ...
Creating Data loader ...
Loading data ...
Exploring multiple ML algorithms and settings to find you the best model for ML task: multiclass-classification
For further learning check: https://aka.ms/mlnet-cli
|     Trainer                              MicroAccuracy  MacroAccuracy  Duration #Iteration                     |
[Source=AutoML, Kind=Trace] Channel started
[Source=AutoML, Kind=Trace] Evaluating pipeline xf=ValueToKeyMapping{ col=RiskCategory:RiskCategory} xf=OneHotEncoding{ col=InspectionType:InspectionType col=ViolationDescription:ViolationDescription} xf=ColumnConcatenating{ col=Features:InspectionType,ViolationDescription} xf=Normalizing{ col=Features:Features} tr=AveragedPerceptronOva{} xf=KeyToValueMapping{ col=PredictedLabel:PredictedLabel} cache=+
[Source=AutoML, Kind=Trace] 1   1   00:00:02.0077380    xf=ValueToKeyMapping{ col=RiskCategory:RiskCategory} xf=OneHotEncoding{ col=InspectionType:InspectionType col=ViolationDescription:ViolationDescription} xf=ColumnConcatenating{ col=Features:InspectionType,ViolationDescription} xf=Normalizing{ col=Features:Features} tr=AveragedPerceptronOva{} xf=KeyToValueMapping{ col=PredictedLabel:PredictedLabel} cache=+
|1    AveragedPerceptronOva                       1.0000         1.0000       2.0          0                     |
Retrieving best pipeline ...

===============================================Experiment Results=================================================
------------------------------------------------------------------------------------------------------------------
|                                                     Summary                                                    |
------------------------------------------------------------------------------------------------------------------
|ML Task: multiclass-classification                                                                              |
|Dataset: RestaurantScores.tsv                                                                                   |
|Label : RiskCategory                                                                                            |
|Total experiment time : 180.76 Secs                                                                             |
|Total number of models explored: 1                                                                              |
------------------------------------------------------------------------------------------------------------------
|                                              Top 1 models explored                                             |
------------------------------------------------------------------------------------------------------------------
|     Trainer                              MicroAccuracy  MacroAccuracy  Duration #Iteration                     |
|1    AveragedPerceptronOva                       1.0000         1.0000       2.0          1                     |
------------------------------------------------------------------------------------------------------------------
Generated trained model for consumption: /.../Out/SampleMulticlassClassification/SampleMulticlassClassification.Model/MLModel.zip
Generated C# code for model consumption: /.../Out/SampleMulticlassClassification/SampleMulticlassClassification.ConsoleApp
Check out log file for more information: /.../Out/SampleMulticlassClassification/logs/debug_log.txt

Results: not really 100%, I'm afraid.

Hi => High Risk [0.3133379, 0.3180585, 0.3686036]
Nothing to do here => High Risk [0.3133379, 0.3180585, 0.3686036]
Rats in the cold chambers ! => High Risk [0.3133379, 0.3180585, 0.3686036]
Dead ferret in the fridge => High Risk [0.3133379, 0.3180585, 0.3686036]
Customer sneezing on the waiter => High Risk [0.3133379, 0.3180585, 0.3686036]
Unapprosed equipment => High Risk [0.3133379, 0.3180585, 0.3686036]
Unapproved equipment => High Risk [0.3133379, 0.3180585, 0.3686036]

Did I miss something ?

Document details

⚠ Do not edit this section. It is required for docs.microsoft.com ➟ GitHub issue linking.

ID: f7fff31d-9b11-e1f1-f4c1-d798a7d9b564
Version Independent ID: 8c94de21-2cc2-fb08-314d-6a62af1f0421
Content: Tutorial: Classify health violations with Model Builder - ML.NET
Content Source: docs/machine-learning/tutorials/health-violation-classification-model-builder.md
Product: dotnet-ml
GitHub Login: @luisquintanilla
Microsoft Alias: luquinta

luisquintanilla commented 4 years ago

Hi @martin-plessy

Thank you for your question. While it's certainly possible to reach 100% on the first try, and there are various factors that may contribute to this, it does not imply that the model will be correct 100% of the time and in fact the 100% may be misleading.

@justinormont thoughts?

justinormont commented 4 years ago

Generally, if you're getting 100% accuracy, you're either leaking information, have too small a dataset to usefully measure the metrics, or the task is trivial.

In this case the dataset itself is leaking information or is trivial, depending on how you want to look at it. It has duplicate rows. The violation_description column perfectly predicts the label, risk_category, because the label is simply the risk category of the given violation.
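One way to check whether a single column fully determines the label is sketched below. This is only an illustration, not part of ML.NET: the function names are hypothetical, the column names follow the tutorial's TSV (ViolationDescription, RiskCategory), and the three rows are a tiny hand-written sample rather than the real file.

```python
from collections import defaultdict


def labels_per_feature_value(rows, feature, label):
    """Map each value of `feature` to the set of labels it co-occurs with."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[feature]].add(row[label])
    return dict(seen)


def feature_determines_label(rows, feature, label):
    """Return True when every feature value maps to exactly one label."""
    mapping = labels_per_feature_value(rows, feature, label)
    return all(len(labels) == 1 for labels in mapping.values())


# Illustrative rows only (assumption: same column names as the tutorial dataset).
rows = [
    {"ViolationDescription": "Moderate risk vermin infestation", "RiskCategory": "Moderate Risk"},
    {"ViolationDescription": "Unclean hands or improper use of gloves", "RiskCategory": "High Risk"},
    {"ViolationDescription": "Moderate risk vermin infestation", "RiskCategory": "Moderate Risk"},
]

if feature_determines_label(rows, "ViolationDescription", "RiskCategory"):
    print("ViolationDescription alone determines RiskCategory: the task is trivial.")
```

If this returns True on the full dataset, a lookup table would score as well as any trained model, so the accuracy says little about how the model handles new text.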

There are 53,974 rows but only 363 are unique:

$ wc -l RestaurantScores.tsv
   53974
$ sort RestaurantScores.tsv | uniq | wc -l
     363

See more information about leakage: https://en.wikipedia.org/wiki/Leakage_(machine_learning)
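To see why duplicate rows inflate the metric, here is a rough sketch that randomly splits rows and measures how many test rows also appear verbatim in the training split. The function names, the 20% test fraction, and the toy rows are assumptions made for illustration; AutoML's actual splitting logic may differ.

```python
import random


def deduplicate(rows):
    """Drop exact duplicate rows while keeping the first occurrence order."""
    return list(dict.fromkeys(rows))


def overlap_fraction(rows, test_fraction=0.2, seed=0):
    """Randomly split rows and return the share of test rows also seen in train."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    test_size = round(len(shuffled) * test_fraction)
    train, test = shuffled[:len(shuffled) - test_size], shuffled[len(shuffled) - test_size:]
    if not test:
        return 0.0
    train_set = set(train)
    return sum(1 for row in test if row in train_set) / len(test)


# Toy data (assumption): two distinct rows, each repeated 50 times.
duplicated = [("Routine - Unscheduled", "Moderate risk vermin infestation", "Moderate Risk")] * 50 \
    + [("Routine - Unscheduled", "Unclean hands or improper use of gloves", "High Risk")] * 50

print(len(deduplicate(duplicated)), "unique rows out of", len(duplicated))
print("share of test rows already in train:", overlap_fraction(duplicated))
```

With so few unique rows, nearly every test row has an identical copy in the training split, so the model is scored on rows it has already memorized.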

Next steps

I would recommend replacing the dataset. Perhaps with the original dataset (with all columns).

The dataset seems to be SF Restaurant Scores (download). In that case, I would pose the problem either as regression or multi-class classification.

Problem styles:

Regression -- predict inspection_score (recommended)
Multiclass classification -- predict violation_description while ignoring risk_category (or the other way around)

For all tasks, I would split the dataset into train/validate/test based on date (inspection_date column), with the oldest data in the train split, newer data in validate, and the most recent in test.

Why split on time? This dataset is time-dependent: newer rows predict other newer rows better than older rows do, so a random split would leak information between the dataset splits. For example, the inspection_score of a restaurant in June is likely closer to its July score than to its January score.
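A date-ordered split along these lines could look like the sketch below. It assumes the inspection_date format seen in the original dataset (for example "07/16/2019 12:00:00 AM"); the function name and the 70/15/15 ratios are arbitrary choices for illustration, not a recommendation from the ML.NET docs.

```python
from datetime import datetime

# Assumed format of the inspection_date column, e.g. "07/16/2019 12:00:00 AM".
DATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def split_by_date(rows, date_key="inspection_date", train=0.7, validate=0.15):
    """Sort rows by date, then cut into oldest (train), newer (validate), newest (test)."""
    ordered = sorted(rows, key=lambda row: datetime.strptime(row[date_key], DATE_FORMAT))
    train_end = int(len(ordered) * train)
    validate_end = int(len(ordered) * (train + validate))
    return ordered[:train_end], ordered[train_end:validate_end], ordered[validate_end:]


# A few dates in the dataset's format, deliberately out of order.
rows = [
    {"inspection_date": "07/16/2019 12:00:00 AM"},
    {"inspection_date": "11/10/2016 12:00:00 AM"},
    {"inspection_date": "12/03/2018 12:00:00 AM"},
    {"inspection_date": "06/08/2017 12:00:00 AM"},
    {"inspection_date": "02/06/2017 12:00:00 AM"},
    {"inspection_date": "09/21/2017 12:00:00 AM"},
    {"inspection_date": "05/06/2019 12:00:00 AM"},
    {"inspection_date": "11/29/2016 12:00:00 AM"},
    {"inspection_date": "10/26/2017 12:00:00 AM"},
    {"inspection_date": "03/06/2019 12:00:00 AM"},
]

train_rows, validate_rows, test_rows = split_by_date(rows)
print(len(train_rows), len(validate_rows), len(test_rows))
```

Note that this cuts by row count, so inspections sharing one date can straddle a boundary; cutting at explicit cutoff dates instead would avoid that.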

Sample of proposed dataset (with all the columns kept):

business_id	business_name	business_address	business_city	business_state	business_postal_code	business_latitude	business_longitude	business_location	business_phone_number	inspection_id	inspection_date	inspection_score	inspection_type	violation_id	violation_description	risk_category	Neighborhoods (old)	Police Districts	Supervisor Districts	Fire Prevention Districts	Zip Codes
101677	TOASTY	2760 OCTAVIA ST	San Francisco	CA	94123				+14155527781	101677_20190716	07/16/2019 12:00:00 AM		New Ownership
90281	Blue Bottle Coffee	1 Ferry Building `#7`	San Francisco	CA	94111					90281_20170206	02/06/2017 12:00:00 AM		New Ownership	90281_20170206_103144	Unapproved or unmaintained equipment or utensils	Low Risk
69708	BHUK Burger	11 Phelan Ave	San Francisco	CA						69708_20190306	03/06/2019 12:00:00 AM		Non-inspection site visit
587	NEW TSING TAO RESTAURANT	811 ULLOA St	San Francisco	CA	94127	37.740654	-122.465389	POINT (-122.465389 37.740654)	+14155569559	587_20181203	12/03/2018 12:00:00 AM	70	Routine - Unscheduled	587_20181203_103131	Moderate risk vermin infestation	Moderate Risk	40	8	4	1	59	41
615	J & A RESTAURANT	5712 MISSION St	San Francisco	CA	94112	37.709857	-122.449709	POINT (-122.449709 37.709857)	+14155336688	615_20170608	06/08/2017 12:00:00 AM	86	Routine - Unscheduled	615_20170608_103147	Inadequate ventilation or lighting	Low Risk	25	7	6	9	28861	28
75606	RS94109	835 Larkin St	San Francisco	CA	94109					75606_20161129	11/29/2016 12:00:00 AM		New Ownership
2580	CAFE MARS	798 BRANNAN St	San Francisco	CA	94103	37.773174	-122.40311	POINT (-122.40311 37.773174)		2580_20170921	09/21/2017 12:00:00 AM	88	Routine - Unscheduled	2580_20170921_103131	Moderate risk vermin infestation	Moderate Risk	34	2	9	14	28853	34
87015	Morty's Delicatessen	280 Golden Gate Ave	San Francisco	CA	94102					87015_20170630	06/30/2017 12:00:00 AM	77	Routine - Unscheduled	87015_20170630_103119	Inadequate and inaccessible handwashing facilities	Moderate Risk
83196	TJ Cups Inc	2437 Noriega St	San Francisco	CA	94122				+14155682877	83196_20161110	11/10/2016 12:00:00 AM	93	Routine - Unscheduled	83196_20161110_103102	Unclean hands or improper use of gloves	High Risk
4171	SOUTH BEACH YACHT CLUB	PIER 40	San Francisco	CA	94107	37.78162	-122.387677	POINT (-122.387677 37.78162)		4171_20190506	05/06/2019 12:00:00 AM	88	Routine - Unscheduled	4171_20190506_103144	Unapproved or unmaintained equipment or utensils	Low Risk	20	2	9		28856	4
64007	Oakes Children's Center	1550 Treat Ave.	San Francisco	CA	94110	37.745947	-122.412515	POINT (-122.412515 37.745947)	+14155648000	64007_20171026	10/26/2017 12:00:00 AM		Routine - Unscheduled	64007_20171026_103154	Unclean or degraded floors walls or ceilings	Low Risk	2	7	7	2	28859	2
98788	333 Truck	Off The Grid	San Francisco	CA						98788_20190508	05/08/2019 12:00:00 AM		Structural Inspection

luisquintanilla commented 4 years ago

Thanks for that explanation @justinormont

@martin-plessy I hope this answers your question

dotnet / docs

mlnet auto-train gets to 100% accuracy #17721
