Closed martin-plessy closed 4 years ago
Hi @martin-plessy
Thank you for your question. While it's certainly possible to reach 100% on the first try, and various factors may contribute to this, it does not imply that the model will be correct 100% of the time. In fact, the 100% may be misleading.
@justinormont thoughts?
Generally, if you're getting 100% accuracy, you're either leaking, have too small a dataset to usefully measure the metrics, or the task is trivial.
In this case the dataset itself is leaking information or is trivial, depending on how you want to look at it. It has duplicate rows. The violation_description column perfectly predicts the label, risk_category, as it's the risk category of the given violation.
There are 53,974 rows but only 363 are unique:
$ wc -l RestaurantScores.tsv
53974
$ sort RestaurantScores.tsv | uniq | wc -l
363
See more information about leakage: https://en.wikipedia.org/wiki/Leakage_(machine_learning)
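Both symptoms described above (duplicate rows, and a feature column that fully determines the label) can be checked before training. The following is a minimal sketch using only the Python standard library. The helper names `count_unique_rows` and `column_determines_label` are hypothetical, and the toy rows are made up for illustration; they are not taken from RestaurantScores.tsv.

```python
from collections import defaultdict


def count_unique_rows(rows):
    """Return (total, unique) row counts; each row is a dict of column -> value."""
    unique = {tuple(sorted(row.items())) for row in rows}
    return len(rows), len(unique)


def column_determines_label(rows, feature, label):
    """Return True if every value of `feature` maps to exactly one `label` value."""
    labels_seen = defaultdict(set)
    for row in rows:
        labels_seen[row[feature]].add(row[label])
    return all(len(labels) == 1 for labels in labels_seen.values())


# Toy rows for illustration only (assumption: not the real file contents).
rows = [
    {"violation_description": "Moderate risk vermin infestation", "risk_category": "Moderate Risk"},
    {"violation_description": "Inadequate ventilation or lighting", "risk_category": "Low Risk"},
    {"violation_description": "Moderate risk vermin infestation", "risk_category": "Moderate Risk"},
    {"violation_description": "Unclean hands or improper use of gloves", "risk_category": "High Risk"},
]

total, unique = count_unique_rows(rows)
leaks = column_determines_label(rows, "violation_description", "risk_category")
print(f"{total} rows, {unique} unique, feature determines label: {leaks}")
```

If the second check returns True for a feature you plan to train on, a perfect score is expected and says little about how the model would generalize.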
I would recommend replacing the dataset. Perhaps with the original dataset (with all columns).
The dataset seems to be SF Restaurant Scores (download). In that case, I would pose the problem either as regression or multi-class classification.
Problem styles:
- Regression: predict inspection_score (recommended)
- Multi-class classification: predict violation_description while ignoring risk_category (or the other way around)

For all tasks, I would split the dataset into train/validate/test based on date (the inspection_date column), with the oldest data in the train split, newer data in validate, and the most recent in test.
Why split on time?
This avoids leaking data between the dataset splits. The dataset is time-dependent: newer rows predict other newer rows better than older rows do. For example, the inspection_score of a restaurant in June is likely closer to its July score than to its January score.
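A time-based split like the one described above could be sketched as follows, using only the Python standard library. The function name `split_by_date` and the 70/15/15 proportions are assumptions chosen for illustration; the date format matches the inspection_date values in the sample rows (for example 07/16/2019 12:00:00 AM).

```python
from datetime import datetime

# Matches values such as "07/16/2019 12:00:00 AM".
DATE_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def split_by_date(rows, date_column="inspection_date",
                  train_frac=0.7, validate_frac=0.15):
    """Sort rows oldest-first, then cut into train / validate / test.

    The fractions are illustrative defaults, not a recommendation.
    """
    ordered = sorted(
        rows, key=lambda row: datetime.strptime(row[date_column], DATE_FORMAT)
    )
    train_end = int(len(ordered) * train_frac)
    validate_end = int(len(ordered) * (train_frac + validate_frac))
    return ordered[:train_end], ordered[train_end:validate_end], ordered[validate_end:]


# business_id and inspection_date pairs taken from the sample rows.
rows = [
    {"business_id": "101677", "inspection_date": "07/16/2019 12:00:00 AM"},
    {"business_id": "90281", "inspection_date": "02/06/2017 12:00:00 AM"},
    {"business_id": "69708", "inspection_date": "03/06/2019 12:00:00 AM"},
    {"business_id": "587", "inspection_date": "12/03/2018 12:00:00 AM"},
    {"business_id": "615", "inspection_date": "06/08/2017 12:00:00 AM"},
    {"business_id": "75606", "inspection_date": "11/29/2016 12:00:00 AM"},
    {"business_id": "2580", "inspection_date": "09/21/2017 12:00:00 AM"},
    {"business_id": "87015", "inspection_date": "06/30/2017 12:00:00 AM"},
    {"business_id": "83196", "inspection_date": "11/10/2016 12:00:00 AM"},
    {"business_id": "4171", "inspection_date": "05/06/2019 12:00:00 AM"},
]

train, validate, test = split_by_date(rows)
print(len(train), len(validate), len(test))
```

Rows that share a date on a split boundary can still land in different splits with this approach; cutting on explicit calendar dates instead of row counts would avoid that.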
Sample of proposed dataset (with all the columns kept):

| business_id | business_name | business_address | business_city | business_state | business_postal_code | business_latitude | business_longitude | business_location | business_phone_number | inspection_id | inspection_date | inspection_score | inspection_type | violation_id | violation_description | risk_category | Neighborhoods (old) | Police Districts | Supervisor Districts | Fire Prevention Districts | Zip Codes | Analysis Neighborhoods |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 101677 | TOASTY | 2760 OCTAVIA ST | San Francisco | CA | 94123 | | | | +14155527781 | 101677_20190716 | 07/16/2019 12:00:00 AM | | New Ownership | | | | | | | | | |
| 90281 | Blue Bottle Coffee | 1 Ferry Building #7 | San Francisco | CA | 94111 | | | | | 90281_20170206 | 02/06/2017 12:00:00 AM | | New Ownership | 90281_20170206_103144 | Unapproved or unmaintained equipment or utensils | Low Risk | | | | | | |
| 69708 | BHUK Burger | 11 Phelan Ave | San Francisco | CA | | | | | | 69708_20190306 | 03/06/2019 12:00:00 AM | | Non-inspection site visit | | | | | | | | | |
| 587 | NEW TSING TAO RESTAURANT | 811 ULLOA St | San Francisco | CA | 94127 | 37.740654 | -122.465389 | POINT (-122.465389 37.740654) | +14155569559 | 587_20181203 | 12/03/2018 12:00:00 AM | 70 | Routine - Unscheduled | 587_20181203_103131 | Moderate risk vermin infestation | Moderate Risk | 40 | 8 | 4 | 1 | 59 | 41 |
| 615 | J & A RESTAURANT | 5712 MISSION St | San Francisco | CA | 94112 | 37.709857 | -122.449709 | POINT (-122.449709 37.709857) | +14155336688 | 615_20170608 | 06/08/2017 12:00:00 AM | 86 | Routine - Unscheduled | 615_20170608_103147 | Inadequate ventilation or lighting | Low Risk | 25 | 7 | 6 | 9 | 28861 | 28 |
| 75606 | RS94109 | 835 Larkin St | San Francisco | CA | 94109 | | | | | 75606_20161129 | 11/29/2016 12:00:00 AM | | New Ownership | | | | | | | | | |
| 2580 | CAFE MARS | 798 BRANNAN St | San Francisco | CA | 94103 | 37.773174 | -122.40311 | POINT (-122.40311 37.773174) | | 2580_20170921 | 09/21/2017 12:00:00 AM | 88 | Routine - Unscheduled | 2580_20170921_103131 | Moderate risk vermin infestation | Moderate Risk | 34 | 2 | 9 | 14 | 28853 | 34 |
| 87015 | Morty's Delicatessen | 280 Golden Gate Ave | San Francisco | CA | 94102 | | | | | 87015_20170630 | 06/30/2017 12:00:00 AM | 77 | Routine - Unscheduled | 87015_20170630_103119 | Inadequate and inaccessible handwashing facilities | Moderate Risk | | | | | | |
| 83196 | TJ Cups Inc | 2437 Noriega St | San Francisco | CA | 94122 | | | | +14155682877 | 83196_20161110 | 11/10/2016 12:00:00 AM | 93 | Routine - Unscheduled | 83196_20161110_103102 | Unclean hands or improper use of gloves | High Risk | | | | | | |
| 4171 | SOUTH BEACH YACHT CLUB | PIER 40 | San Francisco | CA | 94107 | 37.78162 | -122.387677 | POINT (-122.387677 37.78162) | | 4171_20190506 | 05/06/2019 12:00:00 AM | 88 | Routine - Unscheduled | 4171_20190506_103144 | Unapproved or unmaintained equipment or utensils | Low Risk | 20 | 2 | 9 | | 28856 | 4 |
| 64007 | Oakes Children's Center | 1550 Treat Ave. | San Francisco | CA | 94110 | 37.745947 | -122.412515 | POINT (-122.412515 37.745947) | +14155648000 | 64007_20171026 | 10/26/2017 12:00:00 AM | | Routine - Unscheduled | 64007_20171026_103154 | Unclean or degraded floors walls or ceilings | Low Risk | 2 | 7 | 7 | 2 | 28859 | 2 |
| 98788 | 333 Truck | Off The Grid | San Francisco | CA | | | | | | 98788_20190508 | 05/08/2019 12:00:00 AM | | Structural Inspection | | | | | | | | | |
Thanks for that explanation @justinormont
@martin-plessy I hope this answers your question
After trying out this tutorial using ML.NET CLI, it appears that my model gets to 100% accuracy after 1 iteration:
Contents of debug_log.txt:
Results: not really 100%, I'm afraid.
Did I miss something?