Code4HR / open-health-inspection-scraper

Scraper for the open-health-inspector app.
Apache License 2.0

Implement robust duplicate checking #29

Closed wbprice closed 8 years ago

wbprice commented 9 years ago

Sometimes different versions of a single restaurant appear in the HealthSpace data and filter down into the app. Example:

(Screenshot, 2014-12-04 at 7:22:34 AM)

In this case, the information for http://openhealthinspection.com/#/vendor/yummy-wok-811-brandon is a superset of what's contained in http://openhealthinspection.com/#/vendor/yummy-wok-811-brandon-ave. The name is the same but the address is subtly different.

For this case, a solution might be to check whether the names and addresses are similar, and then compare the inspection objects against each other to check for overlap.
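
A minimal sketch of that check, assuming each vendor is a dict with `name`, `address`, and `inspections` keys and that each inspection carries a `date` (these field names, the 0.8 threshold, and the use of `difflib` are illustrative assumptions, not the scraper's actual schema or method):

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.8):
    """Return True when two strings are close under difflib's ratio."""
    a, b = a.lower().strip(), b.lower().strip()
    return SequenceMatcher(None, a, b).ratio() >= threshold


def likely_duplicates(vendor_a, vendor_b, threshold=0.8):
    """Flag two vendor records as probable duplicates.

    Both the name and the address must be similar, and the two
    records must share at least one inspection date.
    Field names here are hypothetical.
    """
    if not similar(vendor_a["name"], vendor_b["name"], threshold):
        return False
    if not similar(vendor_a["address"], vendor_b["address"], threshold):
        return False
    dates_a = {i["date"] for i in vendor_a["inspections"]}
    dates_b = {i["date"] for i in vendor_b["inspections"]}
    return bool(dates_a & dates_b)
```

With a threshold of 0.8, "811 Brandon" and "811 Brandon Ave" count as similar addresses. The inspection-overlap step is what keeps two genuinely different restaurants with near-identical names and addresses from being merged.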

qwo commented 9 years ago

So I noticed the '811 Brandon' entry looks like the most up-to-date one, and the other one is stale data.

Between April 2014 and October, someone relabeled the entry, and now the API only updates the '811 Brandon' entry.

http://www.healthspace.com/Clients/VDH/Norfolk/Norolk_Website.nsf/Food-FacilityHistory?OpenView&RestrictToCategory=278629BC4934F02D852578240071AC4F

ttavenner commented 8 years ago

Paulo discovered that each vendor has a unique ID which is embedded in the URL. We are now pulling this out into a separate field so duplicate vendors shouldn't be an issue.
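
For illustration only, here is a rough sketch of pulling an ID out of a URL and keying vendors on it. It assumes the ID is the `RestrictToCategory` query parameter, as in the HealthSpace link earlier in this thread. That is a guess, not something confirmed here, and the `url` field and function names are hypothetical:

```python
from urllib.parse import urlparse, parse_qs


def vendor_id_from_url(url, param="RestrictToCategory"):
    """Return the value of `param` in the URL's query string, or None.

    Assumption: the vendor's unique ID lives in this parameter.
    """
    values = parse_qs(urlparse(url).query).get(param)
    return values[0] if values else None


def dedupe_by_id(vendors):
    """Collapse vendor records that share the same extracted ID.

    Later records overwrite earlier ones, so the most recently
    scraped version wins. Records with no ID fall back to their URL.
    """
    unique = {}
    for vendor in vendors:
        vendor_id = vendor_id_from_url(vendor["url"])
        key = vendor_id if vendor_id is not None else vendor["url"]
        unique[key] = vendor
    return list(unique.values())
```

Keying on the ID means a relabeled address no longer produces a second vendor, since both versions of the record resolve to the same key.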