Code4HR / open-health-inspection-scraper

Scraper for the open-health-inspector app.
Apache License 2.0

Bad data in a corrective action tag breaks scraper #23

Closed bschoenfeld closed 8 years ago

bschoenfeld commented 10 years ago

This report has some bad data in the sixth item. There is a tag in the correction text that shouldn't be there.

get_violations() in scraper.py breaks on this line when trying to process this violation:

'correction': ' '.join([tag.string for tag in details[1].contents if tag.name == 'font']).strip(),
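
One plausible reading of the failure, sketched without BeautifulSoup itself: `.string` on a BeautifulSoup tag is `None` when the tag has more than one child, which is what a stray nested tag would cause, and `' '.join()` raises `TypeError` on a `None` item. The list `font_strings` and the helper `join_correction` below are hypothetical stand-ins, not code from scraper.py.

```python
# Hypothetical stand-ins for the `tag.string` value of each <font> element.
# The None entry models a <font> tag that contains an unexpected nested tag.
font_strings = ["Wash hands", None, "before handling food."]


def join_correction(strings):
    """Join text fragments, skipping None entries instead of raising."""
    return " ".join(s for s in strings if s is not None).strip()


# The original pattern fails as soon as one entry is None.
failure = None
try:
    " ".join(font_strings)
except TypeError as exc:
    failure = str(exc)

# The tolerant version keeps the text that is available.
correction = join_correction(font_strings)
```

Skipping `None` drops whatever text sat inside the malformed tag, so this only keeps the scraper running. Recovering that text would need a different approach, for example reading the tag's full text rather than `.string`.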

ttavenner commented 10 years ago

That's a tough one, since it is just standard text in between brackets. We will probably need to write a function to find anything that looks like a tag and validate it against a list of approved tags. If it isn't on the list, the function can escape the brackets. I have some code in the scraper to fix specific cases, but we probably need something more flexible.
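
A minimal sketch of that idea, assuming the raw HTML is cleaned up before it is parsed. The names `APPROVED_TAGS`, `TAG_LIKE`, and `escape_unapproved_tags` are hypothetical, and the approved list is only a guess at what the inspection pages use.

```python
import re

# Hypothetical allow-list; the real set would come from the pages being scraped.
APPROVED_TAGS = {"font", "b", "i", "br", "p", "table", "tr", "td"}

# Anything shaped like <name ...> or </name ...>. The name may be empty,
# so stray bracket pairs such as "< 41 >" are also caught.
TAG_LIKE = re.compile(r"<(/?)([^<>\s/]*)([^<>]*)>")


def escape_unapproved_tags(html, approved=APPROVED_TAGS):
    """Escape the brackets of tag-like text whose name is not approved."""

    def fix(match):
        name = match.group(2).lower()
        if name in approved:
            return match.group(0)  # a real tag, leave it alone
        # Not a known tag: keep the inner text, escape the brackets.
        return "&lt;" + match.group(0)[1:-1] + "&gt;"

    return TAG_LIKE.sub(fix, html)


cleaned = escape_unapproved_tags("<font>Keep <see manager> cold</font>")
```

Here `cleaned` keeps both `font` tags and turns `<see manager>` into `&lt;see manager&gt;`, so a parser would treat it as text. A regular expression is a blunt tool for HTML, so this is a pre-filter for bad data and not a general sanitizer.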

ttavenner commented 8 years ago

Marking this as won't fix and closing. Since the page design has completely changed and we are rewriting the scraper from the ground up, this will either be worked out naturally or pop up again in development.