Code4HR / open-health-inspection-scraper

Scraper for the open-health-inspection app.
Apache License 2.0

Publish bulk data? #16

Closed: waldoj closed this issue 10 years ago

waldoj commented 10 years ago

I know y'all provide an API, which is really great, but do you also provide (or intend to provide) bulk data? I was about to download the scraper and start running it myself to publish bulk data, but I don't want to bang on Healthspace's servers unnecessarily, nor do I want to replicate any work that you folks are already doing or intend to do!

qwo commented 10 years ago

All the data is scraped and published on MongoLabs. I'll DM you on Twitter with the details.


ttavenner commented 10 years ago

We do have a route on the API, based on the LIVES format, that lets you get all the vendors/inspections for a single locality. That is probably the simplest way to get bulk data, given that there are 27,000 vendors in the system. But we could certainly publish the files in a downloadable format.
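
As a rough sketch of what pulling one locality's data might look like (the thread doesn't give the route's path or parameter names, so everything below other than the API host is an assumption):

```python
# Hypothetical sketch: fetch all vendors/inspections for one locality
# via the LIVES-based route. The "/vendors" path and the "locality"
# parameter are guesses; check the API docs for the real route.
import requests

resp = requests.get(
    "http://api.openhealthinspection.com/vendors",
    params={"locality": "Norfolk"},
)
resp.raise_for_status()
vendors = resp.json()
print(len(vendors), "vendors returned")
```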

ttavenner commented 10 years ago

Just for the sake of completeness, I'm also adding the option to search the API by city, locality, category, or type, so you can get whatever list of vendors suits your needs. I still have a built-in limit on the number of results returned (you can override it with the limit option), just for performance's sake.
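
A filtered query with the limit overridden might then look like this (again, the parameter names are assumptions based on the description above):

```python
# Hypothetical sketch: search by city and category, raising the
# built-in result limit. All parameter names are assumed.
import requests

resp = requests.get(
    "http://api.openhealthinspection.com/vendors",
    params={"city": "Richmond", "category": "Restaurant", "limit": 5000},
)
resp.raise_for_status()
for vendor in resp.json():
    print(vendor.get("name"))
```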

waldoj commented 10 years ago

You might consider dumping all records as one big JSON file, or perhaps one JSON file per locality, and regenerating those each time the scraper runs. That would cut down on people hammering the API just to reassemble the dataset. APIs are great for all kinds of things, but a lot of people find it easier to work with bulk data. I'm afraid that I don't have any Mongo experience, or else I'd file a pull request to accomplish this. :-/
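
For anyone with Mongo experience who wants to pick this up, a per-locality dump might look roughly like the following; the database, collection, and field names are all assumptions about the scraper's schema:

```python
# Hypothetical sketch: regenerate one JSON file per locality after each
# scraper run. "openhealthinspection", "vendors", and "locality" are
# assumed names, not confirmed by this thread.
import json
import os
from pymongo import MongoClient

db = MongoClient()["openhealthinspection"]
os.makedirs("bulk", exist_ok=True)

for locality in db.vendors.distinct("locality"):
    docs = list(db.vendors.find({"locality": locality}, {"_id": 0}))
    with open(os.path.join("bulk", f"{locality}.json"), "w") as f:
        json.dump(docs, f)
```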

FWIW, this isn't an immediate need of mine; I just know that it would be useful to some folks. :) No doubt y'all have plenty on your plate, having just launched this great new service and dataset.

qwo commented 10 years ago

This could definitely be an awesome feature.

> big JSON file, or perhaps one JSON file per locality, and regenerating those each time the scraper runs.

@ttavenner @ryayak1460 @kalmas

ttavenner commented 10 years ago

I've got this about 90% done. There is a simple command-line tool for dumping a MongoDB collection to JSON or CSV; I just need to expose it via a URL in the API. Will update when it's available.

ttavenner commented 10 years ago

It's not pretty, but you can now get bulk files at http://api.openhealthinspection.com/bulk/. Currently they are just in JSON format. Given the multi-dimensional nature of the data (vendors, inspections, violations), publishing CSV takes a bit more configuration and will probably require multiple files. I have set this up to publish a new file every week, after the scraper finishes running.
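
On the CSV point: since each vendor document presumably nests its inspections, which in turn nest violations, a flat export would likely mean one file per level, joined by IDs. A sketch of that flattening, with all field names assumed:

```python
# Hypothetical sketch: split a nested bulk JSON dump into three linked
# CSV files. The field names are assumptions about the document shape.
import csv
import json

with open("bulk.json") as f:
    vendors = json.load(f)

with open("vendors.csv", "w", newline="") as vf, \
     open("inspections.csv", "w", newline="") as inf, \
     open("violations.csv", "w", newline="") as vlf:
    vendor_csv = csv.writer(vf)
    inspection_csv = csv.writer(inf)
    violation_csv = csv.writer(vlf)
    vendor_csv.writerow(["vendor_id", "name", "locality"])
    inspection_csv.writerow(["vendor_id", "date", "type"])
    violation_csv.writerow(["vendor_id", "date", "code", "observation"])
    for v in vendors:
        vendor_csv.writerow([v.get("id"), v.get("name"), v.get("locality")])
        for insp in v.get("inspections", []):
            inspection_csv.writerow([v.get("id"), insp.get("date"), insp.get("type")])
            for viol in insp.get("violations", []):
                violation_csv.writerow([v.get("id"), insp.get("date"),
                                        viol.get("code"), viol.get("observation")])
```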

waldoj commented 10 years ago

Wonderful! Thanks so much for this great addition.

qwo commented 10 years ago

Thank you for your feedback, @waldoj. Please make your rounds with this and keep sending input like this; we really appreciate you running with it and supporting our effort :).

We also might need your connections to Healthspace, since you mentioned you had talked with them before, to bring them to the table if there are any issues between them and us over scraping their site.

I think we are being very mindful, but if anything happens we'd love to be collaborative and cooperative.

waldoj commented 10 years ago

@stanzheng: Tell me if I can be useful! In addition to Healthspace, Gary Hagy, from the Virginia Department of Health, was also very helpful. Between Healthspace and VDH, we set up a great little arrangement in which they would sync their database with my (MySQL) database automatically a few times each day. But we could never seal the deal, due to FOIA problems.

qwo commented 10 years ago

Interesting. @waldoj, do you foresee us running into the same problems you and Gary ran into?

Also, what was your original objective? We would love to fill the need you were trying to meet. I am always interested in hearing about use cases for the project, so we can steer it in the right direction and make it as useful as possible to the community at large.

Best,

waldoj commented 10 years ago

> Do you foresee us running into the same problems you and Gary ran into?

Oh, no, since you're not FOIAing the data. :) You can see their CSV format at the repository that I set up while waiting for the data feed that never came.

> What was your original objective? We would love to fill the need you were trying to meet.

I didn't have one! I simply knew that this is really useful data, and it should be available to people. Period. There were only three specific things that I planned on doing with the data:

  1. Making it available to Yelp (or any other entity that wanted it).
  2. Creating a website with inspection data, designed to be indexed by Google, to make the data accessible to people.
  3. Devising a rating system to grade each restaurant, using open, practical, defensible metrics that would provide people with a fair, at-a-glance understanding of how a restaurant has performed.
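
None of those three got built in this thread, but as a purely illustrative note on item 3, a simple, open metric might weight critical violations more heavily than non-critical ones. Every number below is invented for illustration:

```python
# Purely illustrative toy grading scheme; not from this project.
# The weights and letter-grade cutoffs are invented.
def grade(critical: int, noncritical: int) -> str:
    score = 100 - 5 * critical - 1 * noncritical
    for cutoff, letter in [(90, "A"), (80, "B"), (70, "C")]:
        if score >= cutoff:
            return letter
    return "F"

print(grade(critical=1, noncritical=3))  # score 92 -> "A"
```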