Closed waldoj closed 10 years ago
All the data is scraped and published on MongoLabs. I'll DM you on Twitter with the details.
We do have a route on the API, based on the LIVES format, that lets you get all the vendors/inspections for a single locality. That is probably the simplest way to get bulk data, given that there are 27,000 vendors in the system. But we could certainly publish the files in a downloadable format.
Just for the sake of completeness, I'm also adding the option to search the API by city, locality, category, or type, so you can get whatever list of vendors suits your needs. I do still have a built-in limit on the number of results returned (you can override it with the limit option), just for performance's sake.
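As a rough sketch of what querying such an API might look like, here's how a client could build a search URL with a limit override. The endpoint path and parameter names below are assumptions based on the description above, not the documented interface:

```python
from urllib.parse import urlencode

# Hypothetical endpoint path; the real route may differ.
BASE = "http://api.openhealthinspection.com/vendors"

def build_query(limit=None, **filters):
    """Build a vendor-search URL from filter parameters (city,
    locality, category, type, ...), optionally overriding the
    default result limit."""
    params = dict(filters)
    if limit is not None:
        params["limit"] = limit
    return BASE + "?" + urlencode(params)

# e.g. all vendors in one locality, raising the default cap
print(build_query(locality="Richmond", limit=5000))
```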
You might consider dumping all records as a big JSON file, or perhaps one JSON file per locality, and regenerating those each time the scraper runs. That would reduce people banging on the API to generate the data. APIs are great for all kinds of things, but it's easier for a lot of people to work with bulk data. I'm afraid that I don't have any Mongo experience, or else I'd file a pull request to accomplish this. :-/
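The grouping step described above is simple in any language; here's a minimal stdlib-Python sketch (no Mongo required once the records are in hand), assuming each vendor record is a dict with a `locality` key. Field names are illustrative:

```python
import json
from collections import defaultdict
from pathlib import Path

def dump_by_locality(vendors, out_dir="bulk"):
    """Write one big JSON file of all records, plus one JSON file per
    locality, into out_dir. Meant to be rerun after each scraper pass
    so the bulk files stay fresh."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    by_locality = defaultdict(list)
    for v in vendors:
        by_locality[v.get("locality", "unknown")].append(v)
    (out / "all.json").write_text(json.dumps(vendors))
    for locality, records in by_locality.items():
        (out / f"{locality}.json").write_text(json.dumps(records))
    return sorted(by_locality)
```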
FWIW, this isn't any kind of immediate need that I have, I just know that it would be useful to some folks. :) No doubt y'all have plenty on your plate, having just launched this great new service and dataset.
Definitely, this could be an awesome feature.
big JSON file, or perhaps one JSON file per locality, and regenerating those each time the scraper runs.
@ttavenner @ryayak1460 @kalmas
I've got this about 90% done. There is a simple command-line tool for dumping a MongoDB to JSON or CSV. I just need to expose it via a URL in the API. Will update when it's available.
It's not pretty but you can now get bulk files at http://api.openhealthinspection.com/bulk/. Currently just in JSON format. Given the multi-dimensional nature of the data (vendors, inspections, violations), publishing the CSV takes a bit more configuration and probably will require multiple files. I have set this up to publish a new file every week after the scraper finishes running.
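One common way to split multi-dimensional data like this across CSV files is one file per entity (vendors, inspections, violations), linked by IDs. A minimal sketch of the flattening step, with illustrative field names (the actual schema may differ):

```python
import csv

def write_csv(path, rows, fields):
    """Write a list of dicts to a CSV file with the given columns."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)

def flatten(vendors):
    """Split nested vendor -> inspection -> violation records into
    three flat row lists, joined by vendor_id / inspection_id."""
    v_rows, i_rows, viol_rows = [], [], []
    for v in vendors:
        v_rows.append({"vendor_id": v["id"], "name": v.get("name", "")})
        for insp in v.get("inspections", []):
            i_rows.append({"inspection_id": insp["id"],
                           "vendor_id": v["id"],
                           "date": insp.get("date", "")})
            for viol in insp.get("violations", []):
                viol_rows.append({"inspection_id": insp["id"],
                                  "code": viol.get("code", ""),
                                  "observation": viol.get("observation", "")})
    return v_rows, i_rows, viol_rows
```

Each list would then go to its own file via `write_csv`, so a consumer can rejoin them with an ordinary join on the ID columns.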
Wonderful! Thanks so much for this great addition.
Thank you for your feedback, @waldoj. Please keep making the rounds with this and give us any input like this; we really appreciate you running with it and supporting our effort. :)
We also might need your connections to Healthspace, since you mentioned you had talked with them before, to bring them to the table if there are any issues between them and us over scraping their site.
I think we are being very mindful, but if anything comes up we'd love to be collaborative and cooperative.
@stanzheng: Tell me if I can be useful! In addition to Healthspace, also very helpful was Gary Hagy, from the Virginia Department of Health. Between Healthspace and VDH, we set up a great little infrastructure, in which they would sync their database with my (MySQL) database automatically a few times each day. But we could never seal the deal, due to FOIA problems.
Interesting. @waldoj, do you foresee us running into the same problems you and Gary ran into?
Or, what was your objective? We'd love to fill the need you were trying to meet. I'm always interested in hearing about useful cases for the project so we can steer it in the right direction and make it as useful as possible to the community at large.
best,
do you foresee us running into the same problems you and Gary ran into?
Oh, no, since you're not FOIAing the data. :) You can see their CSV format at the repository that I set up while waiting for the data feed that never came.
Or, what was your objective? We'd love to fill the need you were trying to meet.
I didn't have one! I simply knew that this is really useful data, and it should be available to people. Period. There were only three specific things that I planned on doing with the data:
I know y'all provide an API, which is really great, but do you also provide (or intend to provide) bulk data? I was about to download the scraper and start running it myself to publish bulk data, but I don't want to bang on Healthspace's servers unnecessarily, nor do I want to replicate any work that you folks are already doing or intend to do!