Open dportnoy opened 9 years ago
Created page for full use case specifications and solution: http://hhs.ddod.us/wiki/Use_Case_11
@cornstein Is this data still of interest? If so, could you articulate the value and use of the data so that it can be structured as a use case? Thanks.
@cornstein, we haven't heard back from you in 7 months and will need to close the use case if we don't get more input on, specifically:
Could you respond by Fri 10/23? Thanks!
@cornstein I was able to get an update recently. I spoke with someone from the CDC Vessel Sanitation Program, which falls under the Division of Emergency and Environmental Health Services. We had a fascinating discussion about the inspection process from beginning to end, from ship construction and plans through to regular departures. They have a small staff, many of whom travel a majority of the time, so their efforts to produce reports and get them online are maxed out. At this time they don't have the resources to create a dataset from the cruise ship inspection and infection data that is currently made available to the public.
I have not spoken with the team at FDA yet.
Thanks @DamonDavis. This is a perfect example of where someone at OpenCDC should help them. It's unfair to put the burden on small programs. But CDC and HHS should centrally identify high opportunity data sets and devote the resources to bringing them into the "open data" age. Happy to talk more about this.
This is one of our big challenges, taking the universe of data and trying to prioritize which data sets require attention and resources. Working on it!
@DamonDavis I know!
Is there any way a member of the public can help them? Web scraping?
It's a great question. I think the challenge is both producing a dataset of the existing data AND giving them an easy data entry platform for adding future data, so the dataset stays updated over time. I'll try to gauge their interest and report back.
Darn! I was so close to getting the data scraped with some Python code until I realized the pagination on the form uses JavaScript, which my current method (https://gist.github.com/marks/0a082fb53475d8fe51aa/148b4434496bc30712395de39175d1bf82c6cb4f) doesn't support :(
So close.
I love that you're trying!!! Thank you tons!
@DamonDavis - no problem. I will find a way :)
+1 @marks Plus this keeps your Mechanize skills sharp!
@DamonDavis @dportnoy - here is some working Python code to get the data: https://gist.github.com/marks/0a082fb53475d8fe51aa
This loops month to month because the ASP form that has to be used to paginate the full results is pretty wonky. I'm on a plane right now with slow internet but will post a full file next week.
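For anyone following along, the month-to-month idea can be sketched roughly like this. It is only a minimal illustration of generating the date windows, not the code in the gist; the function name `month_windows` and the 1990 start year (taken from the inspection date range mentioned in this thread) are assumptions, and the actual form submission is left out.

```python
from datetime import date, timedelta


def month_windows(start_year, end_year):
    """Yield (first_day, last_day) for every month from
    January of start_year through December of end_year."""
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            first = date(year, month, 1)
            # The last day of a month is the day before the first of the next month.
            if month == 12:
                next_first = date(year + 1, 1, 1)
            else:
                next_first = date(year, month + 1, 1)
            yield first, next_first - timedelta(days=1)


# Each window would be submitted to the search form as one query,
# so no single query needs the JavaScript-driven pagination.
windows = list(month_windows(1990, 1991))
```

Querying one small window at a time trades more requests for never having to click through to a second page of results.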
No amount of web scraping is better than having the source provide APIs and exports, but at least this is a first step to allow users to even use the data.
@cornstein - curious, how did your team end up getting the data for projects.propublica.org/cruises ?
All - I have finished a first pass at a program to grab data for the 3,569 inspections since 1990 which appear to be responsible for 88,811 line-item deficiencies. I look forward to digging into the data but I wanted to liberate the raw data as soon as possible.
https://github.com/marks/cdc-cruise-ship-inspections
Now comes the fun part... analysis! Feel free to add ideas to my github repo's issues page.
One more thing... I quickly spun up a visualization of the summary data using a Socrata open data tool* -- live link at https://soda.demo.socrata.com/view/asyc-j2sk and screenshot attached below
* full disclosure: I work for Socrata
@marks Thanks! Looks like you got around the pagination challenge by limiting the search by month, right? Hopefully, that will hold for the future as well.
I'd like to experiment with scheduling a daily process to run on healthdata.gov to update this dataset. Would you mind if we adopt some of your code for it? (BTW, I still need to find out from the program owners how far back the data could be changed, so as to avoid unnecessary load.)
@dportnoy Indeed - as long as there are fewer than 100 inspections per month, which seems to be the trend, it should be fine.
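A scheduled job could guard against that assumption breaking. Here is a rough sketch, assuming the first results page holds at most 100 rows (the figure above); the name `months_at_risk` and the sample counts are made up purely for illustration and are not real inspection data.

```python
PAGE_LIMIT = 100  # assumed cap on rows returned without pagination


def months_at_risk(counts_by_month, limit=PAGE_LIMIT):
    """Return the months whose result count reached the cap,
    meaning some rows may be hidden behind pagination."""
    return sorted(month for month, count in counts_by_month.items()
                  if count >= limit)


# Illustrative counts only (hypothetical values).
sample_counts = {"2014-01": 23, "2014-02": 100, "2014-03": 7}
flagged = months_at_risk(sample_counts)
```

Any flagged month would need to be re-queried with a narrower window (for example, half-months) before trusting its totals.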
You're A-OK to use the code with some attribution (I'll add a license to the repo soon) but I think it would be best to host on data.cdc.gov if it's going to be anywhere official (such as healthdata.gov) as folks go there to look for CDC.gov data.
@marks Sounds good. I'll reach out again to the CDC program owners and get their input on the hosting question. Exciting.
Just curious if you've heard from the CDC program owners. I wonder what other data is "locked" behind query tools that are labor intensive to scrape (and maintain).
@marks Will check into it.
cc: @DamonDavis
The FDA posts cruise ship inspections online (http://wwwn.cdc.gov/InspectionQueryTool/InspectionSearch.aspx) but does not allow the data to be downloaded in a spreadsheet. The same is true for cruise ship outbreak information (http://www.cdc.gov/nceh/vsp/surv/gilist.htm).