demand-driven-open-data / ddod-intake

"DDOD Intake" tracks DDOD Use Cases using GitHub issues. View the main DDOD site at http://ddod.us

CDC - Cruise Ship Outbreaks #11

Open dportnoy opened 9 years ago

cornstein commented 9 years ago

The FDA posts cruise ship inspections online (http://wwwn.cdc.gov/InspectionQueryTool/InspectionSearch.aspx) but does not allow the data to be downloaded in a spreadsheet. The same is true for cruise ship outbreak information (http://www.cdc.gov/nceh/vsp/surv/gilist.htm).

dportnoy commented 9 years ago

Created page for full use case specifications and solution: http://hhs.ddod.us/wiki/Use_Case_11

betshsu commented 9 years ago

@cornstein Is this data still of interest? If so, could you articulate the value and use of the data so that it can be structured as a use case? Thanks.

dportnoy commented 8 years ago

@cornstein, we haven't heard back from you in 7 months and will need to close the use case if we don't get more input on, specifically:

Could you respond by Fri 10/23? Thanks!

DamonDavis commented 8 years ago

@cornstein I was able to get an update recently. I spoke with someone from the CDC Vessel Sanitation Program, which falls under the Division of Emergency and Environmental Health Services. We had a fascinating discussion about the inspection process, covering everything from ship construction plans to regular departures. They have a small staff, many of whom travel a majority of the time, so their efforts to produce reports and get them online are maxed out. At this time they don't have the resources to create a dataset from the cruise ship inspection and infection data that is currently made available to the public.

I have not spoken with the team at FDA yet.

cornstein commented 8 years ago

Thanks @DamonDavis. This is a perfect example of where someone at OpenCDC should help them. It's unfair to put the burden on small programs. But CDC and HHS should centrally identify high opportunity data sets and devote the resources to bringing them into the "open data" age. Happy to talk more about this.

DamonDavis commented 8 years ago

This is one of our big challenges: taking the universe of data and trying to prioritize which data sets require attention and resources. Working on it!

cornstein commented 8 years ago

@DamonDavis I know!

marks commented 8 years ago

Is there any way a member of the public can help them? Web scraping?

DamonDavis commented 8 years ago

It's a great question. I think the challenge is both producing a dataset of the existing data AND giving them an easy data entry platform for adding future data, so the dataset stays updated over time. I'll try to gauge their interest and report back.

marks commented 8 years ago

Darn! I was so close to getting the data scraped with some Python code until I realized the pagination on the form uses JavaScript, which my current method (https://gist.github.com/marks/0a082fb53475d8fe51aa/148b4434496bc30712395de39175d1bf82c6cb4f) doesn't support :(

So close.

DamonDavis commented 8 years ago

I love that you're trying!!! Thank you tons!

marks commented 8 years ago

@DamonDavis - no problem. I will find a way :)

dportnoy commented 8 years ago

+1 @marks Plus this keeps your Mechanize skills sharp!

marks commented 8 years ago

@DamonDavis @dportnoy - here is some working Python code to get the data: https://gist.github.com/marks/0a082fb53475d8fe51aa

This loops month to month because the ASP form that has to be used to paginate the full results is pretty wonky. I'm on a plane right now with slow internet but will post a full file next week.
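As an illustration of the month-to-month idea (a minimal sketch, not the code in the gist), one way to generate the per-month date ranges is shown below. The function name `month_windows` and the choice to pass `datetime.date` bounds are assumptions made for this example; how each range is actually submitted to the ASP form is not shown.

```python
from datetime import date, timedelta


def month_windows(start, end):
    """Yield (first_day, last_day) for each calendar month from start to end, inclusive.

    Hypothetical helper for illustration only; each pair could be used as the
    date range of one search so that no single query needs pagination.
    """
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        first = date(year, month, 1)
        # The first day of the following month, minus one day, is the last day of this month.
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        last = date(next_year, next_month, 1) - timedelta(days=1)
        yield first, last
        year, month = next_year, next_month


if __name__ == "__main__":
    for first, last in month_windows(date(1990, 11, 15), date(1991, 2, 1)):
        print(first.isoformat(), last.isoformat())
```

Querying one month at a time trades more requests for simpler logic, since each request only ever needs the first page of results.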

No amount of web scraping is better than having the source provide APIs and exports, but at least this is a first step to allow users to even use the data.

@cornstein - curious, how did your team end up getting the data for projects.propublica.org/cruises ?

marks commented 8 years ago

All - I have finished a first pass at a program to grab data for the 3,569 inspections since 1990 which appear to be responsible for 88,811 line-item deficiencies. I look forward to digging into the data but I wanted to liberate the raw data as soon as possible.

https://github.com/marks/cdc-cruise-ship-inspections

Now comes the fun part... analysis! Feel free to add ideas to my github repo's issues page.

marks commented 8 years ago

One more thing... I quickly spun up a visualization of the summary data using a Socrata open data tool* -- live link at https://soda.demo.socrata.com/view/asyc-j2sk and screenshot attached below

* full disclosure: I work for Socrata

[Screenshot: CDC VSP cruise ship inspections summary file in Socrata]

dportnoy commented 8 years ago

@marks Thanks! Looks like you got around the pagination challenge by limiting the search by month, right? Hopefully, that will hold for the future as well.

I'd like to experiment with scheduling a daily process to run on healthdata.gov to update this dataset. Would you mind if we adopt some of your code for it? (BTW, I still need to find out from the program owners how far back the data could be changed, so as to avoid unnecessary load.)

marks commented 8 years ago

@dportnoy Indeed. As long as there are fewer than 100 inspections per month, which seems to be the trend, it should be fine.

You're A-OK to use the code with some attribution (I'll add a license to the repo soon), but I think it would be best to host on data.cdc.gov if it's going to be anywhere official (such as healthdata.gov), as folks go there to look for CDC.gov data.
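A scheduled job could guard against that assumption breaking. The sketch below flags any month whose scraped inspection count reaches the limit, which would suggest results were cut off. The limit of 100 comes from the comment above; the function name `months_at_risk` and the "YYYY-MM" string format are assumptions made for this example.

```python
from collections import Counter

PAGE_LIMIT = 100  # assumed per-query cap, taken from the comment above


def months_at_risk(inspection_months, limit=PAGE_LIMIT):
    """Return the months (as "YYYY-MM" strings) with at least `limit` inspections.

    Hypothetical check for illustration: a month at or over the limit may have
    been truncated by the search form and would need a narrower date range.
    """
    counts = Counter(inspection_months)
    return sorted(month for month, n in counts.items() if n >= limit)


if __name__ == "__main__":
    sample = ["2015-01"] * 100 + ["2015-02"] * 99
    print(months_at_risk(sample))
```

Any month this returns could be re-queried in smaller windows, for example week by week.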

dportnoy commented 8 years ago

@marks Sounds good. I'll reach out again to the CDC program owners and get their input on the hosting question. Exciting.

marks commented 8 years ago

Just curious if you've heard from the CDC program owners. I wonder what other data is "locked" behind query tools that are labor intensive to scrape (and maintain).

dportnoy commented 8 years ago

@marks Will check into it.

cc: @DamonDavis