dportnoy opened this issue 8 years ago
@marks, @ftrotter, @cornstein, @BAP-Jeff, @loranstefani, @Jordanrau, @HeatherBrotsos, @lloydbrodsky, Since you're all past contributors, I wanted to see if any of you would take the challenge. Feel free to pass it on to your colleagues as well.
In case it's helpful, I started working on the code, but just haven't had enough time to get to it lately. Feel free to use. You can find it here: https://github.com/dportnoy/health-insurance-marketplace-analytics
@dportnoy - looks like you got pretty far! Would you mind documenting what it does so far and what areas need attention? It would definitely help so others don't accidentally reinvent the wheel.
@marks You give me entirely too much credit. But point taken. I'll write up a summary and post it.
Translation from the JSON schema into tabular format should be straightforward...
I put something together really quickly that takes insurance providers' `index.json` URLs and flattens the data from the `formulary_urls` array.
Sample dataset: https://healthdata.demo.socrata.com/view/979j-m4qb Code: https://github.com/marks/health-insurance-marketplace-analytics/commit/3507b4738fbb1ba75126e474bd6c24f0baf9d133
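For anyone who wants the gist without reading the commit, here is a minimal sketch of the approach: one row per formulary URL per issuer `index.json`. It is an illustration, not the actual script, and the index URL shown is a hypothetical placeholder:

```python
import csv
import requests

def flatten_index(index_url, writer):
    """Fetch an issuer's index.json and write one row per formulary URL."""
    doc = requests.get(index_url, timeout=30).json()
    for url in doc.get("formulary_urls", []):
        writer.writerow({"index_url": index_url, "formulary_url": url})

with open("formulary_urls_flat.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["index_url", "formulary_url"])
    writer.writeheader()
    # Hypothetical issuer index URL; real ones come from the machine-readable PUF.
    flatten_index("https://example-issuer.com/cms-data/index.json", writer)
```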
Really interesting output, but I wonder if there's a way to make this a bit more useful. Perhaps we can have a brief chat?
@cornstein - definitely... That's just a quick visualization of the data. You can export it (all or a slice) or use the API as well, of course.
Kind of a hectic day, but happy to jump on a call tomorrow or continue to chat here if you're interested in a specific view.
Here are some options for the end product, along with their uses. They are listed in increasing order of difficulty:
I've made a bit more progress. The latest code is available at https://github.com/marks/health-insurance-marketplace-analytics/blob/master/flattener/flatten_from_index.py
I have code that starts* to flatten all three file types: `drugs.json`, `plans.json`, and `providers.json`. I have only run `providers.json` and `drugs.json` on a few `index.json`s as it can take a while to process.
The flattened data can be explored/downloaded/called-with-an-API from the following links:
| File Type Flattened | Data Explorer (you can export from here too) | Data Grid (you can do roll-ups and filtering here) | API Docs |
|---|---|---|---|
| plans.json | https://healthdata.demo.socrata.com/view/yjgn-upfm | https://healthdata.demo.socrata.com/d/x6bt-9iym | https://dev.socrata.com/foundry/healthdata.demo.socrata.com/55ch-w329 |
| providers.json | https://healthdata.demo.socrata.com/view/pqgg-pgsy | https://healthdata.demo.socrata.com/d/axbq-xnwy | https://dev.socrata.com/foundry/healthdata.demo.socrata.com/5ht2-73gj |
| drugs.json | https://healthdata.demo.socrata.com/view/979j-m4qb | https://healthdata.demo.socrata.com/d/jaa8-k3k2 | https://dev.socrata.com/foundry/healthdata.demo.socrata.com/6ngr-bz3d |
*Not all data is flattened at this time, to decrease the number of rows. For example, `languages` in the provider file are serialized as a JSON array, as are `addresses` and `specialties`.
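To illustrate that design choice, here is a hedged sketch of flattening a single provider record while keeping the array fields as serialized JSON strings. The field names loosely follow the QHP providers.json schema and should be treated as assumptions; this is not the actual flattener code:

```python
import json

def flatten_provider(provider):
    """Flatten one providers.json record into a single row, serializing
    array fields as JSON strings instead of exploding them into many rows."""
    return {
        "npi": provider.get("npi"),
        "type": provider.get("type"),
        # Arrays stay as JSON text to keep one row per provider.
        "languages": json.dumps(provider.get("languages", [])),
        "addresses": json.dumps(provider.get("addresses", [])),
        "specialties": json.dumps(provider.get("specialty", [])),
    }

example = {"npi": "1234567890", "type": "INDIVIDUAL",
           "languages": ["English", "Spanish"], "specialty": ["Cardiology"]}
print(flatten_provider(example))
```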
@BAP-Jeff, hi again. Welcome to use case #56! To answer your question from issue #52... Having a ready-to-go dataset would be ideal for the Bayes Hack codathon next weekend. It's identified as the central dataset for building consumer healthcare apps. You'll see a comment above that identifies 4 possible file types: https://github.com/demand-driven-open-data/ddod-intake/issues/56#issuecomment-210875780. It's good to have a few people working on this, because some aspects (such as file sizes and cleanup for analytics) are challenging. Having the code would be a useful resource for the community to refresh the data in the future.
Hello All,
As David knows, I have been playing with the drugs.json set for a while as I build some commercial applications for the formulary space. There are a ton of challenges in using the data as supplied, but overall it is a good first-year effort by the agency.
If I understand the challenges David laid out, I have been pursuing one of those options: (1) I pull the list of URLs from the Machine Readable PUF and retrieve the drugs.json files for each, then (2) I run a second script to rip apart the JSON documents into a flat pipe-delimited file.
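For step (2), the flattening boils down to walking each drug's plans array and emitting one row per drug-plan pair. Here is an assumed reconstruction, not Jeff's actual script; the field names come from the QHP drugs.json schema:

```python
import json

FIELDS = ["plan_id", "drug_tier", "rxnorm_id", "drug_name",
          "prior_authorization", "step_therapy", "quantity_limit"]

def rows_from_drugs_json(path):
    """Yield one flat row per (drug, plan) pair from a drugs.json file."""
    with open(path) as f:
        drugs = json.load(f)
    for drug in drugs:
        for plan in drug.get("plans", []):
            yield [str(plan.get("plan_id", "")), str(plan.get("drug_tier", "")),
                   str(drug.get("rxnorm_id", "")), str(drug.get("drug_name", "")),
                   str(plan.get("prior_authorization", "")),
                   str(plan.get("step_therapy", "")),
                   str(plan.get("quantity_limit", ""))]

with open("drugs_flat.pip", "w") as out:
    out.write("|".join(FIELDS) + "\n")
    for row in rows_from_drugs_json("drugs.json"):  # hypothetical local file
        out.write("|".join(row) + "\n")
```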
I am happy to share the code and/or the end result (warning: around 35 to 40 million records). What do we want for the hack-a-thon? I have to admit I have not participated in one before, so I am not sure if people are only interested in the end result (uploading the dataset to, say, Socrata or Kaggle) or whether they want to understand the process so they can roll their own.
Let me know which way the group wants to go.
Jeff
On Sat, Apr 16, 2016 at 11:46 PM, David X Portnoy notifications@github.com wrote:
@BAP-Jeff https://github.com/BAP-Jeff, hi again. Welcome to use case #56 https://github.com/demand-driven-open-data/ddod-intake/issues/56! To answer your question from issue #52 https://github.com/demand-driven-open-data/ddod-intake/issues/52... Having a ready to go dataset would be ideal for the Bayes Hack codathon for next weekend. It's identified as the central dataset for building consumer healthcare apps. You'll see a comment above that identifies 4 possible file types: #56 (comment) https://github.com/demand-driven-open-data/ddod-intake/issues/56#issuecomment-210875780. It's good to have a few people working on this, because some of the aspects (such as file sizes and cleanup for analytics) are challenging. Having the code would be a useful resource for the community to refresh the data in the future.
— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/demand-driven-open-data/ddod-intake/issues/56#issuecomment-210948520
@BAP-Jeff - awesome. I think both the data and the provenance (how you got it, including any scripts/documentation) would be much appreciated by participants as well as the community in general. Would be happy to look into hosting it on Socrata if you can provide the raw data.
Mark
Okay, thanks. I will put together some things tonight. Much of my time working with this dataset is spent dealing with the data discrepancies (i.e. multiple statuses on a drug for a plan, use of NDCs vs. RxCUIs, missing restrictions, bogus URLs), but showing off the millions of records that are produced will be cool. Plus, hopefully making it more accessible will get more eyes on it... and we can get those things corrected.
Here is the "process" I would propose that we show off (for the drugs.json branch):
David, do you need a "deck" or some sort of visual to present? Help me out if I am going astray...
Jeff
I don't want to make more work for you, @BAP-Jeff, but I'd be very interested in what your CSVs look like. I am running some code on the `index.json`s I found in a previously shared xlsx to flatten all three file types (but they may not have all the important info... I am not an SME in this area).
I actually use pipe delimiting but can easily change it out. I am attaching an example source json file and what I am outputting (tried to find a small one). It is very close to what you had shown off before Mark. I can easily add some more columns to this from the Plan Attributes PUF file to be more complete (IssuerID, Region, etc), but sounded like you wanted to see what the current state was. I also included the python script I used to parse the file.
The format of the output is:
@BAP-Jeff - Yup! Looks quite similar. I would think it would be very helpful to folks to include info from the Plan Attributes PUF. Will stand by for files so that we can get them on Socrata* (and Kaggle and other platforms) for dissemination/exploration/API access.
*Full disclosure: I work for Socrata and we're more than happy to help folks download the data in a number of formats. Can accept pipe-delimited, no big deal.
Okay I will be gone most of the day but will get on this later today.
@BAP-Jeff, great to have you contribute! Let's see if I could answer most of your questions...
David,
I think I can take on all the above but may need a little help with the Deck. Do you have a template you want to use - Powerpoint, Google, ??
I can produce the files tonight/tomorrow and upload them probably tomorrow.
@dportnoy, give me the details on the Kaggle stuff. @marks, let me know where to drop the files for Socrata.
David, did you want to reserve any slides for ..."Issues we found in the data" or does that diminish the good vibes?
Jeff
Jeff- if I can just get delimited files, I can take care of loading it for ya. Sounds like kaggle is top priority for this event though ;)
Okay I will loop back around to you as time is short.
@BAP-Jeff Google slides. No specific template. Just the white default.
For Kaggle, the upload will be done manually by their staff (since they're still working on making that functionality self-serve), so we need a staging location before the data gets copied there.
Okay, I have the file ready to go. It is about 5 GB uncompressed (31 million records). The file has the following table def:
```sql
create table Hackathon (
  StateCode varchar(2),
  IssuerID char(5),
  StandardComponentID varchar(14),
  MarketCoverage varchar(18),
  PlanMarketingName varchar(142),
  HIOSProductID varchar(10),
  ServiceAreaID varchar(6),
  FormularyID varchar(6),
  PlanType varchar(9),
  MetalLevel varchar(12),
  DrugStatus varchar(100),
  RxCUI varchar(15),
  DrugName varchar(2000),
  PAStatus char(1),
  STStatus char(1),
  QLStatus char(1)
)
```
I pulled from Plan Attributes what I thought was the key information. Note: PlanID in the Formulary Files is not the same as PlanID in Plan Attributes; the analogous column is StandardComponentID. Furthermore, I ignored the CSR variations in Plan Attributes and joined the distinct records from Plan Attributes to the Formulary Files. If this didn't make sense... well, I guess trust me...
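For readers following along, a hedged sketch of that join, assuming pandas and hypothetical local file names:

```python
import pandas as pd

# Hypothetical local extracts: the Plan Attributes PUF as CSV and the
# pipe-delimited formulary rows produced by the drugs.json parser.
plan_attrs = pd.read_csv("plan_attributes_puf.csv", dtype=str)
formulary = pd.read_csv("formulary_flat.pip", sep="|", dtype=str)

# The PlanID carried in the formulary files corresponds to
# StandardComponentID in Plan Attributes, so rename before joining.
formulary = formulary.rename(columns={"plan_id": "StandardComponentID"})

# Ignore CSR variations by keeping one distinct row per StandardComponentID.
keep = ["StandardComponentID", "StateCode", "IssuerID", "MarketCoverage",
        "PlanMarketingName", "HIOSProductID", "ServiceAreaID",
        "FormularyID", "PlanType", "MetalLevel"]
plan_attrs = plan_attrs[keep].drop_duplicates("StandardComponentID")

merged = formulary.merge(plan_attrs, on="StandardComponentID", how="left")
merged.to_csv("hackathon_drugs.pip", sep="|", index=False)
```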
I have added a header record to the file and am now waiting for the zipping to finish up. I assume it will get down to the 2 GB range. I can throw it up into an S3 bucket and share it, I guess, or GDrive. Let me know. Could be a few days on the slides...
Jeff
@BAP-Jeff, definitely throw it up on either S3 or gDrive, whichever works best! Ideally, you could also provide a smaller file that has a more manageable sample subset of this data, so that it can be used to test code.
@ftrotter, wanted to ping you and see if you're still working on any aspect of this.
Actually it zipped down to 250 megs, so it's not such a monster. I also did a sample file for all of Humana in Georgia.
@BAP-Jeff - I think you missed a pipe in the header row
Thank goodness someone is QAing this! Thanks @marks. Be right back.
DRUGS file sample: https://drive.google.com/file/d/0B9yZheZrBn54RHk1SmszenNnYTg/ DRUGS file complete: https://drive.google.com/file/d/0B9yZheZrBn54MGplUW5BeVRhMDQ/
@marks apologies for the delayed reply on your questions in https://github.com/demand-driven-open-data/ddod-intake/issues/56#issuecomment-210890927.
Unfortunately, I can't get your links to work. I'm getting a message "Sign in with your Socrata ID". But when I register a new ID, I get "The current user doesn't have access to this view". For the API links, I'm getting a "Private Dataset" message.
@BAP-Jeff, it rocks!
DRUGS file complete: Hackathon.zip unzips (using `tar xzvf Hackathon.zip`).
DRUGS file sample: HackathonSample.zip unzips (using `tar xzvf HackathonSample.zip`).
I verified the columns for the sample file align and match the header you provided. So we might be good to go with the drug formulary files!
@dportnoy mind testing those links again?
@BAP-Jeff, I believe you said you were also looking at PLANS files, right?
Some questions and possible improvements for DRUGS:

- The `ServiceAreaID` and `FormularyID` fields are equal. Where are they coming from?
- Would it be possible to use snake case and naming that's consistent with the schema spec (https://github.com/CMSgov/QHP-provider-formulary-APIs), rather than camel case and descriptive naming? For example...
  - `plan_id`, instead of `StandardComponentID` (also add column `plan_id_type`, just in case, although it's typically HIOS_PLAN_ID)
  - `marketing_name`, instead of `PlanMarketingName`
  - `drug_tier`, instead of `DrugStatus`
  - `rxnorm_id`, instead of `RxCUI`
  - `prior_authorization`, instead of `PAStatus` (use true/false, instead of y/n)
  - `step_therapy`, instead of `STStatus` (use true/false, instead of y/n)
  - `quantity_limit`, instead of `QLStatus` (use true/false, instead of y/n)
- Would you want to add a couple more fields from the PUF file:
  - `marketplace_category` (FFM/SPM)
  - `issuer_name` (in addition to the `PlanMarketingName` you already have)

So exciting!
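If the flat file were regenerated along those lines, the renaming and y/n conversion could look like this sketch (assuming pandas and the hypothetical file name from the join sketch above):

```python
import pandas as pd

RENAMES = {"StandardComponentID": "plan_id", "PlanMarketingName": "marketing_name",
           "DrugStatus": "drug_tier", "RxCUI": "rxnorm_id",
           "PAStatus": "prior_authorization", "STStatus": "step_therapy",
           "QLStatus": "quantity_limit"}
BOOL_COLS = ["prior_authorization", "step_therapy", "quantity_limit"]

df = pd.read_csv("hackathon_drugs.pip", sep="|", dtype=str)
df = df.rename(columns=RENAMES)
for col in BOOL_COLS:
    # Convert y/n flags to true/false, per the schema-consistent naming request.
    df[col] = df[col].str.lower().map({"y": "true", "n": "false"})
df.to_csv("hackathon_drugs_snake.pip", sep="|", index=False)
```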
@marks Progress! Data grid and APIs seem to be working, but data explorer redirects to https://healthdata.demo.socrata.com/. I can dig into the detail by tomorrow morning.
@dportnoy think I fixed my typo. Gonna load @BAP-Jeff's file so that I can play with some aggregations etc. via the API, as opposed to having to work with the full file, which is a bit large ;)
@dportnoy, I have not been working with the PLANS files yet.
Regarding your questions:
Here is what the modified header would look like:
```
state_code|issuer_id|plan_id|market_coverage|plan_marketing_name|hios_product_id|service_area_id|formulary_id|plan_type|metal_level|drug_tier|rxnorm_id|drug_name|prior_authorization|step_therapy|quantity_limit
```
Sorry those pipes are hard to see.
@dportnoy, I am taking a quick look at plans.json. I just downloaded them all and am now playing with the extraction. Probably not a big challenge.
@BAP-Jeff's drugs.json extract is on Socrata's data exploration experience, which also links to export and API options, at https://healthdata.demo.socrata.com/view/7unb-qpyc if anyone wants to use it to take a look. If there is a data dictionary explaining each field, I can put that in to provide a little more context for users. Hope it's helpful!
Mark
@marks, @BAP-Jeff, way to team up and get the DRUGS file built and hosted!
@BAP-Jeff, thanks for the updated header record! To answer your question, `marketplace_category` and `issuer_name` can be found in the "Machine-readable URL PUF" seed file (http://download.cms.gov/marketplace-puf/2016/machine-readable-url-puf.zip).
@dportnoy, I see the issuer_name but I don't see the marketplace_category.
@marks, I was finally able to get to all the links in https://github.com/demand-driven-open-data/ddod-intake/issues/56#issuecomment-210890927. Great start on those files! I see that you combined arrays and hierarchies into single fields for simplicity. I think the next step should be to generate the complete files, so that at least we have something useful as a starting point for the weekend. We can always improve later as time permits. As with @BAP-Jeff's DRUGS file, we can use the datasets you already built as the sample subsets for app development and testing.
Is there a naming convention for fields that start with an underscore or contain a period?
Not crucial, but... consider pulling in a few of the attributes from the "Machine-readable URL PUF" seed file, or another cross-reference between PLANS, PROVIDERS, and DRUGS that might be useful for analytics.
@dportnoy, I have the plans.json material parsed. What is less obvious is how we want to conflate these guys into a single record... actually, I don't think it makes sense to do so. Have you looked at it at all? I could create plan, network, formulary, and cost_sharing files if you like, joined by plan_id or something. Looks like the benefit piece is a work in progress.
Let me know. Unlike the formulary, these files are extremely easy to work with (read: small), so I can iterate much more quickly...
@BAP-Jeff, on your questions: it looks like the `marketplace_category` field was dropped from the latest PUF. Here's a PUF from December 2015 that has it: https://github.com/demand-driven-open-data/ddod-intake/blob/master/_files/Machine_Readable_PUF_2015-12-21.xlsx

@BAP-Jeff, @marks, @cornstein, if time allows it might be useful to figure out how to represent the plans that have missing data. In other words, those that either:
Seems that capturing the missing values would be useful for both analytics and consumer applications. The question is where best to capture this info. Thoughts?
@dportnoy, I am out for the rest of the week as of EOB Wednesday. I am going to prioritize doing another cut of the formulary data with the additional fields. Then I will create the plans data. Unlikely I am going to get to the slides at this moment. Sorry.
As for your piece on who's missing: #1 is easy, #2 takes a bit of manual entry, and #3 (at least for me) ends up in bucket #2.
Should I just add a column to, say, the machine-readable PUF with "GOOD"/"BAD" for the URL provided?
All, here is my quick attempt to stitch something together for plans.json.
Four files joined on plan_id and/or plan_id-drug_tier. Take a look, just one plan for now.
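A sketch of how that split might work, with field names taken from the QHP plans.json schema and output names matching the headers shown a few comments below; the code itself is an assumed reconstruction, not Jeff's script:

```python
import json

def split_plans(path):
    """Split plans.json records into plan, network, formulary, and
    cost-sharing rows, joined on plan_id and plan_id + drug_tier."""
    plans, networks, tiers, costs = [], [], [], []
    with open(path) as f:
        for p in json.load(f):
            pid = p.get("plan_id", "")
            plans.append([pid, p.get("marketing_name", ""), p.get("summary_url", ""),
                          p.get("marketing_url", ""), p.get("formulary_url", ""),
                          p.get("plan_contact", ""), p.get("last_updated_on", "")])
            for n in p.get("network", []):
                networks.append([pid, n.get("network_tier", "")])
            for t in p.get("formulary", []):
                tier = t.get("drug_tier", "")
                tiers.append([pid, tier, t.get("mail_order", "")])
                for c in t.get("cost_sharing", []):
                    costs.append([pid, tier, c.get("pharmacy_type", ""),
                                  c.get("copay_amount", ""), c.get("copay_opt", ""),
                                  c.get("coinsurance_rate", ""), c.get("coinsurance_opt", "")])
    return plans, networks, tiers, costs

def write_pip(name, header, rows):
    """Write rows as a pipe-delimited file with a header record."""
    with open(name, "w") as f:
        f.write("|".join(header) + "\n")
        for r in rows:
            f.write("|".join(str(v) for v in r) + "\n")

p, n, t, c = split_plans("plans.json")  # hypothetical local copy of one plans.json
write_pip("PlanOut.pip", ["plan_id", "marketing_name", "summary_url", "marketing_url",
                          "formulary_url", "plan_contact", "last_updated_on"], p)
write_pip("PlanNetworkOut.pip", ["plan_id", "network"], n)
```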
@BAP-Jeff,
> Should I just add a column to say the machine readable puf with "GOOD", "BAD" for the URL provided?

Sure, that could work. (If it's easier, we can have a separate addendum file with the same header, containing just the plans that couldn't be processed. Analysts could simply append this file to the main one if they want to include the missing data points.)
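A minimal sketch of that URL check; the seed list and output file here are hypothetical:

```python
import csv
import requests

def url_status(url):
    """Label a submitted URL GOOD if it responds and parses as JSON, else BAD."""
    try:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        resp.json()
        return "GOOD"
    except Exception:
        return "BAD"

# Hypothetical seed list pulled from the machine-readable URL PUF.
index_urls = ["https://example-issuer.com/cms-data/index.json"]
with open("url_status.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "status"])
    for u in index_urls:
        writer.writerow([u, url_status(u)])
```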
@BAP-Jeff, on https://github.com/demand-driven-open-data/ddod-intake/issues/56#issuecomment-212055405
```
==> PlanOut.pip <==
plan_id|marketing_name|summary_url|marketing_url|formulary_url|plan_contact|last_updated_on

==> PlanNetworkOut.pip <==
plan_id|network

==> formularyOut.pip <==
plan_id|drug_tier|mail_order

==> costOut.pip <==
plan_id|drug_tier|pharmacy_type|copay_amount|copay_opt|coinsurance_rate|coninsurance_opt
```
Looks good! Let's load. A couple observations for future improvement:

- Should the network column be `network_tier`?
- Missing a `telemedicine` field
Putting out a call to those interested in making an impact by contributing to public data projects... Looking for somebody to create a new public dataset (and accompanying source code).
Background
In November 2015, the Centers for Medicare & Medicaid Services (CMS) enacted a new regulatory requirement for health insurers who list plans on the insurance marketplaces: they must now publish machine-readable versions of their provider network directories and drug formularies, conforming to a specified JSON standard, and update them at least monthly. This data has only recently become accessible to the public. Some of its uses can be found in the Bayes Impact hackathon "prompts" or in at least 7 DDOD use cases.
Challenge
While these newly available datasets can significantly benefit consumer health applications and a range of healthcare analytics, the current format doesn't lend itself to those uses.
Request
Write code that does the following:
Run the code and let us know where to find the resulting files. We should be able to find a good home for them, so that they enjoy widespread use.
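As a rough starting point, here is a sketch of the crawl's first stage: reading issuer index.json URLs from the machine-readable URL PUF (the CSV export name and column name are assumptions) and collecting each issuer's plan, provider, and formulary file URLs:

```python
import csv
import requests

def collect_urls(seed_csv, url_column="URL Submitted"):
    """Read issuer index.json URLs from a CSV export of the machine-readable
    URL PUF and collect the plan/provider/formulary URLs each one lists."""
    collected = {"plan_urls": [], "provider_urls": [], "formulary_urls": []}
    with open(seed_csv, newline="") as f:
        for row in csv.DictReader(f):
            index_url = row[url_column].strip()
            try:
                doc = requests.get(index_url, timeout=30).json()
            except Exception:
                continue  # skip unreachable or bogus URLs, a known data issue
            for key in collected:
                collected[key].extend(doc.get(key, []))
    return collected

urls = collect_urls("machine_readable_url_puf.csv")  # hypothetical export
print({k: len(v) for k, v in urls.items()})
```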
If you can do this, you’ll be an official Open Data Hero! (Spandex optional.)