dportnoy opened this issue 8 years ago
@marks, @ftrotter, @cornstein, @BAP-Jeff, @loranstefani, @Jordanrau, @HeatherBrotsos, @lloydbrodsky, Since you're all past contributors, I wanted to see if any of you would take the challenge. Feel free to pass it on to your colleagues as well.
In case it's helpful, I started working on the code, but just haven't had enough time to get to it lately. Feel free to use. You can find it here: https://github.com/dportnoy/health-insurance-marketplace-analytics
@dportnoy - looks like you got pretty far! Would you mind documenting what it does so far and what areas need attention? It would definitely help so others don't accidentally reinvent the wheel.
@marks You give me entirely too much credit. But point taken. I'll write up a summary and post it.
Translation from the JSON schema into tabular format should be straightforward...
I put something together really quickly that takes insurance providers' `index.json` URLs and flattens the data from the `formulary_urls` array.
Sample dataset: https://healthdata.demo.socrata.com/view/979j-m4qb Code: https://github.com/marks/health-insurance-marketplace-analytics/commit/3507b4738fbb1ba75126e474bd6c24f0baf9d133
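For anyone who wants the gist without reading the commit, here is a minimal sketch of the approach: one row per formulary URL per issuer `index.json`. It is an illustration, not the actual script, and the index URL shown is a hypothetical placeholder:

```python
import csv
import requests

def flatten_index(index_url, writer):
    """Fetch an issuer's index.json and write one row per formulary URL."""
    doc = requests.get(index_url, timeout=30).json()
    for url in doc.get("formulary_urls", []):
        writer.writerow({"index_url": index_url, "formulary_url": url})

with open("formulary_urls_flat.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["index_url", "formulary_url"])
    writer.writeheader()
    # Hypothetical issuer index URL; real ones come from the machine-readable PUF.
    flatten_index("https://example-issuer.com/cms-data/index.json", writer)
```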
Really interesting output, but I wonder if there's a way to make this a bit more useful. Perhaps we can have a brief chat?
@cornstein - definitely... That's just a quick visualization of the data. You can export it (all or a slice) or use the API as well, of course.
Kind of a hectic day, but happy to jump on a call tomorrow or continue to chat here if you're interested in a specific view.
Here are some options for the end product, along with their uses. They are listed in increasing order of difficulty:
I've made a bit more progress. The latest code is available at https://github.com/marks/health-insurance-marketplace-analytics/blob/master/flattener/flatten_from_index.py
I have code that starts* to flatten all three file types: `drugs.json`, `plans.json`, and `providers.json`. I have only run `providers.json` and `drugs.json` on a few `index.json`s as it can take a while to process.
The flattened data can be explored/downloaded/called-with-an-API from the following links:
| File Type Flattened | Data Explorer (you can export from here too) | Data Grid (you can do roll-ups and filtering here) | API Docs |
|---|---|---|---|
| plans.json | https://healthdata.demo.socrata.com/view/yjgn-upfm | https://healthdata.demo.socrata.com/d/x6bt-9iym | https://dev.socrata.com/foundry/healthdata.demo.socrata.com/55ch-w329 |
| providers.json | https://healthdata.demo.socrata.com/view/pqgg-pgsy | https://healthdata.demo.socrata.com/d/axbq-xnwy | https://dev.socrata.com/foundry/healthdata.demo.socrata.com/5ht2-73gj |
| drugs.json | https://healthdata.demo.socrata.com/view/979j-m4qb | https://healthdata.demo.socrata.com/d/jaa8-k3k2 | https://dev.socrata.com/foundry/healthdata.demo.socrata.com/6ngr-bz3d |
*Not all data is flattened at this time, to decrease the number of rows. For example, `languages` in the provider file are serialized as a JSON array, as are `addresses` and `specialties`.
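To illustrate that design choice, here is a hedged sketch of flattening a single provider record while keeping the array fields as serialized JSON strings. The field names loosely follow the QHP providers.json schema and should be treated as assumptions; this is not the actual flattener code:

```python
import json

def flatten_provider(provider):
    """Flatten one providers.json record into a single row, serializing
    array fields as JSON strings instead of exploding them into many rows."""
    return {
        "npi": provider.get("npi"),
        "type": provider.get("type"),
        # Arrays stay as JSON text to keep one row per provider.
        "languages": json.dumps(provider.get("languages", [])),
        "addresses": json.dumps(provider.get("addresses", [])),
        "specialties": json.dumps(provider.get("specialty", [])),
    }

example = {"npi": "1234567890", "type": "INDIVIDUAL",
           "languages": ["English", "Spanish"], "specialty": ["Cardiology"]}
print(flatten_provider(example))
```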
@BAP-Jeff, hi again. Welcome to use case #56! To answer your question from issue #52... Having a ready-to-go dataset would be ideal for the Bayes Hack codathon next weekend. It's identified as the central dataset for building consumer healthcare apps. You'll see a comment above that identifies 4 possible file types: https://github.com/demand-driven-open-data/ddod-intake/issues/56#issuecomment-210875780. It's good to have a few people working on this, because some aspects (such as file sizes and cleanup for analytics) are challenging. Having the code would be a useful resource for the community to refresh the data in the future.
Hello All,
As David knows, I have been playing with the drugs.json set for a while as I build some commercial applications for the formulary space. There are a ton of challenges in using the data as supplied, but overall it is a good first-year effort by the agency.
If I understand the challenges David laid out, I have been pursuing one of those options: (1) I pull the list of URLs from the Machine Readable PUF and retrieve the drugs.json files for each, then (2) I run a second script to rip apart the JSON documents into a flat pipe-delimited file.
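For step (2), the flattening boils down to walking each drug's plans array and emitting one row per drug-plan pair. Here is an assumed reconstruction, not Jeff's actual script; the field names come from the QHP drugs.json schema:

```python
import json

FIELDS = ["plan_id", "drug_tier", "rxnorm_id", "drug_name",
          "prior_authorization", "step_therapy", "quantity_limit"]

def rows_from_drugs_json(path):
    """Yield one flat row per (drug, plan) pair from a drugs.json file."""
    with open(path) as f:
        drugs = json.load(f)
    for drug in drugs:
        for plan in drug.get("plans", []):
            yield [str(plan.get("plan_id", "")), str(plan.get("drug_tier", "")),
                   str(drug.get("rxnorm_id", "")), str(drug.get("drug_name", "")),
                   str(plan.get("prior_authorization", "")),
                   str(plan.get("step_therapy", "")),
                   str(plan.get("quantity_limit", ""))]

with open("drugs_flat.pip", "w") as out:
    out.write("|".join(FIELDS) + "\n")
    for row in rows_from_drugs_json("drugs.json"):  # hypothetical local file
        out.write("|".join(row) + "\n")
```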
I am happy to share the code and/or the end result (warning: around 35 to 40 million records). What do we want for the hack-a-thon? I have to admit I have not participated in one before, so I am not sure if people are only interested in the end result (uploading the dataset to, say, Socrata or Kaggle) or whether they want to understand the process so they can roll their own.
Let me know which way the group wants to go.
Jeff
On Sat, Apr 16, 2016 at 11:46 PM, David X Portnoy notifications@github.com wrote:
@BAP-Jeff https://github.com/BAP-Jeff, hi again. Welcome to use case #56 https://github.com/demand-driven-open-data/ddod-intake/issues/56! To answer your question from issue #52 https://github.com/demand-driven-open-data/ddod-intake/issues/52... Having a ready to go dataset would be ideal for the Bayes Hack codathon for next weekend. It's identified as the central dataset for building consumer healthcare apps. You'll see a comment above that identifies 4 possible file types: #56 (comment) https://github.com/demand-driven-open-data/ddod-intake/issues/56#issuecomment-210875780. It's good to have a few people working on this, because some of the aspects (such as file sizes and cleanup for analytics) are challenging. Having the code would be a useful resource for the community to refresh the data in the future.
— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/demand-driven-open-data/ddod-intake/issues/56#issuecomment-210948520
@BAP-Jeff - awesome. I think both the data and the provenance (how you got it, including any scripts/documentation) would be much appreciated by participants as well as the community in general. Would be happy to look into hosting it on Socrata if you can provide the raw data.
Mark
Okay, thanks. I will put together some things tonight. Much of my time working with this dataset is spent dealing with the data discrepancies (i.e. multiple statuses on a drug for a plan, use of NDCs vs. RxCUIs, missing restrictions, bogus URLs), but showing off the millions of records that are produced will be cool. Plus, hopefully making it more accessible will get more eyes on it... and we can get those things corrected.
Here is the "process" I would propose that we show off (for the drugs.json branch):
David, do you need a "deck" or some sort of visual to present? Help me out if I am going astray...
Jeff
I don't want to make more work for you, @BAP-Jeff, but I'd be very interested in what your CSVs look like. I am running some code on the `index.json`s I found in a previously shared xlsx to flatten all three file types (but they may not have all the important info... I am not an SME in this area).
I actually use pipe delimiting but can easily change it out. I am attaching an example source json file and what I am outputting (tried to find a small one). It is very close to what you had shown off before Mark. I can easily add some more columns to this from the Plan Attributes PUF file to be more complete (IssuerID, Region, etc), but sounded like you wanted to see what the current state was. I also included the python script I used to parse the file.
The format of the output is:
@BAP-Jeff - Yup! Looks quite similar. I would think it would be very helpful to folks to include info from the Plan Attributes PUF. Will stand by for files so that we can get them on Socrata* (and Kaggle and other platforms) for dissemination/exploration/API access.
*Full disclosure: I work for Socrata and we're more than happy to help folks download the data in a number of formats. Can accept pipe-delimited, no big deal.
Okay I will be gone most of the day but will get on this later today.
@BAP-Jeff, great to have you contribute! Let's see if I could answer most of your questions...
David,
I think I can take on all the above but may need a little help with the Deck. Do you have a template you want to use - Powerpoint, Google, ??
I can produce the files tonight/tomorrow and upload them probably tomorrow.
@dportnoy, give me the details on the Kaggle stuff. @marks, let me know where to drop the files for Socrata.
David, did you want to reserve any slides for ..."Issues we found in the data" or does that diminish the good vibes?
Jeff
Jeff- if I can just get delimited files, I can take care of loading it for ya. Sounds like kaggle is top priority for this event though ;)
Okay I will loop back around to you as time is short.
@BAP-Jeff Google slides. No specific template. Just the white default.
For Kaggle, the upload will be done manually by their staff (since they're still working on making that functionality self-serve), so we need a staging location before the data gets copied there.
Okay, I have the file ready to go. It is about 5 GB uncompressed (31 million records). The file has the following table def:
```sql
create table Hackathon (
  StateCode varchar(2),
  IssuerID char(5),
  StandardComponentID varchar(14),
  MarketCoverage varchar(18),
  PlanMarketingName varchar(142),
  HIOSProductID varchar(10),
  ServiceAreaID varchar(6),
  FormularyID varchar(6),
  PlanType varchar(9),
  MetalLevel varchar(12),
  DrugStatus varchar(100),
  RxCUI varchar(15),
  DrugName varchar(2000),
  PAStatus char(1),
  STStatus char(1),
  QLStatus char(1)
)
```
I pulled from Plan Attributes what I thought was the key information. Note: PlanID in the Formulary Files is not the same as PlanID in Plan Attributes; the analogous column is StandardComponentID. Furthermore, I ignored the CSR variations in Plan Attributes and joined the distinct records from Plan Attributes to the Formulary Files. If this didn't make sense... well, I guess trust me...
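For readers following along, a hedged sketch of that join, assuming pandas and hypothetical local file names:

```python
import pandas as pd

# Hypothetical local extracts: the Plan Attributes PUF as CSV and the
# pipe-delimited formulary rows produced by the drugs.json parser.
plan_attrs = pd.read_csv("plan_attributes_puf.csv", dtype=str)
formulary = pd.read_csv("formulary_flat.pip", sep="|", dtype=str)

# The PlanID carried in the formulary files corresponds to
# StandardComponentID in Plan Attributes, so rename before joining.
formulary = formulary.rename(columns={"plan_id": "StandardComponentID"})

# Ignore CSR variations by keeping one distinct row per StandardComponentID.
keep = ["StandardComponentID", "StateCode", "IssuerID", "MarketCoverage",
        "PlanMarketingName", "HIOSProductID", "ServiceAreaID",
        "FormularyID", "PlanType", "MetalLevel"]
plan_attrs = plan_attrs[keep].drop_duplicates("StandardComponentID")

merged = formulary.merge(plan_attrs, on="StandardComponentID", how="left")
merged.to_csv("hackathon_drugs.pip", sep="|", index=False)
```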
I have added a header record to the file and am now waiting for the zipping to finish up. I assume it will get down to the 2 GB range. I can throw it up into an S3 bucket and share it, I guess, or GDrive. Let me know. Could be a few days on the slides...
Jeff
@BAP-Jeff, definitely throw it up on either S3 or gDrive, whichever works best! Ideally, you could also provide a smaller file that has a more manageable sample subset of this data, so that it can be used to test code.
@ftrotter, wanted to ping you and see if you're still working on any aspect of this.
Actually it zipped down to 250 megs, so it's not such a monster. I also did a sample file for all of Humana in Georgia.
@BAP-Jeff - I think you missed a pipe in the header row
Thank goodness someone is QAing this! Thanks @marks. Be right back.
DRUGS file sample: https://drive.google.com/file/d/0B9yZheZrBn54RHk1SmszenNnYTg/ DRUGS file complete: https://drive.google.com/file/d/0B9yZheZrBn54MGplUW5BeVRhMDQ/
@marks apologies for the delayed reply on your questions in https://github.com/demand-driven-open-data/ddod-intake/issues/56#issuecomment-210890927.
Unfortunately, I can't get your links to work. I'm getting a message "Sign in with your Socrata ID". But when I register a new ID, I get "The current user doesn't have access to this view". For the API links, I'm getting a "Private Dataset" message.
@BAP-Jeff, it rocks!
DRUGS file complete: Hackathon.zip unzips (using `tar xzvf Hackathon.zip`).
DRUGS file sample: HackathonSample.zip unzips (using `tar xzvf HackathonSample.zip`).
I verified the columns for the sample file align and match the header you provided. So we might be good to go with the drug formulary files!
@dportnoy mind testing those links again?
@BAP-Jeff, I believe you said you were also looking at PLANS files, right?
Some questions and possible improvements for DRUGS:

- The `ServiceAreaID` and `FormularyID` fields are equal. Where are they coming from?
- Would it be possible to use snake case and naming that's consistent with the schema spec (https://github.com/CMSgov/QHP-provider-formulary-APIs), rather than camel case and descriptive naming? For example...
  - `plan_id`, instead of `StandardComponentID` (also add column `plan_id_type`, just in case, although it's typically HIOS_PLAN_ID)
  - `marketing_name`, instead of `PlanMarketingName`
  - `drug_tier`, instead of `DrugStatus`
  - `rxnorm_id`, instead of `RxCUI`
  - `prior_authorization`, instead of `PAStatus` (use true/false, instead of y/n)
  - `step_therapy`, instead of `STStatus` (use true/false, instead of y/n)
  - `quantity_limit`, instead of `QLStatus` (use true/false, instead of y/n)
- Would you want to add a couple more fields from the PUF file:
  - `marketplace_category` (FFM/SPM)
  - `issuer_name` (in addition to the `PlanMarketingName` you already have)

So exciting!
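If the flat file were regenerated along those lines, the renaming and y/n conversion could look like this sketch (assuming pandas and the hypothetical file name from the join sketch above):

```python
import pandas as pd

RENAMES = {"StandardComponentID": "plan_id", "PlanMarketingName": "marketing_name",
           "DrugStatus": "drug_tier", "RxCUI": "rxnorm_id",
           "PAStatus": "prior_authorization", "STStatus": "step_therapy",
           "QLStatus": "quantity_limit"}
BOOL_COLS = ["prior_authorization", "step_therapy", "quantity_limit"]

df = pd.read_csv("hackathon_drugs.pip", sep="|", dtype=str)
df = df.rename(columns=RENAMES)
for col in BOOL_COLS:
    # Convert y/n flags to true/false, per the schema-consistent naming request.
    df[col] = df[col].str.lower().map({"y": "true", "n": "false"})
df.to_csv("hackathon_drugs_snake.pip", sep="|", index=False)
```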
@marks Progress! Data grid and APIs seem to be working, but data explorer redirects to https://healthdata.demo.socrata.com/. I can dig into the detail by tomorrow morning.
@dportnoy think I fixed my typo. Gonna load @BAP-Jeff's file so that I can play with some aggregations etc. via the API, as opposed to having to work with the full file, which is a bit large ;)
@dportnoy, I have not been working with the PLANS files yet.
Regarding your questions:
Here is what the modified header would look like:
```
state_code|issuer_id|plan_id|market_coverage|plan_marketing_name|hios_product_id|service_area_id|formulary_id|plan_type|metal_level|drug_tier|rxnorm_id|drug_name|prior_authorization|step_therapy|quantity_limit
```
Sorry those pipes are hard to see.
@dportnoy, I am taking a quick look at plans.json. I just downloaded them all and am now playing with the extraction. Probably not a big challenge.
@BAP-Jeff's drugs.json extract is on Socrata's data exploration experience, which also links to export and API options, at https://healthdata.demo.socrata.com/view/7unb-qpyc if anyone wants to use it to take a look. If there is a data dictionary explaining each field, I can put that in to provide a little more context for users. Hope it's helpful!
Mark
@marks, @BAP-Jeff, way to team up and get the DRUGS file built and hosted!
@BAP-Jeff, thanks for the updated header record! To answer your question, `marketplace_category` and `issuer_name` can be found in the "Machine-readable URL PUF" seed file (http://download.cms.gov/marketplace-puf/2016/machine-readable-url-puf.zip).
@dportnoy, I see the issuer_name but I don't see the marketplace_category.
@marks, I was finally able to get to all the links in https://github.com/demand-driven-open-data/ddod-intake/issues/56#issuecomment-210890927. Great start on those files! I see that you combined arrays and hierarchies into single fields for simplicity. I think the next step should be to generate the complete files, so that at least we have something useful as a starting point for the weekend. We can always improve later as time permits. As with @BAP-Jeff's DRUGS file, we can use the datasets you already built as the sample subsets for app development and testing.
Is there a naming convention for fields that start with an underscore or contain a period?
Not crucial, but... consider pulling in a few of the attributes from the "Machine-readable URL PUF" seed file, or another cross-reference between PLANS, PROVIDERS, and DRUGS that might be useful for analytics.
@dportnoy, I have the plans.json material parsed. What is less obvious is how we want to conflate these guys into a single record... actually, I don't think it makes sense to do so. Have you looked at it at all? I could create plan, network, formulary, and cost_sharing files if you like, joined by plan_id or something. Looks like the benefit piece is a work in progress.
Let me know. Unlike the formulary, these files are extremely easy to work with (read: small), so I can iterate much more quickly...
@BAP-Jeff, on your questions: it looks like the `marketplace_category` field was dropped from the latest PUF. Here's a PUF from December 2015 that has it: https://github.com/demand-driven-open-data/ddod-intake/blob/master/_files/Machine_Readable_PUF_2015-12-21.xlsx

@BAP-Jeff, @marks, @cornstein, if time allows it might be useful to figure out how to represent the plans that have missing data. In other words, those that either:
Seems that capturing the missing values would be useful for both analytics and consumer applications. The question is where best to capture this info. Thoughts?
@dportnoy, I am out for the rest of the week as of EOB Wednesday. I am going to prioritize doing another cut of the formulary data with the additional fields. Then I will create the plans data. Unlikely I am going to get to the slides at this moment. Sorry.
As for your piece on who's missing: #1 is easy, #2 takes a bit of manual entry, and #3 (at least for me) ends up in bucket #2.
Should I just add a column to, say, the machine-readable PUF with "GOOD"/"BAD" for the URL provided?
All, here is my quick attempt to stitch something together for plans.json.
Four files joined on plan_id and/or plan_id-drug_tier. Take a look, just one plan for now.
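A sketch of how that split might work, with field names taken from the QHP plans.json schema and output names matching the headers shown a few comments below; the code itself is an assumed reconstruction, not Jeff's script:

```python
import json

def split_plans(path):
    """Split plans.json records into plan, network, formulary, and
    cost-sharing rows, joined on plan_id and plan_id + drug_tier."""
    plans, networks, tiers, costs = [], [], [], []
    with open(path) as f:
        for p in json.load(f):
            pid = p.get("plan_id", "")
            plans.append([pid, p.get("marketing_name", ""), p.get("summary_url", ""),
                          p.get("marketing_url", ""), p.get("formulary_url", ""),
                          p.get("plan_contact", ""), p.get("last_updated_on", "")])
            for n in p.get("network", []):
                networks.append([pid, n.get("network_tier", "")])
            for t in p.get("formulary", []):
                tier = t.get("drug_tier", "")
                tiers.append([pid, tier, t.get("mail_order", "")])
                for c in t.get("cost_sharing", []):
                    costs.append([pid, tier, c.get("pharmacy_type", ""),
                                  c.get("copay_amount", ""), c.get("copay_opt", ""),
                                  c.get("coinsurance_rate", ""), c.get("coinsurance_opt", "")])
    return plans, networks, tiers, costs

def write_pip(name, header, rows):
    """Write rows as a pipe-delimited file with a header record."""
    with open(name, "w") as f:
        f.write("|".join(header) + "\n")
        for r in rows:
            f.write("|".join(str(v) for v in r) + "\n")

p, n, t, c = split_plans("plans.json")  # hypothetical local copy of one plans.json
write_pip("PlanOut.pip", ["plan_id", "marketing_name", "summary_url", "marketing_url",
                          "formulary_url", "plan_contact", "last_updated_on"], p)
write_pip("PlanNetworkOut.pip", ["plan_id", "network"], n)
```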
@BAP-Jeff,
> Should I just add a column to say the machine readable puf with "GOOD", "BAD" for the URL provided?

Sure, that could work. (If it's easier, we can have a separate addendum file with the same header, containing just the plans that couldn't be processed. Analysts could simply append this file to the main one if they want to include the missing data points.)
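A minimal sketch of that URL check; the seed list and output file here are hypothetical:

```python
import csv
import requests

def url_status(url):
    """Label a submitted URL GOOD if it responds and parses as JSON, else BAD."""
    try:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        resp.json()
        return "GOOD"
    except Exception:
        return "BAD"

# Hypothetical seed list pulled from the machine-readable URL PUF.
index_urls = ["https://example-issuer.com/cms-data/index.json"]
with open("url_status.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "status"])
    for u in index_urls:
        writer.writerow([u, url_status(u)])
```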
@BAP-Jeff, on https://github.com/demand-driven-open-data/ddod-intake/issues/56#issuecomment-212055405
```
==> PlanOut.pip <==
plan_id|marketing_name|summary_url|marketing_url|formulary_url|plan_contact|last_updated_on

==> PlanNetworkOut.pip <==
plan_id|network

==> formularyOut.pip <==
plan_id|drug_tier|mail_order

==> costOut.pip <==
plan_id|drug_tier|pharmacy_type|copay_amount|copay_opt|coinsurance_rate|coninsurance_opt
```
Looks good! Let's load. A couple observations for future improvement:

- Should the network column be `network_tier`?
- Missing a `telemedicine` field
Putting out a call to those interested in making an impact by contributing to public data projects... Looking for somebody to create a new public dataset (and accompanying source code).
Background
In November 2015, the Centers for Medicare & Medicaid Services (CMS) enacted a new regulatory requirement for health insurers who list plans on the insurance marketplaces: they must now publish machine-readable versions of their provider network directories and drug formularies, conforming to a specified JSON standard, and update them at least monthly. This data has only recently become accessible to the public. Some of its uses can be found in the Bayes Impact hackathon "prompts" or in at least 7 DDOD use cases.
Challenge
While these newly available datasets can significantly benefit consumer health applications and a range of healthcare analytics, the current format doesn't lend itself to those uses.
Request
Write code that does the following:
Run the code and let us know where to find the resulting files. We should be able to find a good home for them, so that they enjoy widespread use.
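As a rough starting point, here is a sketch of the crawl's first stage: reading issuer index.json URLs from the machine-readable URL PUF (the CSV export name and column name are assumptions) and collecting each issuer's plan, provider, and formulary file URLs:

```python
import csv
import requests

def collect_urls(seed_csv, url_column="URL Submitted"):
    """Read issuer index.json URLs from a CSV export of the machine-readable
    URL PUF and collect the plan/provider/formulary URLs each one lists."""
    collected = {"plan_urls": [], "provider_urls": [], "formulary_urls": []}
    with open(seed_csv, newline="") as f:
        for row in csv.DictReader(f):
            index_url = row[url_column].strip()
            try:
                doc = requests.get(index_url, timeout=30).json()
            except Exception:
                continue  # skip unreachable or bogus URLs, a known data issue
            for key in collected:
                collected[key].extend(doc.get(key, []))
    return collected

urls = collect_urls("machine_readable_url_puf.csv")  # hypothetical export
print({k: len(v) for k, v in urls.items()})
```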
If you can do this, you’ll be an official Open Data Hero! (Spandex optional.)