Tidy drug_list.json - Githubissues

Status

Assigning this to myself. Currently working on formatting the nested list of therapeutic areas into a workable format.

Task

Tidy and/or possibly explore the drug_list.json dataset, found on data.world.
Data dictionary: https://github.com/Data4Democracy/drug-spending/blob/master/datadictionaries/drug_list.md
Tidy format reference: https://ramnathv.github.io/pycon2014-r/explore/tidy.html

What we're looking for

Tidying:

Convert the .json to a .csv
Convert to tidy format, particularly paying attention to the drug classes
Separate the name column into a brand_name and generic_name, or similar, where appropriate
Clean up the approval_status column so that the date it contains can be easily converted to a date type
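
A minimal sketch of the tidying steps above, using only the Python standard library. The record layout is an assumption, not taken from the actual file: the field names `name` and `therapeutic_areas`, the "Brand (generic)" name pattern, and the "Approved Month YYYY" status text are all hypothetical and should be checked against the data dictionary. The sample records are made up for illustration.

```python
import csv
import io
import re
from datetime import datetime

# Hypothetical records; with the real file, something like
# json.load(open("drug_list.json")) would replace this list.
SAMPLE = [
    {"name": "Brandex (genericol)", "approval_status": "Approved November 2002",
     "therapeutic_areas": ["Neurology", "Psychiatry/Psychology"]},
    {"name": "plainodrug", "approval_status": "Approved May 2010",
     "therapeutic_areas": ["Cardiology/Vascular Diseases"]},
]

FIELDS = ["brand_name", "generic_name", "approval_date", "therapeutic_area"]
# Assumed name pattern: "Brand (generic)".
NAME_RE = re.compile(r"^\s*(?P<brand>[^()]+?)\s*\((?P<generic>[^()]+)\)\s*$")
# Assumed status pattern: free text containing "Month YYYY".
DATE_RE = re.compile(r"([A-Za-z]+)\s+(\d{4})")


def split_name(name):
    """Return (brand_name, generic_name); generic is empty if no parentheses."""
    m = NAME_RE.match(name)
    if m:
        return m.group("brand"), m.group("generic").strip()
    return name.strip(), ""


def parse_approval(status):
    """Return an ISO date (first of the month) or "" if nothing parses."""
    m = DATE_RE.search(status)
    if not m:
        return ""
    try:
        return datetime.strptime(" ".join(m.groups()), "%B %Y").date().isoformat()
    except ValueError:
        return ""


def tidy(records):
    """One output row per (drug, therapeutic area) pair."""
    rows = []
    for rec in records:
        brand, generic = split_name(rec.get("name", ""))
        approved = parse_approval(rec.get("approval_status", ""))
        for area in rec.get("therapeutic_areas") or [""]:
            rows.append({"brand_name": brand, "generic_name": generic,
                         "approval_date": approved, "therapeutic_area": area})
    return rows


def to_csv(rows):
    """Serialize tidy rows to CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(to_csv(tidy(SAMPLE)))
```

Names without parentheses are placed in brand_name with an empty generic_name here; whether such entries are really brand or generic names would need to be decided from the data.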

Other:

Explore how many of the drugs can be matched to the Medicare spending data
How many drugs have multiple categories? Could the information in this dataset be useful for categorizing drugs based on therapeutic use?
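
One possible way to approach these two questions, sketched with the standard library. It assumes the data has already been reshaped to one row per drug and therapeutic area, with hypothetical column names `brand_name` and `therapeutic_area`, and that the Medicare names are available as a plain list of strings. Matching here is exact after lowercasing and whitespace cleanup, which is likely too strict for real data; the drug names shown are invented.

```python
from collections import defaultdict


def normalize(name):
    """Lowercase and collapse whitespace so trivially different names match."""
    return " ".join(name.lower().split())


def match_rate(drug_names, medicare_names):
    """Return (matched, total) counts of distinct drug names found in Medicare."""
    drugs = {normalize(n) for n in drug_names if n}
    medicare = {normalize(n) for n in medicare_names if n}
    return len(drugs & medicare), len(drugs)


def multi_category(rows):
    """Map each drug with more than one therapeutic area to its sorted areas."""
    areas = defaultdict(set)
    for row in rows:
        if row["therapeutic_area"]:
            areas[normalize(row["brand_name"])].add(row["therapeutic_area"])
    return {name: sorted(found) for name, found in areas.items() if len(found) > 1}


if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    rows = [
        {"brand_name": "Brandex", "therapeutic_area": "Neurology"},
        {"brand_name": "Brandex", "therapeutic_area": "Psychiatry/Psychology"},
        {"brand_name": "Otherol", "therapeutic_area": "Cardiology/Vascular Diseases"},
    ]
    medicare = ["BRANDEX", "Thirdine"]
    print(match_rate([r["brand_name"] for r in rows], medicare))
    print(multi_category(rows))
```

If exact matching finds few drugs, trying the generic name as well as the brand name would be a natural next step before concluding the overlap is small.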

How this will help

The drug_list.json and usp_drug_classification.csv files seem to include the most accessible drug category information: their classification systems lean towards therapeutic use rather than the scientific/pharmacological classification used by some of the other datasets. However, drug_list.json needs some tidying to convert it into a more user-friendly format. Another issue is that the specific_treatment column will need some language processing before it is usable. To judge whether that work is worth it, we first need to know how many of the drugs in this file also appear in the Medicare spending files.

Data4Democracy / drug-spending

Tidy drug_list.json #71
