What we're looking for
Tidying:
Convert to tidy format, particularly paying attention to the drug classes
Separate the name column into a brand_name and generic_name, or similar, where appropriate
Clean up the approval_status column, so that the date can be easily converted to date format
Other:
Explore how many of the drugs can be matched to the Medicare spending data.
How many drugs have multiple categories? Could the information in this dataset be useful for categorizing drugs based on therapeutic use?
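A rough sketch of the name and approval_status tidying steps, using only the Python standard library. The input formats are assumptions, not confirmed against the file: it assumes names look like "Brand (generic)" and that approval_status is free text containing a month and year, optionally with a day. The helper names split_name and parse_approval_date are made up for illustration.

```python
import re
from datetime import date, datetime

# ASSUMPTION: names look like "Brand (generic)"; the real file may differ.
NAME_PATTERN = re.compile(r"^\s*(?P<brand>[^()]+?)\s*\((?P<generic>[^()]+)\)\s*$")
# ASSUMPTION: the status text contains "Month YYYY" or "Month D, YYYY".
DATE_PATTERN = re.compile(
    r"(?P<month>[A-Za-z]+)\s+(?:(?P<day>\d{1,2}),?\s+)?(?P<year>\d{4})"
)


def split_name(name):
    """Return (brand_name, generic_name); generic_name is None when absent."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return name.strip(), None
    return match.group("brand"), match.group("generic").strip()


def parse_approval_date(status):
    """Pull a date out of free text, using day 1 when no day is given."""
    match = DATE_PATTERN.search(status)
    if match is None:
        return None
    day = match.group("day") or "1"
    text = f"{match.group('month')} {day} {match.group('year')}"
    # Try full month names first, then abbreviations (English locale assumed).
    for fmt in ("%B %d %Y", "%b %d %Y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None
```

Rows where split_name returns None for the generic name, or where parse_approval_date returns None, would be worth listing and checking by hand, since they show where the assumed formats break down.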
How this will help
The drug_list.json and usp_drug_classification.csv files seem to include the most accessible drug category information: their classification systems lean towards therapeutic use, rather than the scientific/pharmacological classification used in some of the other datasets. However, drug_list.json needs some tidying to convert it into a more user-friendly format. Another issue with this dataset is that the specific_treatment column will need some language processing before it is usable. We need to know whether that work will be worth it, which means knowing how many of the drugs in this file also appear in the Medicare spending files.
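One way to get a first estimate of the overlap is an exact match on normalized names. This is only a sketch: the function names and sample values are hypothetical, and exact matching will undercount if the two sources spell or format names differently.

```python
def normalize(name):
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(name.lower().split())


def match_rate(drug_names, spending_names):
    """Return (matched names, number of unique drug names, matched share)."""
    spending = {normalize(n) for n in spending_names}
    unique = {normalize(n) for n in drug_names}
    matched = unique & spending
    share = len(matched) / len(unique) if unique else 0.0
    return sorted(matched), len(unique), share


# Hypothetical names, for illustration only.
drug_list_names = ["Abilify", "Lipitor ", "Exampledrug"]
medicare_names = ["ABILIFY", "lipitor", "Crestor"]
matched, total, share = match_rate(drug_list_names, medicare_names)
print(f"{len(matched)} of {total} drug names matched ({share:.0%})")
```

Running the match separately on brand names and generic names would show which of the two gives better coverage.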
After some exploration, realized that this dataset only goes back to 1995 and does not include generics, so it is too limited for the purposes of this project. Will close and abandon this issue.
Status
Assigning this to myself. Currently working on formatting the nested list of therapeutic areas into a workable format.
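A minimal sketch of flattening the nested therapeutic areas into one row per drug and area, which also gives the count of drugs with multiple categories. The key name therapeutic_areas and the shape of the records are assumptions for illustration; the actual keys and nesting in drug_list.json may differ.

```python
from collections import Counter


def flatten(value):
    """Yield leaf strings from an arbitrarily nested list; skip other types."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, (list, tuple)):
        for item in value:
            yield from flatten(item)


def to_long_rows(records, key="therapeutic_areas"):
    """Return one (name, area) row per drug and area, dropping repeats."""
    rows = []
    for record in records:
        seen = set()
        for area in flatten(record.get(key, [])):
            area = area.strip()
            if area and area not in seen:
                seen.add(area)
                rows.append((record["name"], area))
    return rows


# Hypothetical records; real keys and nesting may differ.
records = [
    {"name": "Drug A", "therapeutic_areas": ["Cardiology", ["Nephrology"]]},
    {"name": "Drug B", "therapeutic_areas": ["Oncology"]},
]
rows = to_long_rows(records)
areas_per_drug = Counter(name for name, _ in rows)
multi_category = [name for name, n in areas_per_drug.items() if n > 1]
```

The long format keeps each drug and area pair on its own row, so it can be grouped or joined on either column later.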
Task
Tidy and/or possibly explore the drug_list.json dataset, found on data.world.
Data dictionary: https://github.com/Data4Democracy/drug-spending/blob/master/datadictionaries/drug_list.md
Tidy format reference: https://ramnathv.github.io/pycon2014-r/explore/tidy.html