Explore and/or tidy the FDA_NDC_Product dataset

darya-akimova commented 6 years ago

Status

@proof-by-accident has made progress on tidying the FDA_NDC_Product.csv dataset and is planning on tackling the exploration questions below.

Task

Explore and/or possibly tidy the FDA_NDC_Product.csv dataset, found on data.world
Data dictionary: https://github.com/Data4Democracy/drug-spending/blob/master/datadictionaries/FDA_NDC_Product.md
Tidy format reference: https://ramnathv.github.io/pycon2014-r/explore/tidy.html

What we're looking for

Tidying:
- Main columns that need tidying are: nonproprietaryname, substancename, active_numerator_strength, and active_ingred_unit. All of these columns may contain multiple values in one cell if a drug has multiple active ingredients. Ideal format is a separate row for each active ingredient (but other suggestions on better formatting are welcome).

~~Convert all string columns to lower case (can leave the pharm_classes column as is, if uncomfortable evaluating the quality of the result)~~

~~Check if nonproprietaryname and substancename have the same values (after converting both to matching format)~~
Tidying completed! Tidied file: fda_ndc_product_tidy.csv on data.world.
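
For anyone picking this up later, here is a minimal sketch of the "one row per active ingredient" idea described above. It is not the code used to produce fda_ndc_product_tidy.csv. The column names come from this issue, but the sample values and the `;` separator are assumptions, and `split_cell` and `tidy_rows` are hypothetical helper names. nonproprietaryname is left untouched here because how it separates ingredients isn't stated in this issue.

```python
import csv
import io

# Hypothetical sample; column names come from this issue, but the values
# and the ";" separator are assumptions made for illustration only.
SAMPLE_CSV = """proprietaryname,substancename,active_numerator_strength,active_ingred_unit
Drug A,ACETAMINOPHEN; CODEINE PHOSPHATE,300; 30,mg/1; mg/1
Drug B,IBUPROFEN,200,mg/1
"""

MULTI_COLS = ("substancename", "active_numerator_strength", "active_ingred_unit")


def split_cell(value, sep=";"):
    """Split one cell into lower-cased, whitespace-trimmed pieces."""
    return [piece.strip().lower() for piece in value.split(sep)]


def tidy_rows(rows, cols=MULTI_COLS, sep=";"):
    """Yield one row per active ingredient, pairing values by position."""
    for row in rows:
        pieces = {col: split_cell(row[col], sep) for col in cols}
        counts = {len(p) for p in pieces.values()}
        if len(counts) != 1:
            # Columns disagree on the number of ingredients; flag it
            # rather than guessing how the values line up.
            raise ValueError(f"mismatched ingredient counts in row: {row}")
        for i in range(counts.pop()):
            new_row = dict(row)  # copy the untouched columns as they are
            for col in cols:
                new_row[col] = pieces[col][i]
            yield new_row


tidy = list(tidy_rows(csv.DictReader(io.StringIO(SAMPLE_CSV))))
for row in tidy:
    print(row["proprietaryname"], row["substancename"],
          row["active_numerator_strength"], row["active_ingred_unit"])
```

Raising on mismatched counts is a deliberate choice in this sketch: a row where the substance list and the strength list have different lengths probably needs a manual look. If you work in pandas instead, `DataFrame.explode` can do a similar row expansion.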

Potential questions for exploration:

How many drugs have multiple active ingredients?
How many of the drugs found in the Medicare spending datasets have multiple active ingredients according to the FDA_NDC_Product.csv dataset? Try matching proprietaryname from the FDA_NDC_Product.csv dataset to the drugname_brand column in the spending_201x.csv datasets to address this question. After matching, what is the relationship between the nonproprietaryname and/or substancename columns from the FDA_NDC_Product.csv dataset and the drugname_generic column in the spending_201x.csv datasets?
How many of the drugs can be matched at all between the Medicare spending datasets and the FDA_NDC_Product dataset?
Anything else that seems interesting
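
As a rough starting point for the first three questions, here is a small sketch that counts multi-ingredient drugs and matches brand names exactly after normalizing case and whitespace. All rows below are made up; only the column names come from this issue, and `normalize` is a hypothetical helper. Grouping by proprietaryname is a simplification, since one name may cover several products, and real data may well need fuzzier matching than this.

```python
from collections import defaultdict

# Hypothetical tidy rows (one row per active ingredient) and hypothetical
# spending rows; only the column names come from this issue.
tidy_ndc = [
    {"proprietaryname": "drug a", "substancename": "acetaminophen"},
    {"proprietaryname": "drug a", "substancename": "codeine phosphate"},
    {"proprietaryname": "drug b", "substancename": "ibuprofen"},
]
spending = [
    {"drugname_brand": "Drug A ", "drugname_generic": "acetaminophen/codeine"},
    {"drugname_brand": "Drug C", "drugname_generic": "metformin"},
]


def normalize(name):
    """Lower-case and collapse whitespace so names compare consistently."""
    return " ".join(name.lower().split())


# Collect the set of distinct ingredients listed under each proprietary name.
ingredients = defaultdict(set)
for row in tidy_ndc:
    ingredients[normalize(row["proprietaryname"])].add(row["substancename"])

ndc_names = set(ingredients)
multi = {name for name, subs in ingredients.items() if len(subs) > 1}

# Exact match on the normalized brand name; no fuzzy matching is attempted.
brands = {normalize(row["drugname_brand"]) for row in spending}
matched = brands & ndc_names
unmatched = brands - ndc_names

print(len(multi), "of", len(ndc_names), "NDC names have multiple ingredients")
print(len(matched), "of", len(brands), "spending brands matched:", sorted(matched))
print("matched brands with multiple ingredients:", sorted(matched & multi))
print("unmatched spending brands:", sorted(unmatched))
```

The unmatched set is worth looking at by eye, since it shows whether misses come from formatting differences or from drugs that are absent from the FDA dataset.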

How this will help

One option for matching the drugs in the Medicare spending datasets to therapeutic uses is to do so by active ingredient. It seems that the drugname_generic column in the spending datasets should be the main active ingredient that can be used for matching, but I had not considered drugs that may have multiple active ingredients (although drugname_generic appears to list multiple compounds for some drugs, so it may also include the active ingredients). The FDA_NDC_Product.csv dataset seems to contain a comprehensive list of the active ingredients in these drugs. Tidying the FDA_NDC_Product dataset and comparing it with the Medicare spending datasets is an important step in accurately matching drugs to therapeutic uses.

proof-by-accident commented 6 years ago

Hi, I'm interested in this task. I'm new to this project and to online collaboration in general, but not to cleaning and analyzing datasets for offline work. Do you think it would be a good fit?

darya-akimova commented 6 years ago

Hey, welcome! Sounds like you'd definitely be a good fit with those skills, and collaborating on github is fairly straightforward after some time using it. The D4D github playground repo is a great resource to get started, if you haven't checked it out yet and would like some practice. Is there a particular part of this task that you'd like to start with? Or any questions that I can answer?

proof-by-accident commented 6 years ago

Thanks, I'm excited to help out! I haven't seen the playground yet, so thanks for the heads up; I'll check that out for sure. To start, I think I could tidy the dataset without too much difficulty. I'm familiar with basic tidyverse stuff and have read through the attached reference page and "Principles of Tidy Data". Are there any particular tools or other standards that you would prefer I use?

Beyond that I would like to at least address the "Potential questions" listed here. I think I saw somewhere that you prefer Jupyter or R Markdown for exploration/analysis, would that be appropriate here and is there a standard format I should follow?

darya-akimova commented 6 years ago

Congrats on your first commit and pull request! 👍

Thanks for your help, and feel free to use whichever tools you're most comfortable with. Please just save the tidy result to a .csv file, which makes for easy uploading to data.world where the datasets for this project live.

A Jupyter notebook or an R markdown file for any analysis work would be great. There's not really a standard format, the main thing I would ask for is please add in comments or a quick explanation here and there to make it easier for others to follow along with what you've done, but that's about it. Would also be appreciated if you could knit the file to HTML (if you use R markdown) or save the jupyter notebook in such a way that the outputs of your work are shown (I'm not entirely sure how to do this, jupyter is not my strong suit haha) so that it's easy to review what was done without rerunning the notebook/cells. Please save the files to the appropriate language folder in the repo (R or Python), either in the datawrangling, analysisviz, or notebook folders (these have to be reorganized a bit anyway, so doesn't matter so much where it goes). Otherwise, it's really up to you how you want to do this.

proof-by-accident commented 6 years ago

Great, thanks!

darya-akimova commented 6 years ago

I had forgotten to ask, but have you joined the D4D slack and the #p-drug-spending channel there?

proof-by-accident commented 6 years ago

I have, I'm "peter" on the channel.
