issue #76 - Join Medicare Part D spending data to ATC Classification System

Anandkarthick commented 6 years ago

Output files and Jupyter notebook for Issue 76..

darya-akimova commented 6 years ago

Hey Anandkarthick, Thanks for all your work, but after reviewing the repo, there's a number of issues that need to be resolved before it can be merged:

The spending_data_merged file is unnecessary, please remove it because it just adds on more unnecessary bulk to the repo. The original .xlsx file that the spending .csv files originate from already had all of the spending data in one. It is helpful that you picked on something that I missed, which is that the person to prepared the spending .csv files removed drugs if they were not in the dataset for that year, which leads to unequal numbers of drugs across the csv files.
I had specifically asked for only the merge on the spending-2011.csv file only because there are many drugs left unmatched if you just match by the drugname_generic column and it is important to resolve the reasons why they were not matched for one file before merging all the files. The spending name columns are really messy and, by joining all of the data together before all of the merging issues have been resolved, you've made the task of figuring out what must be done to the columns to merge more of the drugs with the atc codes far more complicated.
The spending_data_match.csv.zip is far too large and messy to be helpful. There's a number of extra columns that are not useful at all (column_a_x and column_a_y for example). Another reason why the file is so large, is that because you matched on drugname_generic and ignored the drugname_brand column, pandas created a merge file where the atc dataset rows were repeated for every occurrence of drugname_brand in the atc dataset, even in cases where there is a mismatch in the drugname_brand between the spending data and the atc data.

If you'd like to keep working on this, I have a few recommendations. After realizing there's drugs missing from each of the year files, I've created and imported spending_part_d_2011to2015_tidy.csv onto data.world. Please set yourself up to use the data.world library for python ( https://data.world/integrations/python ) and then use SQL to pull down only the name columns from either spending_part_d_2011to2015_tidy.csv and then select only the unique rows or the original .xlsx file Medicare_Drug_Spending_PartD_All_Drugs_YTD_2015_12_06_2016.xlsx. Working with only these columns should keep all output files down in size. I will also change the issue to make this recommendation.

Anandkarthick commented 6 years ago

Thank you for the comments and recommendation.

spending_data_merged.csv - I think it makes sense to remove this.. I overlooked the data sets in data.world and created this merged file as part of the effort. I'll remove this file.

spending-2011.csv & spending_data_match.csv.zip - this is my mistake.. I used only generic name to create a match. I just completed python integration with data.world. I'll modify the notebook for spending_2011.csv dataset and create a pull request again.

darya-akimova commented 6 years ago

Great, thanks again for your continued effort on this project! Please feel free to ask me any questions as you go along.

darya-akimova commented 6 years ago

I'm going to close this pull request without merging because there doesn't seem to have been progress and the ATC merging is on hold. If there's more progress on this, please feel free to submit another pull request in the future.

Data4Democracy / drug-spending

issue #76 - Join Medicare Part D spending data to ATC Classification System #78