Closed Anandkarthick closed 6 years ago
Hey Anandkarthick, Thanks for all your work, but after reviewing the repo, there's a number of issues that need to be resolved before it can be merged:
spending_data_merged
file is unnecessary, please remove it because it just adds on more unnecessary bulk to the repo. The original .xlsx file that the spending .csv files originate from already had all of the spending data in one. It is helpful that you picked on something that I missed, which is that the person to prepared the spending .csv files removed drugs if they were not in the dataset for that year, which leads to unequal numbers of drugs across the csv files. spending-2011.csv
file only because there are many drugs left unmatched if you just match by the drugname_generic
column and it is important to resolve the reasons why they were not matched for one file before merging all the files. The spending name columns are really messy and, by joining all of the data together before all of the merging issues have been resolved, you've made the task of figuring out what must be done to the columns to merge more of the drugs with the atc codes far more complicated. spending_data_match.csv.zip
is far too large and messy to be helpful. There's a number of extra columns that are not useful at all (column_a_x
and column_a_y
for example). Another reason why the file is so large, is that because you matched on drugname_generic
and ignored the drugname_brand
column, pandas created a merge file where the atc dataset rows were repeated for every occurrence of drugname_brand
in the atc dataset, even in cases where there is a mismatch in the drugname_brand
between the spending data and the atc data. If you'd like to keep working on this, I have a few recommendations. After realizing there's drugs missing from each of the year files, I've created and imported spending_part_d_2011to2015_tidy.csv
onto data.world. Please set yourself up to use the data.world library for python ( https://data.world/integrations/python ) and then use SQL to pull down only the name columns from either spending_part_d_2011to2015_tidy.csv
and then select only the unique rows or the original .xlsx file Medicare_Drug_Spending_PartD_All_Drugs_YTD_2015_12_06_2016.xlsx
. Working with only these columns should keep all output files down in size. I will also change the issue to make this recommendation.
Thank you for the comments and recommendation.
spending_data_merged.csv - I think it makes sense to remove this.. I overlooked the data sets in data.world and created this merged file as part of the effort. I'll remove this file.
spending-2011.csv & spending_data_match.csv.zip - this is my mistake.. I used only generic name to create a match. I just completed python integration with data.world. I'll modify the notebook for spending_2011.csv dataset and create a pull request again.
Great, thanks again for your continued effort on this project! Please feel free to ask me any questions as you go along.
I'm going to close this pull request without merging because there doesn't seem to have been progress and the ATC merging is on hold. If there's more progress on this, please feel free to submit another pull request in the future.
Output files and Jupyter notebook for Issue 76..