Data4Democracy / drug-spending

Project to understand pharmaceutical spending, currently focused on US government programs.

Cleaning of CMS open payments general payments data #55

Closed sangxia closed 7 years ago

sangxia commented 7 years ago

Scripts for checking basic consistency, splitting information into separate tables, and creating standardized names for manufacturers. Some more work needs to be done for the general payments table, but hopefully the scripts could be useful for people working on other parts of this dataset.
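As a rough illustration of the name-standardization step, here is a minimal sketch. It is not the code from this PR: the function name `standardize_name` and the `LEGAL_SUFFIXES` list are assumptions made up for the example, and a real suffix list would need to be checked against the data.

```python
import re

# Hypothetical list of legal suffixes to drop; not taken from the PR's scripts.
LEGAL_SUFFIXES = {"inc", "llc", "ltd", "corp", "corporation", "co", "company", "lp"}


def standardize_name(raw: str) -> str:
    """Return an upper-case key with punctuation and trailing legal suffixes removed."""
    # Lower-case, then turn every run of non-alphanumeric characters into a space.
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", raw.lower())
    tokens = cleaned.split()
    # Drop trailing suffixes such as "inc", but never empty the name completely.
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens).upper()
```

With this approach, spellings such as `"Pfizer Inc."` and `"PFIZER, INC"` collapse to the same key, which makes it easier to join tables on manufacturer.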

It could also be good to have a naming convention for companies that have changed names, been acquired, etc.
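One possible shape for such a convention is a hand-curated lookup from former names to a current canonical name. The sketch below is only an assumption about how that could work; the company names are placeholders, not entries from the dataset.

```python
# Placeholder names only; a real table would be curated by hand.
CANONICAL_NAME = {
    "OLDCO PHARMA": "NEWCO PHARMA",     # renamed
    "SMALLCO BIOTECH": "NEWCO PHARMA",  # acquired
}


def canonical_name(name: str) -> str:
    """Follow rename/acquisition links until a name with no further mapping is reached."""
    seen = set()
    # The `seen` set guards against accidental cycles in the table.
    while name in CANONICAL_NAME and name not in seen:
        seen.add(name)
        name = CANONICAL_NAME[name]
    return name
```

Names that are not in the table are returned unchanged, so the lookup can be applied to every row without special-casing.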

For some reason, the Date_of_Payment field in the 2015 data contains many more errors than in 2013 and 2014.
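A check along these lines could be used to count such errors per year. This is a sketch under assumptions: it supposes the dates are stored as `MM/DD/YYYY` strings and that a valid payment date falls within the program year of its file. The helper names `check_payment_date` and `count_date_problems` are made up for the example.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable, Optional


def check_payment_date(value: str, program_year: int,
                       fmt: str = "%m/%d/%Y") -> Optional[str]:
    """Return a short problem label, or None if the date looks fine."""
    try:
        parsed = datetime.strptime(value.strip(), fmt)
    except ValueError:
        return "unparseable"
    # Assumption: payments in a given file should be dated within that program year.
    if parsed.year != program_year:
        return "wrong year"
    return None


def count_date_problems(values: Iterable[str], program_year: int) -> Counter:
    """Tally problem labels over a column of Date_of_Payment strings."""
    problems = (check_payment_date(v, program_year) for v in values)
    return Counter(p for p in problems if p is not None)
```

Running `count_date_problems` on each year's Date_of_Payment column would make it possible to compare error counts across 2013, 2014, and 2015 directly.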

jenniferthompson commented 7 years ago

Thanks so much for this @sangxia! Would you mind adding some comments to make things a bit easier for our reviewers?

sangxia commented 7 years ago

Thanks for the comments, @mattgawarecki! I made some changes and added some comments.

mattgawarecki commented 7 years ago

@sangxia Sorry it's taken a while for me to get back to this. I'm trying to re-review it right now. 👍

mattgawarecki commented 7 years ago

@sangxia I think your changes helped quite a bit! Sorry it took so long (again) but I'll be merging this PR momentarily.

If it's okay, though, I might like to take a stab at refactoring the scripts to make them more newbie-friendly. Does that sound alright to you?

sangxia commented 7 years ago

@mattgawarecki No problem. Regarding refactoring, is there something specific I should pay attention to? I am working on some more code, so it would be good for me to keep that in mind. Thanks.

mattgawarecki commented 7 years ago

From a high level, I think smaller functions are always a good thing to strive for. Code structure and style can turn into a very philosophical discussion, but I think the best description I've heard is something like, "write like the person reading is completely new to what you're doing." It's a bit of an over-simplification, but I think it's been a useful way for me to evaluate my own code day-to-day.
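To make the "smaller functions" idea concrete, here is a minimal sketch of a row-cleaning step split into small, single-purpose helpers. It is an illustration only, not the refactoring that was actually done; the function names and the dictionary-per-row representation are assumptions.

```python
from typing import Dict

Row = Dict[str, str]


def strip_whitespace(row: Row) -> Row:
    """Trim leading and trailing whitespace from every value."""
    return {key: value.strip() for key, value in row.items()}


def drop_empty_fields(row: Row) -> Row:
    """Remove fields whose value is an empty string."""
    return {key: value for key, value in row.items() if value != ""}


def clean_row(row: Row) -> Row:
    """Apply each small cleaning step in order; each step can be read and tested alone."""
    return drop_empty_fields(strip_whitespace(row))
```

Each helper does one thing and has a one-line docstring, so a reader who is new to the dataset can follow `clean_row` without having to understand everything at once.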