Data4Democracy / drug-spending

Project to understand pharmaceutical spending, currently focused on US government programs.

Year Over Year Increases. #42

Closed davidlibland closed 7 years ago

davidlibland commented 7 years ago

Added some Python notebooks to do some preliminary analysis of year-over-year increases. Also incorporated the FDA's NDC data to associate drugs with their pharmacological classes, and to aggregate spending and use counts across those classes. The steepest year-over-year increases are visualized both for individual drugs and across drug classes.
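For readers skimming the thread, here is a minimal sketch of the kind of year-over-year calculation described above. It is not the code from the notebooks: the record layout, the drug and class names, and the helper `yoy_increases` are all hypothetical, and the numbers are made up for illustration.

```python
from collections import defaultdict

# Hypothetical records: (drug_name, pharma_class, year, total_spending).
# These values are invented for illustration, not taken from the Part D data.
records = [
    ("drug_a", "class_x", 2013, 100.0),
    ("drug_a", "class_x", 2014, 150.0),
    ("drug_b", "class_x", 2013, 200.0),
    ("drug_b", "class_x", 2014, 210.0),
    ("drug_c", "class_y", 2013, 50.0),
    ("drug_c", "class_y", 2014, 40.0),
]


def yoy_increases(rows, key_index):
    """Return {(key, year): fractional change versus the previous year}.

    key_index selects the grouping column: 0 for drug, 1 for pharma class.
    """
    # Sum spending per (key, year) so several rows can share a group.
    totals = defaultdict(float)
    for row in rows:
        totals[(row[key_index], row[2])] += row[3]

    changes = {}
    for (key, year), amount in totals.items():
        previous = totals.get((key, year - 1))
        if previous:  # skip groups with a missing or zero baseline year
            changes[(key, year)] = (amount - previous) / previous
    return changes


by_drug = yoy_increases(records, key_index=0)
by_class = yoy_increases(records, key_index=1)

# Rank individual drugs by steepest increase first.
steepest = sorted(by_drug.items(), key=lambda item: item[1], reverse=True)
```

Grouping by column index keeps the same function usable for both the per-drug and per-class views; a pandas `groupby` would be the more natural choice inside a notebook.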

davidlibland commented 7 years ago

Hi @dhuppenkothen, thanks for your feedback.

> We might want to hold off merging this until the new repo structure is in place so that all the bits here can go into their proper place.

I think that makes a lot of sense.

> In Part_D_with_uses.ipynb, where does the NDC data come from? It's being loaded from disk in the notebook: could that be made a query to wherever the data came from (e.g. data.world)?

I didn't have access to data.world at the time, but I've uploaded it now, so the new code will download it from there.

> More curiosity than anything: how many drugs did you manage to identify with their NDC pharma classes? All of the Part D ones or a subset? I'd be curious how big that subset is …

There are about 2500 left to be matched to their NDC classes, but those account for less than 20% of the spending… I think we could match most of them if I clean the data better, or improve the matching algorithm. But another problem is that there are a lot of NDC pharma classes, and it would be better to put the drugs into larger therapeutic use groups. Having the NDC classes around might help with that, though, since it might help us to improve the matching algorithm (if we don't directly know which therapeutic use groups a drug fits in, but we know which pharma class it fits in, and all other drugs in that class are used to treat high blood pressure, then it probably has the same use).

> I think our policy is not to have any data in the repo (it makes them large and unwieldy), so it might be best to remove the things in ./data/ and move them to data.world.

> The idea of using machine learning for the clustering of names is cool, even if it doesn't seem to work. What I'd do is just print out the top 100 words + occurrences to the screen, and manually look at them. Then add things like "mg" and so on to the stop words. But you might be right: based on what I saw in the earlier DataFrames in your notebooks, the terms may be too specialized to cluster well.

> On a similar note, in the folder ./cms/ I was playing around with another set of definitions for drug usage. They might be useful here, too?

Yes, I think we should incorporate as many definitions as possible (since I doubt any will be complete), and then get a graph relating drugs to various uses...

> I'm not entirely sure how to read the rainbow coloured plots in exploration_of_plan_b_yr_to_yr_increases.ipynb. Maybe having another sentence explaining what each colour represents might be useful?

Probably I should use a stacked area plot instead: the colors are just different years…
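As an aside, the word-count suggestion quoted above (print the most frequent words, inspect them by hand, then add things like "mg" to the stop words) could look roughly like the sketch below. The sample names and the helper `top_words` are hypothetical; the notebooks may tokenize differently.

```python
import re
from collections import Counter

# Hypothetical drug descriptions standing in for the Part D / NDC names.
names = [
    "Lisinopril 10 mg tablet",
    "Lisinopril 20 mg tablet",
    "Metformin HCl 500 mg tablet",
    "Insulin glargine 100 units/ml injection",
]


def top_words(texts, n=100, stop_words=frozenset()):
    """Return the n most common alphabetic tokens, skipping stop words."""
    counts = Counter()
    for text in texts:
        # Lowercase and keep runs of letters only, which drops dosages like "10".
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in stop_words:
                counts[token] += 1
    return counts.most_common(n)


# First pass: print everything and look at it by hand.
for word, count in top_words(names):
    print(f"{word:12s} {count}")

# Second pass: exclude the uninformative units and dosage forms spotted above.
filtered = top_words(names, stop_words={"mg", "tablet"})
```

The stop-word set is meant to be grown iteratively: rerun, look at the new top of the list, and add whatever still looks like packaging or dosage noise.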

Here's a slightly better view (except for the legend, and it's still too cluttered):

[image]

'Unknown' refers to drugs which have not been matched to pharma classes, while 'Other' refers to drugs which have been matched (but were aggregated to reduce clutter).

- david

mattgawarecki commented 7 years ago

@davidlibland What about possibly just a bar graph, if we're concentrating on just the top few categories?

EDIT: Derp, it's a longitudinal comparison. In that case, how about a line graph? Areas tend to confuse me.

mattgawarecki commented 7 years ago

@davidlibland Looks like you've still got a drugs_w_lrg_yr-yr_increases directory in the repository root. Can you make sure everything's moved out, then delete that directory from your repo, commit, and push again?

UPDATE: Let's also get rid of the cms directory since that data's coming from data.world.

davidlibland commented 7 years ago

Hi @mattgawarecki, thanks. I just moved the files to python/notebooks/drugs_w_lrg_yr-yr_increases/. The part_d_with_uses notebook pulls its data from data.world, but it still writes to a local database (ignored in .gitignore). It just tries to correlate the drugs with their uses (according to the FDA NDC database), but the result is incomplete, so I don't want to put the output on data.world.

mattgawarecki commented 7 years ago

Looks like at points I was actually commenting on things you were already working on fixing -- apologies for that :-)

Anyway though, I think your updates to match the new structure are just about everything we need to get it merged in. I'll do one last run-through tonight -- though if anybody reading this wants to beat me to it, you're more than welcome -- and we should have it merged in soon.