codeforboston / clean-slate-data

Rerun PA as a Proxy Analysis with 2018 Data #31

Closed: ghost closed this issue 4 years ago

ghost commented 4 years ago

We ran the initial analysis with 2014 data. Does the similarity between PA & MA still hold with 2018 data?

2018 data: https://ucr.fbi.gov/crime-in-the-u.s/2018/crime-in-the-u.s.-2018/tables/table-69

This is a child of #6

Definition of Done: we know whether PA is still a valid proxy for MA.

mikemahoney218 commented 4 years ago

Blocker for #23, #24, #25, #12, #11, #10, #9, #8, #7, #6. Highest priority.

mikemahoney218 commented 4 years ago

I've started looking into this (as a much-needed break from GRE studying :sweat_smile:) and cleaned the data somewhat (committed to the feature branch in f66c981). I think my next steps are to (a) repeat the data-expungement analysis and (b) see whether we can go more in depth in this analysis work.

mikemahoney218 commented 4 years ago

What I've found so far is neither a clear justification for using the PA data nor a clear reason to reject it.

Repeating the earlier analyses (following data-expungement as a framework), I'm finding that PA is consistently around the 80th percentile for similarity to MA -- that is, there are ~40 states that are worse fits as a MA proxy, but also ~10 states that are better fits. In particular, both New Jersey and Kentucky appear to be closer fits for MA, whether or not we scale our variables before calculating Euclidean distance (check out this pdf for code outputs about this).
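For readers following along, the comparison described above can be sketched roughly as follows. This is a minimal illustration, not the code from the feature branch: the function names `zscale` and `rank_by_distance` are hypothetical, and the numbers are invented placeholders rather than FBI figures. It ranks states by Euclidean distance to MA, once on raw rates and once after z-scaling each variable.

```python
import math
import statistics


def zscale(rates):
    """Z-scale each variable (column) across states: (x - mean) / sd."""
    states = list(rates)
    columns = list(zip(*(rates[s] for s in states)))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]
    return {
        s: [(x - m) / sd for x, m, sd in zip(rates[s], means, sds)]
        for s in states
    }


def rank_by_distance(rates, target="MA", scale=False):
    """Return the non-target states ordered from nearest to farthest."""
    data = zscale(rates) if scale else rates
    dists = {
        s: math.dist(v, data[target]) for s, v in data.items() if s != target
    }
    return sorted(dists, key=dists.get)


# Invented placeholder values (NOT FBI figures): two arrest-rate variables
# per state, on very different numeric scales.
toy = {
    "MA": [100.0, 10.0],
    "PA": [130.0, 11.0],
    "NJ": [105.0, 14.0],
    "KY": [110.0, 10.5],
}

print(rank_by_distance(toy))              # ['NJ', 'KY', 'PA']
print(rank_by_distance(toy, scale=True))  # ['KY', 'NJ', 'PA']
```

In this toy data the order of NJ and KY flips once the variables are scaled, because the unscaled distance is dominated by the variable with the larger numeric range. That is why it matters that the result reported above holds both with and without scaling.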

This leaves me, personally, not entirely sure where to go from here. I'm going to try and spend a little more time quantifying this (via KNN & RF models) before next Tuesday, but I think this wasn't the clear result (in any direction) that we were hoping for.

mikemahoney218 commented 4 years ago

So on a call with Sheldon and Sana last night, we made the following decisions:

Each of these will enable us to become more comfortable using the PA data (or alternatively, more comfortable in rejecting it). If it's the latter, we should also check how easy it is to scrape NJ/KY data -- if there's an easily available better alternative, let's go for that instead.

mikemahoney218 commented 4 years ago

From #52: Adding code for #31 (I believe this closes the issue, but I don't want to make that claim -- and if we decide we want to do more work on it, I think this PR should merge separately).

Summary: When looking at FBI data from 2014-2018, PA is typically the 13th-14th closest state to MA in terms of arrest rates, whether we look at raw arrest rates per capita or z-scale them.
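As a rough illustration of how a per-year rank like this could be computed and read, here is a minimal sketch. The helper names `proxy_rank` and `share_of_worse_fits` are hypothetical, the values are invented placeholders rather than FBI figures, and the count of 49 comparison states used below is an assumption, not a number taken from the analysis.

```python
import math


def proxy_rank(rates, candidate="PA", target="MA"):
    """1-based rank of `candidate` among non-target states, nearest first."""
    dists = {
        s: math.dist(v, rates[target]) for s, v in rates.items() if s != target
    }
    ordered = sorted(dists, key=dists.get)
    return ordered.index(candidate) + 1


def share_of_worse_fits(rank, n_candidates):
    """Fraction of candidate states farther from the target than this rank."""
    return (n_candidates - rank) / n_candidates


# Invented placeholder values (NOT FBI figures), keyed by year, then state.
toy_by_year = {
    2014: {"MA": [100.0, 10.0], "PA": [130.0, 11.0],
           "NJ": [105.0, 14.0], "KY": [110.0, 10.5]},
    2018: {"MA": [90.0, 9.0], "PA": [95.0, 9.5],
           "NJ": [100.0, 12.0], "KY": [120.0, 9.0]},
}

ranks = {year: proxy_rank(rates) for year, rates in toy_by_year.items()}
print(ranks)  # {2014: 3, 2018: 1}

# Assuming 49 comparison states, a rank of 13 leaves 36 worse fits.
print(round(share_of_worse_fits(13, 49), 2))  # 0.73
```

Under that assumed count, a 13th or 14th place rank means roughly 71-73% of the comparison states are worse fits than PA.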

Next steps: I think we need to decide if this satisfactorily addresses #31 or if we want more work in that vein.

- If we want to spend more time deciding whether PA is worth pursuing as a proxy, I think we could easily do more clustering with this dataset -- we've only looked at Euclidean distance so far, which is only one measure of distance; running kNN or similar might turn up something interesting.
- If we decide we're done with #31, we then need to decide whether we're confident pressing on with the PA data or want to examine other options.
    - If we're confident pressing on with PA, #23, #8, and #9 are natural next steps.
    - If not, we should spend a little time looking into data availability for other states.
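To make the point about Euclidean distance being only one measure concrete, here is a small sketch of a nearest-neighbour lookup under a few alternative metrics. It is an assumption about what such a follow-up might look like, not the analysis from the PR: it is a plain nearest-neighbour query rather than a fitted kNN model, `nearest_states` and `METRICS` are hypothetical names, and the values are invented placeholders rather than FBI figures.

```python
import math

# Three common distance measures between equal-length numeric vectors.
METRICS = {
    "euclidean": math.dist,
    "manhattan": lambda a, b: sum(abs(x - y) for x, y in zip(a, b)),
    "chebyshev": lambda a, b: max(abs(x - y) for x, y in zip(a, b)),
}


def nearest_states(rates, target="MA", k=2, metric="euclidean"):
    """Return the k states closest to `target` under the chosen metric."""
    dist = METRICS[metric]
    others = [s for s in rates if s != target]
    return sorted(others, key=lambda s: dist(rates[s], rates[target]))[:k]


# Invented placeholder values (NOT FBI figures).
toy = {
    "MA": [100.0, 10.0],
    "PA": [104.0, 14.0],
    "NJ": [105.5, 10.0],
    "KY": [103.0, 16.0],
}

for name in METRICS:
    print(name, nearest_states(toy, k=1, metric=name))
# euclidean ['NJ']
# manhattan ['NJ']
# chebyshev ['PA']
```

In this toy example the single nearest state changes with the metric, so a rank that stays stable across several distance measures would be stronger evidence than a rank from Euclidean distance alone.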

mikemahoney218 commented 4 years ago

Closing this issue in favor of #58