18F / crime-data-explorer

Moved to https://github.com/fbi-cde
73 stars 20 forks source link

Weird NIBRS participation dip in 2015 #307

Closed harrisj closed 6 years ago

harrisj commented 6 years ago

So, for some reason, it seems like almost all agencies do not have a full 12 months of NIBRS reporting in 2015. This might be a bug or a data integrity issue or something else, but the effect is quite dramatic

year participating_agencies nibrs_participating_agencies
2016 19,296 7,581
2015 19,319 5
2014 18,660 6,960
2013 18,327 6,649
2012 18,607 6,723
2011 18,616 6,200
2010 18,237 6,021
2009 17,741 6,170
2008 17,402 5,501
2007 17,211 5,277
2006 16,585 4,987
2005 15,760 4,764
2004 15,903 4,812
2003 15,023 4,387
2002 14,792 3,701
2001 14,043 3,528
2000 13,542 3,195
1999 12,514 2,586
1998 12,209 2,352
1997 11,915 1,619
1996 12,266 1,216
1995 11,966 983
1994 12,610 983
1993 12,433 1,073
1992 12,883 816
1991 12,439 573
1990 13,739 0
harrisj commented 6 years ago

So this is how agencies reported NIBRS months in 2015

nibrs_months_reported count
1 1,017
2 537
3 378
4 253
5 218
6 206
7 149
8 125
9 107
10 112
11 109
12 5

And this is 2016

nibrs_months_reported count
1 45
2 42
3 42
4 31
5 48
6 36
7 39
8 48
9 50
10 76
11 131
12 6,331
harrisj commented 6 years ago

And zooming in further, here is the nibrs_month table for Hoover, AL in 2015

agency_id data_year month_num reported_status
61 2,015 1 I
61 2,015 2 I
61 2,015 3 I
61 2,015 4 I
61 2,015 5 I
61 2,015 6 I
61 2,015 7 I
61 2,015 8 U
61 2,015 9 U
61 2,015 10 I
61 2,015 11 U
61 2,015 12 U

I = incident report, U = unreported

harrisj commented 6 years ago

Hoover seems to have a complete set of nibrs_month records, but the larger issue seems to be an overall drop in the number of records in the nibrs_month table for that year

data_year count
2013 267,984
2014 269,412
2015 112,860
2016 271,728
LarryBafundo commented 6 years ago

@harrisj this seems like a separate issue from what we were discussing earlier - that the business line officially determines which states reports NIBRS based on the state of their respective programs. this should be reflected on the "how states report" map on the about page: https://crime-data-explorer.fr.cloud.gov/about.

But do you have a sense what is behind these dips in participation? Is it something we need to explore further

cc: @jpwentz

jpwentz commented 6 years ago

PER UCR:

select count(distinct RA.agency_id) from NIBRS_MONTH NM JOIN REF_AGENCY RA ON RA.AGENCY_ID = NM.AGENCY_ID WHERE NM.DATA_YEAR = 2015 and RA.AGENCY_STATUS = 'A' AND NM.REPORTED_STATUS IN ( 'I', 'Z') 6673 U 16576 I 6444 Z 1098

harrisj commented 6 years ago

@jpwentz can you rerun that so it just gives me a count of total nibrs_month records in 2015? The issue is not that agencies aren't in there at all, but many are missing months of representation in nibrs_month in 2015

jpwentz commented 6 years ago

I will request that from Justin -- I think CDE concept of participation is not accurate. It looks like that should be defined by the status in the REFA.csv.

LarryBafundo commented 6 years ago

@jpwentz can you elaborate on what you mean by that?

harrisj commented 6 years ago

Currently, we define participation by an agency in a given year as "reported for 12 months or was covered by an agency that reported for 12 months". We have participation metrics for RETA (using reta_month) and NIBRS (using nibrs_month). Any feedback would be welcome if there is a different way we should be doing that.

jpwentz commented 6 years ago

5954 12 months for I or Z coming from nibrs_month table in UCR

select COUNT(*) FROM ( select distinct RA.ORI from NIBRS_MONTH NM JOIN REF_AGENCY RA ON RA.AGENCY_ID = NM.AGENCY_ID WHERE NM.DATA_YEAR = 2015 and RA.AGENCY_STATUS = 'A' AND NM.REPORTED_STATUS IN ( 'I', 'Z') HAVING SUM(NM.MONTH_NUM) = 78 GROUP BY RA.ORI )

jpwentz commented 6 years ago

select distinct ra.ori from NIBRS_MONTH NM JOIN REF_AGENCY RA ON RA.AGENCY_ID = NM.AGENCY_ID WHERE NM.DATA_YEAR = XXXX and RA.AGENCY_STATUS = 'A' AND NM.REPORTED_STATUS IN ( 'I', 'Z')

2013 - 6423 2014 - 6628 2015 - 3216 2016 - 6919

We need to get an answer on how to determine this information -- We can use the above script and have it update the agency_participation table.

@harrisj your thoughts?

Running the same script in CDE: select COUNT(*) FROM ( select distinct RA.ORI from NIBRS_MONTH NM JOIN REF_AGENCY RA ON RA.AGENCY_ID = NM.AGENCY_ID WHERE NM.DATA_YEAR = 2015 and RA.AGENCY_STATUS = 'A' AND NM.REPORTED_STATUS IN ( 'I', 'Z') GROUP BY RA.ORI HAVING SUM(NM.MONTH_NUM) = 78 ) as foo

I get 5 !

jpwentz commented 6 years ago

2015 only has report status of U and I --- 2016 has report status of U I and Z

harrisj commented 6 years ago

Excellent. So we've confirmed the 2015 data is truncated in some fashion, although it is still to be determined if it's because the CSV was incomplete or there was an issue loading it

jpwentz commented 6 years ago

2015 NIBRS_M csv deliver to 18F contained NIBRS_M rows with report_status of Z -- but were not ingested.

GIT Repo history does not show this file changing but may not have existed in the repo until after the 2015 data was ingested. It will be difficult to determine the deltas from 2015 script to 2016 script.

I have a copy of the data and it can be re-ingested at any point.

jpwentz commented 6 years ago

Also note that 2015 Data has dates defined for rows of report_status Z and 2016 does not have those date columns populated -- I would assume the failure occurred here.

LarryBafundo commented 6 years ago

@jpwentz got it; if our immediate priorities are to ensure that the data displayed by the CDE is accurate should we talk to Cindy about prioritizing the reimport of the 2015 data relatively soon? do you need assistance from us to do that?

jpwentz commented 6 years ago

@LarryBafundo It will not be all of the 2015 data -- it will only be the tables that utilize NIBRS_MONTH. It would be helpful to have some insight on which tables those are -- and then assistance with creating the subset script for just ingesting that.

harrisj commented 6 years ago

AFAIK, this would mean that we reload those fields for 2015 in nibrs_month but that we then have to rebuild a few tables that depend on nibrs_participation:

I think the other tables are probably fine since some of them use the "current year" and I think that is 2016 which would be unchanged.

harrisj commented 6 years ago

So I kinda dropped the ball on following up on this one. Do we have a new file to try loading?

LarryBafundo commented 6 years ago

@jpwentz is this something you can take on? its a pretty visible issue on production.

LarryBafundo commented 6 years ago

@jpwentz want to catch-up on this?

Screen Shot 2017-11-27 at 4.23.31 PM.png

jpwentz commented 6 years ago

@LarryBafundo As mentioned previously we need to re-ingested the 2015 NIBRSM data. It looks like it failed during ingestion based upon my above comments. I will work with @harrisj on creating the scripts to just handle importing agencies.

LarryBafundo commented 6 years ago

thanks @jpwentz

harrisj commented 6 years ago

With the latest data from CJIS, this has been fixed

Year Agencies NIBRS Agencies
2016 19296 7420
2015 19319 7262
2014 18660 6960
2013 18327 6649
LarryBafundo commented 6 years ago

yay!

harrisj commented 6 years ago

To follow up, I have rebuilt the following downloads agencies.csv and ucr_participation.csv and uploaded to S3. We could potentially rebuild all the NIBRS bulk downloads and reupload, but not as urgent, since those don't cover 2015 and 2016 yet.