18F / crime-data-explorer

Moved to https://github.com/fbi-cde

Explore value & feasibility of making bulk downloads time series #321

Closed LarryBafundo closed 6 years ago

LarryBafundo commented 6 years ago

We heard that users find a single year's view of the data to be limiting and value a historical view to facilitate analysis. Having a file that aggregates all of the available incident data for a given state would also make this data easier to work with.

jeremiak commented 6 years ago

Started to pull all West Virginia NIBRS downloads together from 2008 through 2016 here: https://github.com/18F/crime-data-prototypes/tree/master/demos/multi-year-nibrs
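A rough sketch of what pulling the single-year downloads together could look like, assuming each year's export is a CSV with the same header. The column names (`incident_id`, `offense_family`, `data_year`) are hypothetical stand-ins, not the actual NIBRS download schema, and the in-memory strings only stand in for real files:

```python
import csv
import io

def combine_years(sources):
    """Yield rows from several single-year CSV sources, tagging each with its year.

    `sources` maps a year (int) to an open text file or file-like object.
    The `data_year` column is added here; it is an assumed name.
    """
    for year in sorted(sources):
        for row in csv.DictReader(sources[year]):
            row["data_year"] = str(year)
            yield row

# Tiny in-memory stand-ins for two single-year downloads (hypothetical data).
sample = {
    2009: io.StringIO("incident_id,offense_family\n3,Robbery\n"),
    2008: io.StringIO("incident_id,offense_family\n1,Arson\n2,Robbery\n"),
}
combined = list(combine_years(sample))
```

Because rows are yielded one at a time, the same approach would let a combined file be written out without holding every year in memory at once.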

harrisj commented 6 years ago

Yeah, it gets large. Since 2006, there have been 743,637 incidents in NIBRS, with 823,195 offenders, for instance. I feel like for a state like California or Texas it would be 2-3x as large.

harrisj commented 6 years ago

We could also continue segmenting NIBRS by offense family if individual files/datasets get too large. Here are the NIBRS offense families:

Offense Family
Arson
Assault Offenses
Burglary/Breaking & Entering
Counterfeiting/Forgery
Destruction/Damage/Vandalism of Property
Drug/Narcotic Offenses
Embezzlement
Extortion/Blackmail
Fraud Offenses
Gambling Offenses
Homicide Offenses
Kidnapping/Abduction
Larceny/Theft Offenses
Motor Vehicle Theft
Pornography/Obscene Material
Prostitution Offenses
Robbery
Sex Offenses
Stolen Property Offenses
Weapon Law Violations

But that would mean having many files of smaller size rather than a few of bigger ones
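To make the trade-off concrete, here is a minimal sketch of segmenting offense rows by family, with one group per eventual file. The row layout and the slug helper are assumptions for illustration; family names contain characters like `/` and `&`, so they would need cleaning before being used as filenames:

```python
import re
from collections import defaultdict

def family_slug(name):
    """Turn an offense family name into a filename-safe slug."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")

def partition_by_family(rows):
    """Group offense rows by family; each group would become one file."""
    groups = defaultdict(list)
    for row in rows:
        groups[family_slug(row["offense_family"])].append(row)
    return dict(groups)

# Hypothetical rows; real downloads would carry many more columns.
rows = [
    {"incident_id": "1", "offense_family": "Burglary/Breaking & Entering"},
    {"incident_id": "2", "offense_family": "Robbery"},
    {"incident_id": "3", "offense_family": "Robbery"},
]
parts = partition_by_family(rows)
```

With the 20 families listed above, this would produce up to 20 files per state, or per state and year if the single-year split is kept.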

LarryBafundo commented 6 years ago

thanks; let's see what we can learn from the new format and testing this week and then we can explore other ways of making it available in the future. i think you're right that trying to do everything in one file without some kind of partitioning isn't going to be sustainable, so maybe we do what you're suggesting instead.

LarryBafundo commented 6 years ago

some additional questions to explore if we still want to move in this direction.

--how big is too big when it comes to file size? are there clear limits to what our users can download and work with? how might we test this?
--if we want to reduce file size by breaking one large file into smaller, more manageable pieces, what partitioning strategy makes the most sense (e.g. crimes against persons vs. offense type)?
--how might our partitioning strategy affect the generation and maintainability of these files?
--would partitioned files be harder to work with or increase the likelihood of miscounts and user error? how might we test this?
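One way the miscount question could be tested, sketched under the assumption that a single incident can carry offenses from more than one family, so the same incident would show up in more than one partitioned file. The column names and sample rows are hypothetical:

```python
from collections import defaultdict

# Hypothetical offense rows: incident 1 involves two offense families.
offense_rows = [
    {"incident_id": "1", "offense_family": "Robbery"},
    {"incident_id": "1", "offense_family": "Assault Offenses"},
    {"incident_id": "2", "offense_family": "Arson"},
]

def incidents_per_family(rows):
    """Collect the distinct incident ids that would land in each family file."""
    per_family = defaultdict(set)
    for row in rows:
        per_family[row["offense_family"]].add(row["incident_id"])
    return per_family

per_family = incidents_per_family(offense_rows)

# Adding up per-file incident counts double-counts incident 1.
naive_total = sum(len(ids) for ids in per_family.values())

# Counting distinct ids across all files gives the true number of incidents.
distinct_total = len(set().union(*per_family.values()))
```

Here `naive_total` is 3 while `distinct_total` is 2. A gap like that on real data would be a signal that users summing across partitioned files are likely to overcount.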

Will move this issue to the backlog for now, as we first need to get the content right before we consider how to package it.

cc: @harrisj, @jeremiak

LarryBafundo commented 6 years ago

this is an ongoing question that is somewhat dependent on the following:

https://waffle.io/18F/crime-data-explorer/cards/5a1ca85f7fc9aa0121da7a6b

Before we decide on how we want to package this information (single year or time series), we need to figure out how we should be working with/counting this data in the first place. Then we need to weigh the benefits of providing the data in a way that promotes flexibility and the value of NIBRS against the cost of potentially passing on complexity to our consumers.

LarryBafundo commented 6 years ago

we still need to explore this, both in terms of the temporary and long-term solutions.

LarryBafundo commented 6 years ago

in the interest of getting a short-term fix ready ASAP, we're going to go with a fully normalized, single-year approach for now. we should consider the feasibility of a time series & denormalized approach in the future. closing for now.
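For reference, a small sketch of the difference between the two approaches, assuming the normalized download keeps incidents and offenses in separate tables linked by an incident id. All table and column names here are hypothetical:

```python
# Normalized: one table per entity, linked by incident_id (assumed key).
incidents = [
    {"incident_id": "1", "data_year": "2016", "agency": "Agency A"},
]
offenses = [
    {"incident_id": "1", "offense_family": "Robbery"},
    {"incident_id": "1", "offense_family": "Assault Offenses"},
]

def denormalize(incidents, offenses):
    """Join each offense row with its parent incident into one flat row."""
    by_id = {row["incident_id"]: row for row in incidents}
    return [{**by_id[o["incident_id"]], **o} for o in offenses]

flat = denormalize(incidents, offenses)
```

The flat version repeats the incident columns on every offense row, which makes it easier to open in a spreadsheet but larger on disk. The normalized version stays compact and leaves the join to the consumer.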