GreenInfo-Network / nyc-crash-mapper-etl-script

Extract, Transform, and Load script for fetching new data from the NYC Open Data Portal's vehicle collision data and loading into the NYC Crash Mapper table on CARTO.
3 stars 0 forks source link

ETL adjustment or offline task, to find & rectify altered crash records? #13

Closed gregallensworth closed 6 years ago

gregallensworth commented 6 years ago

Investigations in #12 have confirmed that records at SODA are being modified after the fact, sometimes as much as 3 months after the fact In particular, a crash having an injury "converted" into a fatality is causing variances in injuries (2-3 out of several thousand, per month; acceptable) and in fatalities (2-3 out of 10-15; significant)

Can a mechanism exist, to detect crashes which have been altered potentially 3 months after the fact, and update the CARTO record?

The sheer volume of data records, exceeds what can feasibly be done with SODA and Socrata using a pure brute force mechanism. The updatedat and created_at hidden system fields at Socrata may provide some mechanism for filtering for altered* records (entered_at does not equal updated_at) within a certain timeframe (last 3 months).

gregallensworth commented 6 years ago

A survey of some injury to fatality "conversion" crashes gives the following information, relevant to date spans:

https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=:*,%20*&$where=unique_key%3D3649309
entered 5 days after crash date
updated almost a month later, changing 1 injury to a fatality

https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=:*,%20*&$where=unique_key%3D3624204
entered 6 days after crash date
updated a month later, changing 1 injury to a fatality

https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=:*,%20*&$where=unique_key%3D3626043
entered 4 days after crash date
updated 16 days later, changing 1 injury to a fatality

https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=:*,%20*&$where=unique_key%3D3628393
entered 4 days after crash date
updated 6 weeks later, changing 1 injury to a fatality

https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=:*,%20*&$where=unique_key%3D3630562
entered 4 days after crash date
updated 5 weeks later, changing 1 injury to a fatality

https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=:*,%20*&$where=unique_key%3D3635027
entered 4 days after crash date
updated 2 weeks later, changing 1 injury to a fatality

https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=:*,%20*&$where=unique_key%3D3645566
entered 5 days after crash date
updated 5 weeks later, changing 1 injury to a fatality

https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=:*,%20*&$where=unique_key%3D3649309
entered 5 days after crash date
updated 4 weeks later, changing 1 injury to a fatality

https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=:*,%20*&$where=unique_key%3D3649733
entered 3 days after crash date
updated 2 weeks later, changing 1 injury to a fatality

https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=:*,%20*&$where=unique_key%3D3655558
entered 3 days after crash date
updated 2 weeks later, changing 1 injury to a fatality

https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=:*,%20*&$where=unique_key%3D3655948
entered 3 days after crash date
updated 2 weeks later, changing 1 injury to a fatality
gregallensworth commented 6 years ago

Notes in regards to Socrata querying:

Tentative conclusions:

gregallensworth commented 6 years ago

New ETL script function find_updated_killcounts()

On today's run it looks back 7 days, to 2018-02-27. This finds 63 records updated, of which 8 result in a tally change.

This should keep us in sync moving forward.

danrademacher commented 6 years ago

Let's change "looks back 7 days" to 30 days.

Looks like a single change here: https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/blob/6ae34c32e54fa9eac5259241c49b3810112c0909/main.py#L515

gregallensworth commented 6 years ago

Done.

danrademacher commented 6 years ago

Client reports that our fatality counts are off for January -- we say 8, they say 12. I confirmed that with details of which CARTO records updated in this Google sheet.

these four records have new fatalities:

SOCRATA ID CARTODB ID Socrata_Injured Socrata_killed CARTO_injured CARTO_killed
3822943 2172685 0 1 1 0
3829376 2201428 2 1 3 0
3832804 2203102 0 1 1 0
3833536 2206149 0 1 1 0

An easy way to address this would be to just ramp up the look back date a lot farther -- to 90.

Though it could be that our current 30-day look back would have been enough if in place in early February. Earlier QA that I did focused on Jan-2017-Dec-2017, so I hadn't assessed Jan 2018 accuracy

But to me most sensible and easy is to turn it up to 90, let it run for another month, and check again.

gregallensworth commented 6 years ago

This change was made last night, and did indeed catch a few records edited back in January which have since been updated. In theory, having gone back that far to catch up, tomorrow's updates should be seen the day after tomorrow, so going back 90 days achieves nothing additional. Still, this is not troublesome to leave in place.

Keeping this open so we can look into it in another month and then again after a second month, to determine how well we are keeping in sync.

gregallensworth commented 6 years ago

This is looking great!