Closed gregallensworth closed 6 years ago
A survey of some injury to fatality "conversion" crashes gives the following information, relevant to date spans:
https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=:*,%20*&$where=unique_key%3D3649309
entered 5 days after crash date
updated almost a month later, changing 1 injury to a fatality
https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=:*,%20*&$where=unique_key%3D3624204
entered 6 days after crash date
updated a month later, changing 1 injury to a fatality
https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=:*,%20*&$where=unique_key%3D3626043
entered 4 days after crash date
updated 16 days later, changing 1 injury to a fatality
https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=:*,%20*&$where=unique_key%3D3628393
entered 4 days after crash date
updated 6 weeks later, changing 1 injury to a fatality
https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=:*,%20*&$where=unique_key%3D3630562
entered 4 days after crash date
updated 5 weeks later, changing 1 injury to a fatality
https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=:*,%20*&$where=unique_key%3D3635027
entered 4 days after crash date
updated 2 weeks later, changing 1 injury to a fatality
https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=:*,%20*&$where=unique_key%3D3645566
entered 5 days after crash date
updated 5 weeks later, changing 1 injury to a fatality
https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=:*,%20*&$where=unique_key%3D3649309
entered 5 days after crash date
updated 4 weeks later, changing 1 injury to a fatality
https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=:*,%20*&$where=unique_key%3D3649733
entered 3 days after crash date
updated 2 weeks later, changing 1 injury to a fatality
https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=:*,%20*&$where=unique_key%3D3655558
entered 3 days after crash date
updated 2 weeks later, changing 1 injury to a fatality
https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=:*,%20*&$where=unique_key%3D3655948
entered 3 days after crash date
updated 2 weeks later, changing 1 injury to a fatality
Notes regarding Socrata querying:
`:updated_at > :created_at` would of course need to be combined with a date filter, e.g. `AND :updated_at > '2017-01-01' AND :updated_at < '2017-02-01'`, so as to not grab an ever-accumulating set of records.
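The windowed filter described above could be assembled along these lines. This is only a sketch: `updated_window_where` is a hypothetical helper name, not something from the ETL script, and passing the dates as quoted ISO-8601 string literals is an assumption about what SoQL accepts for timestamp comparisons.

```python
from datetime import date


def updated_window_where(start: date, end: date) -> str:
    """Build a SoQL $where clause: edited after creation, and edited inside (start, end)."""
    if start >= end:
        raise ValueError("start must be earlier than end")
    return (
        ":updated_at > :created_at"
        f" AND :updated_at > '{start.isoformat()}'"
        f" AND :updated_at < '{end.isoformat()}'"
    )


# Example window taken from the note above: January 2017.
where = updated_window_where(date(2017, 1, 1), date(2017, 2, 1))
print(where)
```

The resulting string would be passed as the `$where` parameter of the request, alongside `$select=:*, *` so that the system fields are returned.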
`:updated_at > :created_at` seems to return nearly all records, as most records have update times seconds or days after the creation time. This could be due to NYPD workflow: save details, add a few more and save those details, repeat. Additionally, 3883 records are showing as updated since March 1 (today is March 5) despite there being no new crashes (see next note). Of these 3883 records:
At present (March 5 2018) there are no new crash records since February 27. The most recent crash is dated 2018-02-27 with a `created_at` of 2018-03-03 and an `updated_at` of about 3 seconds later. The next-most-recent is much the same: Feb 27, logged and then edited on March 3.
The docs state that these system fields are returned as the Fixed Timestamp type, whereas `date_trunc_ymd()`, used to standardize on the date component, returns a Floating Timestamp. No documentation exists for the Fixed Timestamp data type beyond noting that it includes milliseconds and timezone (Z / UTC).
As of today (March 5), 15932 records have been updated within the last month (`updated_at >= '2018-02-05'`), of which only 182 were updated on a later date than they were created. Within the last 7 days (`updated_at >= '2018-02-26'`), 4399 records have been updated and only 49 were updated on a later date than they were created.
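The "updated on a later date" test can also be done client-side, which sidesteps the Fixed versus Floating Timestamp question. A minimal sketch, assuming every system-field value carries fractional seconds and a trailing `Z` as the docs describe; the helper names and the two sample records are made up for illustration.

```python
from datetime import datetime, timezone


def parse_fixed_timestamp(value: str) -> datetime:
    """Parse a value such as '2018-03-03T14:05:09.123Z' into an aware UTC datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def updated_on_later_date(record: dict) -> bool:
    """True only if the edit fell on a later calendar day (UTC) than the creation."""
    created = parse_fixed_timestamp(record[":created_at"])
    updated = parse_fixed_timestamp(record[":updated_at"])
    return updated.date() > created.date()


# Made-up records: one re-saved seconds after entry, one edited weeks later.
same_day = {":created_at": "2018-03-03T14:05:09.123Z", ":updated_at": "2018-03-03T14:05:12.456Z"}
weeks_later = {":created_at": "2018-01-09T10:00:00.000Z", ":updated_at": "2018-02-06T08:30:00.000Z"}

print(updated_on_later_date(same_day))
print(updated_on_later_date(weeks_later))
```

Comparing calendar dates rather than full timestamps is what discards the "saved again a few seconds later" records that make a plain `:updated_at > :created_at` match nearly everything.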
Tentative conclusions:
`edited-date > created-date` brings down the number of records dramatically, as expected.
New ETL script function `find_updated_killcounts()`
On today's run it looks back 7 days, to 2018-02-27. This finds 63 records updated, of which 8 result in a tally change.
This should keep us in sync moving forward.
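For reference, the rolling look-back boils down to a cutoff date computed from the run date. The sketch below is not the actual code in `main.py`; `LOOKBACK_DAYS` and `lookback_where` are hypothetical names, and the run date of 2018-03-06 is inferred from "looks back 7 days, to 2018-02-27".

```python
from datetime import date, timedelta

LOOKBACK_DAYS = 7  # hypothetical constant; the one number to change to widen the window


def lookback_where(today: date, days: int = LOOKBACK_DAYS) -> str:
    """Build a SoQL filter for records updated on or after (today - days)."""
    cutoff = today - timedelta(days=days)
    return f":updated_at >= '{cutoff.isoformat()}'"


# Run date inferred from the comment above (an assumption).
clause = lookback_where(date(2018, 3, 6))
print(clause)
```

Keeping the window as a single constant means widening it is a one-line change.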
Let's change "looks back 7 days" to 30 days.
Looks like a single change here: https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/blob/6ae34c32e54fa9eac5259241c49b3810112c0909/main.py#L515
Done.
Client reports that our fatality counts are off for January -- we say 8, they say 12. I confirmed that, with details of which CARTO records were updated in this Google sheet.
These four records have new fatalities:
SOCRATA ID | CARTODB ID | Socrata_Injured | Socrata_killed | CARTO_injured | CARTO_killed |
---|---|---|---|---|---|
3822943 | 2172685 | 0 | 1 | 1 | 0 |
3829376 | 2201428 | 2 | 1 | 3 | 0 |
3832804 | 2203102 | 0 | 1 | 1 | 0 |
3833536 | 2206149 | 0 | 1 | 1 | 0 |
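The comparison in the table above can be expressed as a small diff between the two sources. This is a sketch only: the dict keys (`socrata_id`, `cartodb_id`, `injured`, `killed`) and the function name are illustrative assumptions, not the real Socrata or CARTO column names; the numbers are the four rows from the table.

```python
def find_tally_changes(socrata_rows, carto_rows):
    """Return one entry per record whose injured/killed counts differ between sources."""
    carto_by_id = {row["socrata_id"]: row for row in carto_rows}
    changes = []
    for fresh in socrata_rows:
        stale = carto_by_id.get(fresh["socrata_id"])
        if stale is None:
            continue  # not in CARTO yet: an ordinary insert, not a tally change
        injured_delta = fresh["injured"] - stale["injured"]
        killed_delta = fresh["killed"] - stale["killed"]
        if injured_delta or killed_delta:
            changes.append({
                "socrata_id": fresh["socrata_id"],
                "cartodb_id": stale["cartodb_id"],
                "injured_delta": injured_delta,
                "killed_delta": killed_delta,
            })
    return changes


socrata_rows = [
    {"socrata_id": 3822943, "injured": 0, "killed": 1},
    {"socrata_id": 3829376, "injured": 2, "killed": 1},
    {"socrata_id": 3832804, "injured": 0, "killed": 1},
    {"socrata_id": 3833536, "injured": 0, "killed": 1},
]
carto_rows = [
    {"socrata_id": 3822943, "cartodb_id": 2172685, "injured": 1, "killed": 0},
    {"socrata_id": 3829376, "cartodb_id": 2201428, "injured": 3, "killed": 0},
    {"socrata_id": 3832804, "cartodb_id": 2203102, "injured": 1, "killed": 0},
    {"socrata_id": 3833536, "cartodb_id": 2206149, "injured": 1, "killed": 0},
]

changes = find_tally_changes(socrata_rows, carto_rows)
missing_fatalities = sum(change["killed_delta"] for change in changes)
print(len(changes), missing_fatalities)
```

Each of the four rows moves one person from injured to killed, which accounts for the gap between our 8 and the client's 12.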
An easy way to address this would be to just push the look-back window a lot farther, to 90 days.
Though it could be that our current 30-day look-back would have been enough had it been in place in early February. Earlier QA that I did focused on Jan 2017 through Dec 2017, so I hadn't assessed Jan 2018 accuracy.
But to me the most sensible and easy option is to turn it up to 90, let it run for another month, and check again.
This change was made last night, and did indeed catch a few records edited back in January which have since been updated. In theory, having gone back that far to catch up, tomorrow's updates should be seen the day after tomorrow, so going back 90 days achieves nothing additional. Still, this is not troublesome to leave in place.
Keeping this open so we can look into it in another month and then again after a second month, to determine how well we are keeping in sync.
This is looking great!
Investigations in #12 have confirmed that records at SODA are being modified after the fact, sometimes as much as 3 months after the fact. In particular, a crash having an injury "converted" into a fatality is causing variances in injuries (2-3 out of several thousand, per month; acceptable) and in fatalities (2-3 out of 10-15; significant).
Can a mechanism exist to detect crashes which have been altered, potentially 3 months after the fact, and update the CARTO record?
The sheer volume of data records exceeds what can feasibly be done with SODA and Socrata using a pure brute-force mechanism. The `updated_at` and `created_at` hidden system fields at Socrata may provide some mechanism for filtering for *altered* records (`created_at` does not equal `updated_at`) within a certain timeframe (last 3 months).