GreenInfo-Network / nyc-crash-mapper-etl-script

Extract, Transform, and Load script for fetching new data from the NYC Open Data Portal's vehicle collision data and loading into the NYC Crash Mapper table on CARTO.
3 stars 0 forks source link

Check SODA back 1 month, filter against already-existing records #6

Closed gregallensworth closed 6 years ago

gregallensworth commented 6 years ago

Relevant to https://github.com/GreenInfo-Network/nyc-crash-mapper/issues/67

The current approach of filtering by the highest date already fetched, would miss any record "backlogged" after its actual crash date.

Realistically, this is observed to happen. Between Dec 12 2017 and Jan 8 2018, 214 crash records were not seen (out of 16872 records, so 1.27%).

A smarter but more complex system could be introduced:

More specific code-level changes:

gregallensworth commented 6 years ago

Results here are quite good.

Of the 214 missing crash records for Dec 12 2017 - Jan 8 2018, 211 were found and loaded with no surprises.

The other 3 are explicable anomalies, beyond the scope of this discussion.

These are the 3 anomalous records, present in SODA and also in CARTO, but not detected as being in CARTO by date filter:

['3823151', '2018-01-06T08:00:00Z', u'-73.77078', u'40.66696']
['3591031', '2018-01-02T20:45:00Z', u'-74.14612', u'40.633747']
['3590187', '2018-01-01T13:15:00Z', u'-73.773476', u'40.679718']

The already-existing CARTO records are as follows: SELECT socrata_id, ST_ASTEXT(the_geom), date_val FROM crashes_all_prod WHERE socrata_id IN (3823151,3591031,3590187)

2017-01-01T13:15:00Z    3590187     null                                # no geom, date was Jan 1 2017, but later Jan 1 2018?
2017-01-02T20:45:00Z    3591031     null                                # no geom, date was Jan 2 2017, but later Jan 2 2018?
2017-12-06T08:00:00Z    3823151     POINT(-73.77078 40.66696)           # date was changed from Dec 6 to Jan 6

Looking closely at the cartodb_id serial numbers for these records compared to other records immediately adjacent to them, this is due to bad data coming out of Socrata.

Example: Record with socrata_id = 3591031

The record with socrata_id = 3591031 has cartodb_id=2017411 The SODA entry for this record seems to have changed date from 2017-01-02 to 2018-01-02.

Checking record with cartodb_id=2017411 against other records inserted around the same time, we see: image

Other records loaded around that same time were also dated January 2017.

Therefore, the change of this record's date from 2017-01-02 to 2018-01-02 must be erroneous at the City of NY data website. A crash back at that point in history (9 months ago) cannot possibly have happened happened 2018-01-02, such that we would expect that someone has corrected a bad record. Someone at NYC changed the record a year later. That is unfortunate and unexpected, but also beyond the scope of the CrashMapper.

gregallensworth commented 6 years ago

Noted in https://github.com/GreenInfo-Network/nyc-crash-mapper-chart-view/issues/74 that some records were being inserted with inconsistent borough values, e.g. "BRONX" instead of "Bronx" The ETL script was expected to account for these...

Investigation uncovered the very likely issue, and this would be relevant to the "look back a month" changed brought here in #6

create_sql_insert() inserts the borough names as-given from SODA. There are two followup functions which would correct borough names: update_borough() (assign a borough values based on spatial containment of the crash point with a borough polygon) and normalizeBoroughSpellings() (corrects records based on their string value, e.g. ilike %queens%)

However, both of these contain a filter by LATEST_DATE which is the highest date found in the table before the ETL process was started. As such, loading of a "backdated" record from an earlier date, would be skipped by both of these updates.

Recommendation:

gregallensworth commented 6 years ago

The updated ETL script has been deployed to Heroku, and should run tomorrow. Let's check on it tomorrow just for due diligence.

Current MAX(cartodb_id) is 2202693

So tomorrow this will bring up anything new: SELECT * FROM crashes_all_prod WHERE cartodb_id > 2202693

We should see 200ish new records (a gross average there), with boroughs and precincts and council districts filled in, etc. A couple may fall outside the boroughs and yet within the boroughs-bbox and have a blank borough, but only a couple.

gregallensworth commented 6 years ago

The Heroku scheduler does show that the task ran, however no new records were found.

I ran the task manually from command-line, and it ran just fine.

So, still need to figure out what's going on with this thing when Heroku runs it.

The new MAX(cartodb_id) is 2204429 Thus: SELECT * FROM crashes_all_prod WHERE cartodb_id > 2204429

gregallensworth commented 6 years ago
gregallensworth commented 6 years ago

Commit 3d39b6c should fix this, by adding add python-dateutils to the requirements.txt

This same commit also adds a runtime.txt as recommended by Heroku, to explicitly use Python 2.7.14. This silences warnings of 2.7.13 being a default and being unsupported.

gregallensworth commented 6 years ago

Looking good. The ETL did run last night, and picked up 617 new records since yesterday.

image

This does seem unusually high compared to some prior estimates of "200ish" crashes per day, so I checked the statistics (see above).

January 20 was a particularly bad day for NYC crashes: 481 crashes. There were logged overnight between Jan 23 and Jan 24, some 4 days later than they occurred.

Additionally, 58 new crashes from Jan 19 were picked up, 24 from Jan 18, and so forth tapering downward over the last two weeks.

This reinforces previous suspicions that the "since yesterday" behavior was missing a significant number of crashes which were backlogged by more than a day, and seems to indicate that a delay of 4-7 days is quite common.

As such, client should be quite pleased that moving forward we are picking up a significant number of crashes that were previously missed.

gregallensworth commented 6 years ago

Reopening for further work. Per issue #12 it was discovered that records may a) be updated after the fact, and may be recognized with the updated_at hidden system field; and b) entered not only 1 month after the fact but nearly three months after the fact.

gregallensworth commented 6 years ago

A sampling of known-variant records gave the following results. These seem to indicate that a record being entered a month (or three, in that one case) is quite rare, and that a week or less is more representative.

https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=:*,%20*&$where=unique_key%3D3649309
entered 5 days after crash date
updated almost a month later, changing 1 injury to a fatality

https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=:*,%20*&$where=unique_key%3D3624204
entered 6 days after crash date
updated a month later, changing 1 injury to a fatality

https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=:*,%20*&$where=unique_key%3D3626043
entered 4 days after crash date
updated 16 days later, changing 1 injury to a fatality

https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=:*,%20*&$where=unique_key%3D3628393
entered 4 days after crash date
updated 6 weeks later, changing 1 injury to a fatality

https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=:*,%20*&$where=unique_key%3D3630562
entered 4 days after crash date
updated 5 weeks later, changing 1 injury to a fatality

https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=:*,%20*&$where=unique_key%3D3635027
entered 4 days after crash date
updated 2 weeks later, changing 1 injury to a fatality

https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=:*,%20*&$where=unique_key%3D3645566
entered 5 days after crash date
updated 5 weeks later, changing 1 injury to a fatality

https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=:*,%20*&$where=unique_key%3D3649309
entered 5 days after crash date
updated 4 weeks later, changing 1 injury to a fatality

https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=:*,%20*&$where=unique_key%3D3649733
entered 3 days after crash date
updated 2 weeks later, changing 1 injury to a fatality

https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=:*,%20*&$where=unique_key%3D3655558
entered 3 days after crash date
updated 2 weeks later, changing 1 injury to a fatality

https://data.cityofnewyork.us/resource/qiz3-axqb.json?$select=:*,%20*&$where=unique_key%3D3655948
entered 3 days after crash date
updated 2 weeks later, changing 1 injury to a fatality

As such, scraping back 2 or 3 months is unlikely to catch many new records, as the recent 1-month window is quite effective. Still, there is no harm in making this change and it will catch the rare outlier.