Custom SODA run for 2015-01-01 to present?

GreenInfo-Network / nyc-crash-mapper-etl-script

Extract, Transform, and Load script for fetching new data from the NYC Open Data Portal's vehicle collision data and loading into the NYC Crash Mapper table on CARTO.

3 stars 0 forks source link

Custom SODA run for 2015-01-01 to present? #7

Closed gregallensworth closed 6 years ago

gregallensworth commented 6 years ago

Relevant to https://github.com/GreenInfo-Network/nyc-crash-mapper/issues/67

The solution described in #6 should prevent future data loading from skipping over late-entered crash records. Now the question is back-loading these missing crashes.

At 500-600 records per month (gross assumption that Dec 2017 was average), the last 36 months may be missing 20,000 crashes (out of the 1.2 million, about 2%).

A custom utility could be created to repeatedly download Socrata records and merge them, fetch the list of socrata_id entries in CARTO, and generate a JSON file of the SODA records which are missing from CARTO. This JSON file could then be loaded following techniques outlined in the existing ETL script.

danrademacher commented 6 years ago

This would be great and has been approved as GO by client in terms of budget.

In case it is useful reference: On a separate thread with here, I developed this Google sheet looking overall 2017 data from NYC vs our database: https://docs.google.com/spreadsheets/d/1R3DW3rdKXZP5zv4pQH8vsv0MSDEe5PY1bnmGJgakdnU/edit#gid=187881358

Main insight that may be new there is that in the source data persons_injured and persons_killed are often less than the sum of cyclists_injured,pedestrians_injured, and motorists_injured (and same for mode_killed

Oh and the ability to use SQL API endpoints directly from app plus format=csv to pull into gSheets is super awesome for data QA!

gregallensworth commented 6 years ago

I have a "backlog script" working, and results so far are quite good.

In the first 3 months of 2015:

I am seeing 26-30 new crashes per month, not the 500ish previously indicated by comparing CARTO results (it won't always return a full list of IDs, as it reaches query limits).
"Several" of them fall within NYC proper, and are being tagged by NYPD, Borough, Neighborhood, etc.
"Several" lack location coordinates at all.
The large majority (30% to 80%) are outside of NYC proper, by virtue of being outside of the seven-borough area (Valley Stream, Floral Park, Bellerose) or else on bridges not covered by a borough, NYPD, etc.

2015-03     31 new crashes, 5 in NYC proper, 6 lack location, others confirmed as bridges or outside study area
2015-02     27 new crashes, 5 lack location, 1 within NYC proper, others confirmed outside the study area or on bridges outside any specific area
2015-01     26 new crashes, 8 within NYC proper, rest confirmed on bridge so not in any specific area

gregallensworth commented 6 years ago

Still chugging along. Trends into 2016...

2016 January - March (3 months)

119 new crashes
5 no geometry
61 on bridges or outside study area

2016 April - September (6 months)

263 new crashes
21 had no coordinate geometry
122 confirmed outside area or on bridges

gregallensworth commented 6 years ago

Starting in 2017 the umber of "missing" crashes increased sharply. This may be due to loading being performed exclusively by the ETL script (with the "more recent than yesterday" issue) as opposed to previous data being collected and prepared through other means.

2017-01 365 new crashes 2017-02 551 new crashes 2017-03 1444 new crashes 2017-04 1169 new crashes 2017-05 1515 new crashes 2017-06 2166 new crashes 2017-07 2019 new crashes 2017-08 1902 new crashes 2017-09 1986 new crashes

Total for Q1 through Q3 of 2017:

13117 new crashes
11783 coded and polygon-matched OK
773 lacking geometry coordinates
561 outside of area or on bridge

gregallensworth commented 6 years ago

Loading done.

Date range was 2015-01 through 2017-12, this being 36 months.

The MAX(cartodb_id) before all of this was 2214408 so, a query may be made to isolate only these newly-added records from "the backlog"

Summary stats:

17245 new crashes added
15050 geocoded and fell into NYPD, borough, etc.
1045 have no coordinate geometry
1150 have coordinate geometry but fell outside area or on bridges

Commit 4c10e39 is the new backlog-loader script, for reference.

danrademacher commented 6 years ago

This is great!