Closed gregallensworth closed 6 years ago
This would be great and has been approved as GO by client in terms of budget.
In case it is useful reference: On a separate thread with here, I developed this Google sheet looking overall 2017 data from NYC vs our database: https://docs.google.com/spreadsheets/d/1R3DW3rdKXZP5zv4pQH8vsv0MSDEe5PY1bnmGJgakdnU/edit#gid=187881358
Main insight that may be new there is that in the source data persons_injured
and persons_killed
are often less than the sum of cyclists_injured
,pedestrians_injured
, and motorists_injured
(and same for mode_killed
Oh and the ability to use SQL API endpoints directly from app plus format=csv
to pull into gSheets is super awesome for data QA!
I have a "backlog script" working, and results so far are quite good.
In the first 3 months of 2015:
2015-03 31 new crashes, 5 in NYC proper, 6 lack location, others confirmed as bridges or outside study area
2015-02 27 new crashes, 5 lack location, 1 within NYC proper, others confirmed outside the study area or on bridges outside any specific area
2015-01 26 new crashes, 8 within NYC proper, rest confirmed on bridge so not in any specific area
Still chugging along. Trends into 2016...
2016 January - March (3 months)
2016 April - September (6 months)
Starting in 2017 the umber of "missing" crashes increased sharply. This may be due to loading being performed exclusively by the ETL script (with the "more recent than yesterday" issue) as opposed to previous data being collected and prepared through other means.
2017-01 365 new crashes 2017-02 551 new crashes 2017-03 1444 new crashes 2017-04 1169 new crashes 2017-05 1515 new crashes 2017-06 2166 new crashes 2017-07 2019 new crashes 2017-08 1902 new crashes 2017-09 1986 new crashes
Total for Q1 through Q3 of 2017:
Loading done.
Date range was 2015-01 through 2017-12, this being 36 months.
The MAX(cartodb_id)
before all of this was 2214408 so, a query may be made to isolate only these newly-added records from "the backlog"
Summary stats:
Commit 4c10e39 is the new backlog-loader script, for reference.
This is great!
Relevant to https://github.com/GreenInfo-Network/nyc-crash-mapper/issues/67
The solution described in #6 should prevent future data loading from skipping over late-entered crash records. Now the question is back-loading these missing crashes.
At 500-600 records per month (gross assumption that Dec 2017 was average), the last 36 months may be missing 20,000 crashes (out of the 1.2 million, about 2%).
A custom utility could be created to repeatedly download Socrata records and merge them, fetch the list of socrata_id entries in CARTO, and generate a JSON file of the SODA records which are missing from CARTO. This JSON file could then be loaded following techniques outlined in the existing ETL script.