Closed by danrademacher 6 years ago
In terms of reloading and reprocessing the entire dataset, the best person to ask is probably the contractor from Serbia whom Christine hired via UpWork to do the data wrangling of the NYC Open Data and the historical summary NYPD data from August 2011 onward. From what I recall, he wrote a program in Java to parse and load the data into a local Postgres/PostGIS database, so that code may not be helpful to us anyway. I remember that most of the problems he encountered had to do with even older data that Transportation Alternatives had previously used, which was not compatible with the data we're currently using.
When the contractor had finished that task, he delivered to me a Postgres/PostGIS database extract. To get that data into CARTO, I had to export several CSV files of the finalized `crashes` table, import each CSV file into CARTO, join the imported files into a single table, and then coerce the fields for `contributing_factor` and `vehicle_type` into Postgres string array data types. That last step is documented in `./sql/2016_data_update.sql`, but may not be needed.
In terms of updating a single year of data in CARTO, that could be accomplished by downloading the data from Socrata, either through their Dashboard or programmatically through their API, processing the data so that it conforms with the `crashes_all_prod` table on CARTO, and then running an `UPDATE` SQL query to replace it for that year. The functions within the ETL script describe how data is processed from the Socrata API so that it conforms with the table on CARTO. It's fairly straightforward and written in Python, so it could probably be refactored into a separate script that updates data for a single year. The only limitation I foresee is that both the Socrata API and the CARTO SQL API have limits on how much data you can request or POST in a single API call.
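As a rough sketch of what such a single-year refresh could look like (the helper names here are hypothetical; the dataset id `qiz3-axqb` and the `date`/`date_val` columns come from this thread, but the actual ETL internals may differ):

```python
# Hypothetical sketch of a single-year refresh; these helpers are
# illustrative, not the actual ETL code.

SODA_ENDPOINT = "https://data.cityofnewyork.us/resource/qiz3-axqb.json"
CARTO_TABLE = "crashes_all_prod"

def soda_year_url(year, limit=50000, offset=0):
    """Build a SODA query URL for all crashes in a given calendar year.

    In practice the $where clause would need URL-encoding before the request.
    """
    where = (f"date >= '{year}-01-01T00:00:00.000' "
             f"AND date < '{year + 1}-01-01T00:00:00.000'")
    return (f"{SODA_ENDPOINT}?$where={where}"
            f"&$order=date ASC&$limit={limit}&$offset={offset}")

def carto_delete_sql(year):
    """SQL to clear the target year in CARTO before re-inserting fresh rows."""
    return (f"DELETE FROM {CARTO_TABLE} "
            f"WHERE date_val >= '{year}-01-01' "
            f"AND date_val < '{year + 1}-01-01'")
```

A delete-then-insert per year sidesteps matching individual rows for `UPDATE`, at the cost of briefly removing that year's data while the refresh runs.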
Great. Thanks. Super helpful to have it all one place here.
Seems like a potential middle ground would be to prioritize certain years and reload only those, like the last 3 or something. Would match the RANK view and ensure that's aligned with NYC DOT numbers.
One more question, @clhenrick. In the README on this repo, we say:
> Prior to importing the data into CARTO, rows lacking lat lon with potentially valid address information were attempted to be geocoded using the NYC GeoClient API. For example, if a row lacked values for lat lon, but contained values for cross streets and borough or zip code, the NYC GeoClient intersection endpoint was used to geocode the crash location.
Is that use of the NYC geocoder included in ETL? On a short review, I think it's not, but wanted to check. Perhaps that was done by the Upwork contractor?
From what I recall, this didn't work in most instances, and the contractor was only able to geocode a handful of crashes. It had to do with missing street prefixes and/or suffixes, which are required for the NYC GeoClient API to geocode intersections.
OK, in that case, not worth worrying about that separate geocode run.
OK, a few more notes here:
As of Oct 2017 (the most recent complete month), reloading the dataset would involve processing 672,558 total crashes (per the citywide stats at the bottom of the map, with these filters applied).
We don't know what the limit is; some sort of streaming approach seems like it would be best.
The ETL script uses the `MAX(date_val)` from `crashes_all_prod` to construct a query to SODA, so as to request only records newer than the latest update. Typically this runs each day, which means it picks up new SODA records that have appeared within the last day.
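The incremental `$where` clause could be built roughly like this (a minimal sketch; `incremental_where` is a hypothetical helper, and the optional lookback is one possible mitigation for late-reported crashes, not part of the current script):

```python
from datetime import datetime, timedelta

def incremental_where(max_date_val, lookback_days=0):
    """Build a SODA $where clause for records on or after CARTO's MAX(date_val).

    A nonzero lookback_days widens the window backwards, so crashes that
    were reported a few days late get re-requested on subsequent runs.
    """
    since = datetime.strptime(max_date_val, "%Y-%m-%d") - timedelta(days=lookback_days)
    return f"date >= '{since.strftime('%Y-%m-%d')}T00:00:00.000'"
```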
Client is concerned that some crashes may not be reported in sequence, and may be reported days after the fact. As a result, the ETL script running on January 8 would log crashes dated January 7, but would not log a January 6 crash which was added to SODA after the January 7 crashes were loaded.
Goal: Examine SODA and the `crashes_all_prod` table to determine whether this is a realistic problem.
Goal: If there are a significant number of crashes missing:
Run on January 11 2018, the latest crash logged in CARTO and displayed in SODA agree: 2018-01-08 is the latest, and there were 585 crashes on that date.
A SODA dump was obtained for December 12 2017 - Jan 8 2018, at https://data.cityofnewyork.us/resource/qiz3-axqb.json?$where=date%20%3E=%20%272017-12-12T00:00:00.000%27&$order=date%20DESC&$limit=20000
SODA dump: 16,872 records
CARTO table: 16,658 records
Thus, 214 crashes (1.27%) in SODA were not present in CARTO, in this time period.
Correlating CARTO and SODA records: the SODA records have a `unique_key` field, which corresponds to the `socrata_id` field in CARTO. (The CARTO table does have its own `unique_key` field, but this is not used.) Therefore, it should be an easy matter to find records in a SODA dataset which are absent from the CARTO dataset.
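A minimal sketch of that diff, assuming each SODA record and CARTO row is available as a dict (`missing_from_carto` is a hypothetical helper, not existing code):

```python
def missing_from_carto(soda_records, carto_rows):
    """Return SODA records whose unique_key has no matching socrata_id in CARTO.

    Keys are compared as strings, since SODA returns ids as JSON strings
    while the CARTO column may be numeric.
    """
    carto_ids = {str(row["socrata_id"]) for row in carto_rows}
    return [r for r in soda_records if str(r["unique_key"]) not in carto_ids]
```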
The SODA API parameters are well documented; the three of most immediate interest are:

- `$where`, e.g. `date >= '2017-12-31'`, used to filter records
- `$order`, e.g. `date DESC`, used to sort records
- `$limit`, which defaults to 1000, but also seems to work with much larger values such as 250000
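Given the uncertainty about hard limits, a paged fetch using `$limit`/`$offset` could stream the full dataset in bounded chunks. A minimal sketch, with the actual HTTP request injected as `fetch_page` (a hypothetical callable) so the paging logic stays separate from the transport:

```python
def fetch_all(fetch_page, page_size=1000):
    """Yield every record by requesting successive pages via $limit/$offset.

    fetch_page(limit, offset) should perform the actual SODA request and
    return a list of records; a short page signals the end of the data.
    """
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        yield from page
        if len(page) < page_size:
            break
        offset += page_size
```

Note that plain `$offset` paging can skip or duplicate rows if the dataset changes mid-fetch; ordering by a stable field (e.g. `$order=unique_key`) mitigates this.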
An additional concern that inspires this discussion of "reloading" the data is that a crash's injury and/or fatality counts may change after the crash was initially displayed in SODA and loaded into CARTO. An example could be an injury not detected for some weeks, or an injury which resulted in death weeks later.
The existence and preponderance of such later-altered records are yet to be determined.
@gregallensworth also keep in mind that the ETL script currently filters out "bad data", e.g. any crash with a lat/lon outside the bounding box of NYC. It doesn't seem that this filter would account for the whole 214-crash difference, though.
I know we talked about this live, but I'd like to capture here for the record what you know about the task of reloading and reprocessing the dataset from NYC Open Data at some regular interval to capture any old crashes that might get added or updated by NYC after our regular ETL scripts run.