Closed by danrademacher 6 years ago
In terms of reloading and reprocessing the entire dataset, the best person to ask is probably the contractor from Serbia whom Christine hired via UpWork to do the data wrangling of the NYC Open Data and the historical summary NYPD data from August 2011 onward. From what I recall, he wrote a program in Java to parse and load the data into a local Postgres/PostGIS database, so that code may not be helpful to us anyway. I remember that most of the problems he encountered had to do with even older data that Transportation Alternatives had previously used, which was not compatible with the data we're currently using.
When the contractor had finished that task, he delivered to me a Postgres/PostGIS database extract. To get that data into CARTO, I had to export several CSV files of the finalized `crashes` table, import each CSV file into CARTO, join the imported files into a single table, and then coerce the fields for `contributing_factor` and `vehicle_type` into Postgres string array data types. That last step is documented in `./sql/2016_data_update.sql`, but may not be needed.
In terms of updating a single year of data in CARTO, that could be accomplished by downloading the data from Socrata, either through their Dashboard or programmatically through their API, processing the data so that it conforms with the `crashes_all_prod` table on CARTO, and then running an `UPDATE` SQL query to replace it for that year. The functions within the ETL script describe how data is processed from the Socrata API so that it conforms with the table on CARTO. It's fairly straightforward and written in Python, so it could probably be refactored into a separate script that updates data for a single year. The only limitation I foresee is that both the Socrata API and the CARTO SQL API have limits on how much data you can request or POST in a single API call.
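As a rough sketch of what such a single-year refresh could look like (the helper names here are hypothetical; the dataset id `qiz3-axqb` and the `date`/`date_val` columns come from this thread, but the actual ETL internals may differ):

```python
# Hypothetical sketch of a single-year refresh; these helpers are
# illustrative, not the actual ETL code.

SODA_ENDPOINT = "https://data.cityofnewyork.us/resource/qiz3-axqb.json"
CARTO_TABLE = "crashes_all_prod"

def soda_year_url(year, limit=50000, offset=0):
    """Build a SODA query URL for all crashes in a given calendar year.

    In practice the $where clause would need URL-encoding before the request.
    """
    where = (f"date >= '{year}-01-01T00:00:00.000' "
             f"AND date < '{year + 1}-01-01T00:00:00.000'")
    return (f"{SODA_ENDPOINT}?$where={where}"
            f"&$order=date ASC&$limit={limit}&$offset={offset}")

def carto_delete_sql(year):
    """SQL to clear the target year in CARTO before re-inserting fresh rows."""
    return (f"DELETE FROM {CARTO_TABLE} "
            f"WHERE date_val >= '{year}-01-01' "
            f"AND date_val < '{year + 1}-01-01'")
```

A delete-then-insert per year sidesteps matching individual rows for `UPDATE`, at the cost of briefly removing that year's data while the refresh runs.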
Great. Thanks. Super helpful to have it all one place here.
Seems like a potential middle ground would be to prioritize certain years and reload only those, like the last 3 or something. Would match the RANK view and ensure that's aligned with NYC DOT numbers.
One more question, @clhenrick. In the README on this repo, we say:
> Prior to importing the data into CARTO, rows lacking lat lon with potentially valid address information were attempted to be geocoded using the NYC GeoClient API. For example, if a row lacked values for lat lon, but contained values for cross streets and borough or zip code, the NYC GeoClient intersection endpoint was used to geocode the crash location.
Is that use of the NYC geocoder included in ETL? On a short review, I think it's not, but wanted to check. Perhaps that was done by the Upwork contractor?
From what I recall, this didn't work in most instances, and the contractor was only able to geocode a handful of crashes. It had to do with missing street prefixes and/or suffixes, which are required for the NYC GeoClient API to geocode intersections.
OK, in that case, not worth worrying about that separate geocode run.
OK, a few more notes here:
As of Oct 2017 (the most recent complete month), reloading the dataset would involve processing 672,558 total crashes (per the citywide stats at the bottom of the map, with these filters applied).
We don't know what the limit is; some sort of streaming approach seems like it would be best.
The ETL script uses the `MAX(date_val)` from `crashes_all_prod` to construct a query to SODA, so as to request only records newer than the latest update. Typically this runs each day, which means it picks up new SODA records that have appeared within the last day.
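The incremental `$where` clause could be built roughly like this (a minimal sketch; `incremental_where` is a hypothetical helper, and the optional lookback is one possible mitigation for late-reported crashes, not part of the current script):

```python
from datetime import datetime, timedelta

def incremental_where(max_date_val, lookback_days=0):
    """Build a SODA $where clause for records on or after CARTO's MAX(date_val).

    A nonzero lookback_days widens the window backwards, so crashes that
    were reported a few days late get re-requested on subsequent runs.
    """
    since = datetime.strptime(max_date_val, "%Y-%m-%d") - timedelta(days=lookback_days)
    return f"date >= '{since.strftime('%Y-%m-%d')}T00:00:00.000'"
```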
Client is concerned that some crashes may not be reported in sequence, and may be reported days after the fact. As a result, the ETL script running on January 8 would log crashes dated January 7, but would not log a January 6 crash which was added to SODA after the January 7 crashes were loaded.
Goal: Examine SODA and the `crashes_all_prod` table to determine whether this is a realistic problem.
Goal: If there are a significant number of crashes missing:
Run on January 11 2018, the latest crash logged in CARTO and displayed in SODA agree: 2018-01-08 is the latest, and there were 585 crashes on that date.
A SODA dump was obtained for December 12 2017 - Jan 8 2018, at https://data.cityofnewyork.us/resource/qiz3-axqb.json?$where=date%20%3E=%20%272017-12-12T00:00:00.000%27&$order=date%20DESC&$limit=20000
SODA dump: 16,872 records
CARTO table: 16,658 records
Thus, 214 crashes (1.27%) in SODA were not present in CARTO, in this time period.
Correlating CARTO and SODA records: the SODA records have a `unique_key` field, which corresponds to the `socrata_id` field in CARTO. (The CARTO table does have its own `unique_key` field, but this is not used.) Therefore, it should be an easy matter to find records in a SODA dataset which are absent from the CARTO dataset.
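A minimal sketch of that diff, assuming each SODA record and CARTO row is available as a dict (`missing_from_carto` is a hypothetical helper, not existing code):

```python
def missing_from_carto(soda_records, carto_rows):
    """Return SODA records whose unique_key has no matching socrata_id in CARTO.

    Keys are compared as strings, since SODA returns ids as JSON strings
    while the CARTO column may be numeric.
    """
    carto_ids = {str(row["socrata_id"]) for row in carto_rows}
    return [r for r in soda_records if str(r["unique_key"]) not in carto_ids]
```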
The SODA API parameters are well documented; the three of most immediate interest are:

- `$where`, e.g. `date >= '2017-12-31'`, used to filter records
- `$order`, e.g. `date DESC`, used to sort records
- `$limit`, which defaults to 1000, but also seems to work with much larger values such as 250000
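Given the uncertainty about hard limits, a paged fetch using `$limit`/`$offset` could stream the full dataset in bounded chunks. A minimal sketch, with the actual HTTP request injected as `fetch_page` (a hypothetical callable) so the paging logic stays separate from the transport:

```python
def fetch_all(fetch_page, page_size=1000):
    """Yield every record by requesting successive pages via $limit/$offset.

    fetch_page(limit, offset) should perform the actual SODA request and
    return a list of records; a short page signals the end of the data.
    """
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        yield from page
        if len(page) < page_size:
            break
        offset += page_size
```

Note that plain `$offset` paging can skip or duplicate rows if the dataset changes mid-fetch; ordering by a stable field (e.g. `$order=unique_key`) mitigates this.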
An additional concern that inspires this discussion of "reloading" the data is that a crash's injury and/or fatality counts may change after the crash was initially displayed in SODA and loaded into CARTO. An example could be an injury not detected for some weeks, or an injury which resulted in death weeks later.
The existence and preponderance of such later-altered records are yet to be determined.
@gregallensworth also keep in mind that the ETL script currently filters out "bad data", e.g. any crash with a lat/lon outside the bounding box of NYC. It doesn't seem that this filter would account for the whole 214-crash difference, though.
I know we talked about this live, but I'd like to capture here for the record what you know about the task of reloading and reprocessing the dataset from NYC Open Data at some regular interval to capture any old crashes that might get added or updated by NYC after our regular ETL scripts run.