GreenInfo-Network / nyc-crash-mapper-etl-script

Extract, Transform, and Load script for fetching new data from the NYC Open Data Portal's vehicle collision data and loading into the NYC Crash Mapper table on CARTO.

no data since 11/16 #18

Closed danrademacher closed 5 years ago

danrademacher commented 5 years ago

Hmm, the client reports, and I just confirmed in https://chekpeds.carto.com/tables/crashes_all_prod, that we have no crashes since Nov 16, 2018.

Looking at the logs at https://dashboard.heroku.com/apps/nyc-crash-mapper-etl/logs, I can see that the scheduled task is running.

Ah, here's a new one: the NYC data itself has nothing since 11/16. https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95/data

Their view of all data, sorted descending by date: [screenshot]

Will report this back to client. Leaving open for now.

danrademacher commented 5 years ago

Well, now they have finally added the missing data (not sure when that happened), and our ETL script is finding 33,820 missing records, then failing to upload them:

2019-01-29T17:00:23.070086+00:00 app[scheduler.9849]:   InsecureRequestWarning)
2019-01-29T17:00:27.087394+00:00 heroku[scheduler.9849]: Process exited with status 1
2019-01-30T17:00:06.398753+00:00 heroku[scheduler.7257]: Starting process with command `python main.py`
2019-01-30T17:00:06.986906+00:00 heroku[scheduler.7257]: State changed from starting to up
2019-01-30T17:00:10.407374+00:00 app[scheduler.7257]:  05:00:10 PM - INFO - Getting data from Socrata SODA API as of 2018-11-30
2019-01-30T17:00:10.497571+00:00 app[scheduler.7257]: /app/.heroku/python/lib/python2.7/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
2019-01-30T17:00:10.497576+00:00 app[scheduler.7257]:   InsecureRequestWarning)
2019-01-30T17:00:13.709213+00:00 app[scheduler.7257]:  05:00:13 PM - INFO - Got 33820 SODA entries OK
2019-01-30T17:00:13.709274+00:00 app[scheduler.7257]:  05:00:13 PM - INFO - Getting socrata_id list from CARTO as of 2018-11-30
2019-01-30T17:00:14.226384+00:00 app[scheduler.7257]:  05:00:14 PM - ERROR - No socrata_id rows: {"fields": {"socrata_id": {"type": "number"}}, "rows": [], "total_rows": 0, "time": 0.32}

I suspect this is an issue where we hadn't accounted for a data gap this long, though I am not sure WHY that would cause a breakage.

gregallensworth commented 5 years ago

1. The breakage in this case is that we are finding 0 rows at CARTO dated after 2018-11-30.

We've never seen a gap this long in the NYC/Socrata data, so we never expected a condition where 0 rows had been added to CARTO in a two-month period. As such, I had the ETL script treat this as an error and bail for manual examination.

The workaround here was to temporarily change FETCH_HOWMANY_MONTHS from its usual value of 2 to 3, so the script would look a bit further back, find something already in CARTO, and not bail (see the first sketch after this list).

2. A second breakage is that I found another Socrata record (4051650, with 4022160 known previously) which lacks the number_of_persons_killed and number_of_persons_injured fields. This is a fatal error, and I have added a workaround that derives each total by summing the three person-type counts (see the second sketch after this list).

This condition has been seen and worked around before, but only in find_updated_killcounts(), where we are back-editing existing records. Here we are seeing, for the first time, a brand-new SODA record which lacks these fields.

3. CARTO is now reporting a syntax error in the SQL being generated. I'll have to track that down...
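A minimal sketch of the guard from point 1 and the widened fetch window. FETCH_HOWMANY_MONTHS is the script's real setting; the surrounding function names and the 30-day-month approximation are illustrative:

```python
import logging
from datetime import date, timedelta

# Temporarily widened from 2 to 3 so the fetch window reaches back past
# the two-month gap and finds rows already present in CARTO.
FETCH_HOWMANY_MONTHS = 3

def fetch_window_start():
    # Approximate a month as 30 days for the lookback cutoff.
    return date.today() - timedelta(days=30 * FETCH_HOWMANY_MONTHS)

def assert_carto_has_rows(rows, since):
    # Zero socrata_id rows for the entire window has historically meant a
    # broken query rather than a real gap, so bail for manual examination.
    if not rows:
        logging.error("No socrata_id rows in CARTO since %s", since)
        raise SystemExit(1)
```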
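And for point 2, a sketch of the missing-totals fallback; persons_count() is a hypothetical helper, and the pedestrian/cyclist/motorist field names are assumed from the SODA dataset:

```python
def persons_count(record, kind):
    """Return number_of_persons_<kind> (kind is 'killed' or 'injured'),
    summing the three person-type fields when the total is missing."""
    total = record.get('number_of_persons_' + kind)
    if total is not None:
        return int(total)
    # Records such as socrata_id 4051650 omit the totals entirely, so
    # derive them from pedestrians + cyclists + motorists instead.
    return sum(
        int(record.get('number_of_%s_%s' % (who, kind), 0))
        for who in ('pedestrians', 'cyclist', 'motorist')
    )
```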

gregallensworth commented 5 years ago

The syntax error is due to a new and unexpected value in the Socrata data: a vehicle_type_code of GOV'T. The ETL script did not escape these strings, because no prior value had needed escaping, e.g. Taxi, Sedan, Pick-up Truck.

I have modified the script to strip out ' characters, preventing this in the current data load and in future loads.
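A sketch of that sanitization, as a hypothetical helper applied to each text value before it is spliced into the generated SQL:

```python
def sanitize_text_value(value):
    # Strip single quotes (e.g. vehicle_type_code "GOV'T") so the value
    # cannot terminate the string literal in the generated INSERT statement.
    return value.replace("'", "")
```

Stripping loses the apostrophe from the stored value; the standard SQL alternative is to double the quote ('' inside the literal), but stripping keeps the generated statements simple.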

gregallensworth commented 5 years ago

This new batch of some 40,000 records is of course causing problems with the postprocessing steps such as update_borough() and update_city_council(). I may need to refactor these into blocks, as I did with update_assembly() and update_senate(), so that we can process the new block of 40,000 and also be more tolerant of these sorts of "big backlog" situations in the future.

gregallensworth commented 5 years ago

I have reworked the polygon updates for Borough, Community Board, NYPD Precinct, Council, and Neighborhood so these are also done in % 10 batches, the same as Assembly and Senate. This should make the ETL somewhat more resilient to large backlogs which would otherwise exceed our per-call API limit.
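A sketch of the batching pattern, assuming a hypothetical run_carto_sql() helper and illustrative table/column names (only crashes_all_prod comes from this thread):

```python
# Run one polygon-overlay UPDATE per modulus bucket so that no single
# CARTO SQL API call has to touch the whole 40,000-row backlog at once.
BOROUGH_UPDATE_SQL = """
    UPDATE crashes_all_prod c
    SET borough = b.name
    FROM boroughs b
    WHERE ST_Contains(b.the_geom, c.the_geom)
      AND c.borough IS NULL
      AND c.cartodb_id %% 10 = %d
"""

def update_borough(run_carto_sql):
    for bucket in range(10):  # buckets 0..9 cover every cartodb_id
        run_carto_sql(BOROUGH_UPDATE_SQL % bucket)
```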

gregallensworth commented 5 years ago

Today's run brings the stats up to date.

This new ETL script has now been deployed to Heroku as well, so tomorrow's run should stay on an even keel until the next unforeseen circumstance.

danrademacher commented 5 years ago

Always a new way to break! Thanks for the quick fix.