Well, now they have finally added missing data (not sure when that happened) and our ETL script is finding 33,820 missing records, and then failing to upload them:
2019-01-29T17:00:23.070086+00:00 app[scheduler.9849]: InsecureRequestWarning)
2019-01-29T17:00:27.087394+00:00 heroku[scheduler.9849]: Process exited with status 1
2019-01-30T17:00:06.398753+00:00 heroku[scheduler.7257]: Starting process with command `python main.py`
2019-01-30T17:00:06.986906+00:00 heroku[scheduler.7257]: State changed from starting to up
2019-01-30T17:00:10.407374+00:00 app[scheduler.7257]: 05:00:10 PM - INFO - Getting data from Socrata SODA API as of 2018-11-30
2019-01-30T17:00:10.497571+00:00 app[scheduler.7257]: /app/.heroku/python/lib/python2.7/site-packages/urllib3/connectionpool.py:847: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
2019-01-30T17:00:10.497576+00:00 app[scheduler.7257]: InsecureRequestWarning)
2019-01-30T17:00:13.709213+00:00 app[scheduler.7257]: 05:00:13 PM - INFO - Got 33820 SODA entries OK
2019-01-30T17:00:13.709274+00:00 app[scheduler.7257]: 05:00:13 PM - INFO - Getting socrata_id list from CARTO as of 2018-11-30
2019-01-30T17:00:14.226384+00:00 app[scheduler.7257]: 05:00:14 PM - ERROR - No socrata_id rows: {"fields": {"socrata_id": {"type": "number"}}, "rows": [], "total_rows": 0, "time": 0.32}
I suspect this might be an issue where we hadn't accounted for this long a data gap, though I am not sure WHY that would cause a breakage.
1. The breakage in this case is that we are finding 0 rows at CARTO dated after 2018-11-30. We've never seen a gap this long in the NYC/Socrata data, so we never expected a condition where 0 rows had been added to CARTO over a two-month period. As such, I had the ETL script treat that as an error and bail for manual examination. The workaround here was to temporarily change `FETCH_HOWMANY_MONTHS` from its usual value of 2 to 3, so the script would look a bit further back, find something already in CARTO, and not bail.
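For context, here is a minimal sketch of the check that was bailing, with the CARTO endpoint, table, and date column written as placeholder names rather than the script's actual identifiers:

```python
import datetime
import requests

# Placeholder constants for illustration; the real script's values may differ.
CARTO_SQL_API = 'https://chekpeds.carto.com/api/v2/sql'
CRASHES_TABLE = 'crashes_all_prod'
FETCH_HOWMANY_MONTHS = 3  # temporarily raised from 2 to bridge the data gap


def get_existing_socrata_ids(since_date):
    """Ask CARTO which socrata_id values already exist after the cutoff date."""
    sql = ("SELECT socrata_id FROM {table} "
           "WHERE date_val >= '{since}'").format(table=CRASHES_TABLE,
                                                 since=since_date.isoformat())
    reply = requests.get(CARTO_SQL_API, params={'q': sql}).json()
    rows = reply.get('rows', [])
    if not rows:
        # An empty result had never happened before, so the script treated it
        # as an error and exited for manual examination -- the breakage above.
        raise RuntimeError('No socrata_id rows in CARTO since %s' % since_date)
    return set(row['socrata_id'] for row in rows)


cutoff = datetime.date.today() - datetime.timedelta(days=30 * FETCH_HOWMANY_MONTHS)
existing_ids = get_existing_socrata_ids(cutoff)
```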
2. A second breakage is that I found another Socrata record (4051650, in addition to the previously known 4022160) which lacks the `number_of_persons_killed` and `number_of_persons_injured` fields. This is a fatal error, and I have added a workaround that sums the three person-type counts to derive the missing totals. This condition has been seen and worked around before, but only in the `find_updated_killcounts()` path where we are back-editing existing records. In this case, we are for the first time seeing a brand-new SODA record which lacks these fields.
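The workaround amounts to something like the sketch below; the person-type field names follow the public h9gi-nx95 dataset, and the helper name is a placeholder:

```python
def fill_missing_person_totals(record):
    """Derive missing overall killed/injured totals on a SODA record by
    summing the three person-type counts (pedestrian, cyclist, motorist)."""
    totals = {
        'number_of_persons_killed': ('number_of_pedestrians_killed',
                                     'number_of_cyclist_killed',
                                     'number_of_motorist_killed'),
        'number_of_persons_injured': ('number_of_pedestrians_injured',
                                      'number_of_cyclist_injured',
                                      'number_of_motorist_injured'),
    }
    for total_field, part_fields in totals.items():
        if not record.get(total_field):
            record[total_field] = sum(int(record.get(part, 0) or 0)
                                      for part in part_fields)
    return record
```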
3. CARTO is now reporting a syntax error in the SQL being generated. I'll have to track that down...
The syntax error is due to a new and unexpected value in the Socrata data: a `vehicle_type_code` of GOV'T. The ETL script did not escape these strings, because no prior value needed escaping, e.g. Taxi, Sedan, Pick-up Truck. I have modified the script to strip out ' characters, so as to prevent this in the current data load and in future loads.
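In other words, roughly this (the function name is a placeholder, and stripping rather than escaping mirrors the quick fix described above):

```python
def sanitize_text_value(value):
    """Strip single quotes from free-text Socrata values so the generated
    INSERT statement stays valid SQL. Doubling the quote ('') or using a
    parameterized query would be sturdier alternatives."""
    if value is None:
        return ''
    return str(value).replace("'", '')


sanitize_text_value("GOV'T")          # -> 'GOVT', no longer breaks the SQL
sanitize_text_value('Pick-up Truck')  # -> unchanged
```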
This new batch of some 40,000 records is of course causing problems with the postprocessing steps such as `update_borough()` and `update_city_council()`. I may need to refactor these into blocks as I did with `update_assembly()` and `update_senate()`, so that we can process the new block of 40,000 and also be more tolerant of these sorts of "big backlog" situations in the future.
I have reworked the polygon updates for Borough, Community Board, NYPD Precinct, Council, and Neighborhood so these are also done in `% 10` batches, the same as Assembly and Senate. This should make the ETL somewhat more resilient to large backlogs which would otherwise exceed our per-call API limit.
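Roughly, each polygon update now looks like the sketch below; the table and column names and the `run_carto_sql()` helper are placeholders, not the script's actual identifiers:

```python
def update_polygon_column(crash_table, target_column, polygon_table, source_column):
    """Fill a district/borough column via spatial join, in ten passes keyed
    on cartodb_id % 10, so no single CARTO SQL API call has to touch the
    whole 40,000-record backlog at once."""
    for bucket in range(10):
        sql = (
            "UPDATE {ct} SET {tc} = poly.{sc} "
            "FROM {pt} AS poly "
            "WHERE ST_Contains(poly.the_geom, {ct}.the_geom) "
            "AND {ct}.{tc} IS NULL "
            "AND {ct}.cartodb_id % 10 = {b}"
        ).format(ct=crash_table, tc=target_column, pt=polygon_table,
                 sc=source_column, b=bucket)
        run_carto_sql(sql)  # placeholder for the script's CARTO SQL API call


# e.g. update_polygon_column('crashes_all_prod', 'borough', 'nyc_borough', 'borough')
```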
Today's run brings the stats up to date.
The new ETL script has now been deployed to Heroku as well, so tomorrow's run should stay on an even keel until the next unforeseen circumstance.
Always a new way to break! Thanks for the quick fix.
Hmm, the client reports, and I just confirmed in https://chekpeds.carto.com/tables/crashes_all_prod, that we have no crashes since Nov 16, 2018.
Looking at the logs at https://dashboard.heroku.com/apps/nyc-crash-mapper-etl/logs, I can see that the scheduled task is running.
Ah, here's a new one: NYC data has nothing since 11/16. https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95/data
Their view of all data, sorted descending by date:
Will report this back to client. Leaving open for now.