Closed: danrademacher closed this issue 5 years ago
Carto restored the 3 GB limit:
@gregallensworth now that our data quota is restored (see thread in this issue), would you expect the existing nightly scripts to "just pick up" some hundreds of missing crashes since September 6 or so? I think it would, but not sure.
Next to-do item: Check https://dashboard.heroku.com/apps/nyc-crash-mapper-etl/logs on November 14 and we should see a successful run, as well as a few thousand new records described.
Confirmed this morning that the ETL script is not running properly:
2018-11-14T17:00:29.272436+00:00 app[scheduler.2745]: 05:00:29 PM - ERROR - No socrata_id rows
So, I will need to go in and check and repair whatever it's not doing properly now. This may shed light on whether the breakage is related to the quota change.
Dan also comments that CARTO has, on prior occasions, changed the API key for the account when making account changes. The issue here may be the API key, and not the query itself.
Methodology
The issue was not the API key.
The script is configured, at several levels, to look back 2 months for new records not yet seen; that puts the backward horizon at September 14, 2018. If the API glitch happened before then, we could be missing data for a few days.
If this is the case (see #16), I could do a one-time run of the ETL script with a 3-month window to catch stragglers.
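The lookback described above amounts to computing a horizon date N months back and only fetching records newer than it. A minimal sketch of that calculation (the function name is illustrative, not the actual script's; the stdlib has no "minus N months", so this steps back whole months and clamps the day-of-month to stay valid in short months):

```python
from datetime import date

def lookback_horizon(today, months=2):
    # Step back `months` whole months, wrapping across year boundaries.
    year, month = today.year, today.month - months
    while month < 1:
        month += 12
        year -= 1
    # Clamp to day 28 so e.g. Jan 31 minus one month stays a valid date.
    return date(year, month, min(today.day, 28))

# A run on 2018-11-14 with the default 2-month window:
print(lookback_horizon(date(2018, 11, 14)))  # 2018-09-14
# The proposed one-time 3-month window would reach back to 2018-08-14:
print(lookback_horizon(date(2018, 11, 14), months=3))  # 2018-08-14
```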
Item One: 2 months is outside expected window
Found the programming issue which is causing it to abort:
if not 'rows' in alreadydata or not len(alreadydata['rows']):
We are in fact seeing 0 records dated after 2018-09-14. The check above was meant to detect a failure, and did not anticipate that we would ever go 2 months without adding a single row.
Modifying it as follows allowed the run to continue:
if not 'rows' in alreadydata:
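The distinction matters because an API failure and a legitimately empty result have different response shapes: a failure response lacks the 'rows' key entirely, while a valid-but-empty result has 'rows' as an empty list. A sketch of the two cases (variable and function names are illustrative):

```python
api_failure = {'error': 'something went wrong'}  # malformed/error payload, no 'rows' key
empty_result = {'rows': []}                      # valid response, zero new records in window

def should_abort_old(resp):
    # Original check: also treats an empty (but valid) result as a failure
    return 'rows' not in resp or not len(resp['rows'])

def should_abort_new(resp):
    # Fixed check: abort only when the response is actually malformed
    return 'rows' not in resp

print(should_abort_old(empty_result))  # True  -- wrongly aborted the run
print(should_abort_new(empty_result))  # False -- run continues
print(should_abort_new(api_failure))   # True  -- real failures still caught
```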
Item Two: 34,954 rows is too many
Now that it proceeds, we are seeing 34,954 new records missed in 2 months of letting this slide.
When the script tries to update, though, it fails after 16 seconds with: You are over platform's limits: SQL query timeout error.
There are simply too many records to insert in a single run using the existing technique.
I made an ad-hoc modification to insert only a few thousand at a time, then re-ran the script 12 times, catching 3,000 old records each time:
logger.info('Found {0} new rows to insert into CARTO'.format(len(vals)))
vals = vals[:3000]
logger.info('Sliced off only 3000 this time')
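Generalizing the ad-hoc slice, the same effect could be had with a batching loop instead of re-running the script by hand. A sketch, where insert_rows is a hypothetical callback standing in for the script's actual insert step:

```python
BATCH_SIZE = 3000  # stayed under CARTO's SQL timeout in the runs described above

def insert_in_batches(vals, insert_rows, batch_size=BATCH_SIZE):
    """Insert vals in slices small enough to avoid the platform timeout."""
    for start in range(0, len(vals), batch_size):
        insert_rows(vals[start:start + batch_size])
```

With the 34,954 missed rows, this makes 12 calls of at most 3,000 rows each, matching the 12 manual re-runs.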
During this time, I also commented out calls to two functions that would otherwise recalculate on every one of those 12 runs and waste a lot of time:
update_intersections_crashcount()
which tallies crash counts per intersection; this is to be run after all of the inserts have finished, so the update includes the last 2 months of missing data

find_updated_killcounts()
which looks for updated kill/injure numbers for existing crashes; since we have 0 records dating back these 2 months, and are just now discovering these records with the latest kill/injure numbers, this function is unnecessary during this data patch

Final Cleanup
After running the script 12 times as described above, I did a final run to catch stragglers, perform updates, and do the post-insert recalculations. That is to say, I re-ran the ETL script.
This did indeed catch a few thousand more records, likely indicating that the account change took place in mid-August and not mid-September.
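For future patches of this kind, the comment-in/comment-out workflow above could be replaced by a flag gating the post-insert recalculations. A sketch, with stub functions standing in for the real ETL steps (names borrowed from the script, bodies here are placeholders that just record calls):

```python
calls = []  # records which steps ran, for illustration

def insert_new_rows():
    calls.append('insert')

def update_intersections_crashcount():
    calls.append('crashcount')

def find_updated_killcounts():
    calls.append('killcounts')

def run_etl(do_recalc=True):
    insert_new_rows()
    if do_recalc:  # skip on intermediate batch runs, enable on the final run
        update_intersections_crashcount()
        find_updated_killcounts()

run_etl(do_recalc=False)  # intermediate batch run: inserts only
run_etl(do_recalc=True)   # final cleanup run: inserts plus recalculations
print(calls)              # ['insert', 'insert', 'crashcount', 'killcounts']
```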
The new ETL script has been deployed to Heroku. Let's check https://dashboard.heroku.com/apps/nyc-crash-mapper-etl/logs tomorrow and make sure it ran nicely; then we should be able to close #15.
After the fix in #17 all data should be present as if no interruption had ever happened. The QC in #16 may now proceed.
Client reports no new data since Sept 1.
I logged into CARTO and I see:
Heroku log:
That is also when Christine had problems with her grant, and sure enough CARTO appears to have downgraded her to the FREE plan on Sept 6:
I have informed her about this and she'll need to work out with CARTO what the terms of the grant really are.