GreenInfo-Network / nyc-crash-mapper-etl-script

Extract, Transform, and Load script for fetching new data from the NYC Open Data Portal's vehicle collision data and loading into the NYC Crash Mapper table on CARTO.
3 stars 0 forks source link

Suddenly low CARTO quota is killing import #15

Closed danrademacher closed 5 years ago

danrademacher commented 5 years ago

Client reports no new data since Sept 1.

I logged into CARTO and I see: image

Heroku log:

2018-11-12T17:00:25.965375+00:00 app[api]: Starting process with command `python main.py` by user scheduler@addons.heroku.com
2018-11-12T17:00:28.665051+00:00 heroku[scheduler.2390]: State changed from starting to up
2018-11-12T17:00:30.920187+00:00 app[scheduler.2390]:  05:00:30 PM - INFO - Getting data from Socrata SODA API as of 2018-09-12
2018-11-12T17:00:30.975810+00:00 app[scheduler.2390]: /app/.heroku/python/lib/python2.7/site-packages/urllib3/connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
2018-11-12T17:00:30.975815+00:00 app[scheduler.2390]:   InsecureRequestWarning)
2018-11-12T17:00:33.900943+00:00 app[scheduler.2390]:  05:00:33 PM - INFO - Got 36452 SODA entries OK
2018-11-12T17:00:33.901019+00:00 app[scheduler.2390]:  05:00:33 PM - INFO - Getting socrata_id list from CARTO as of 2018-09-12
2018-11-12T17:00:34.408454+00:00 app[scheduler.2390]:  05:00:34 PM - ERROR - No socrata_id rows: {"time": 0.295, "fields": {"socrata_id": {"type": "number"}}, "total_rows": 0, "rows": []}
2018-11-12T17:00:34.619079+00:00 heroku[scheduler.2390]: Process exited with status 1
2018-11-12T17:00:34.660615+00:00 heroku[scheduler.2390]: State changed from up to complete

that is also when Christine had problems with her grant, and sure enough they appear to have downgraded her to FREE plan on Sept 6: image

I have informed her about this and she'll need to work out with CARTO what the terms of the grant really are.

danrademacher commented 5 years ago

Carto restored the 3 GB limit: image

danrademacher commented 5 years ago

@gregallensworth now that our data quota is restored (see thread in this issue), would you expect the existing nightly scripts to "just pick up" some hundreds of missing crashes since September 6 or so? I think it would, but not sure.

gregallensworth commented 5 years ago

Next to-do item: Check https://dashboard.heroku.com/apps/nyc-crash-mapper-etl/logs on November 14 and we should see a successful run, as well as a few thousand new records described.

gregallensworth commented 5 years ago

Confirmed this morning that the ETL script is not running properly:

2018-11-14T17:00:29.272436+00:00 app[scheduler.2745]:  05:00:29 PM - ERROR - No socrata_id rows

So, I will need to go in and check and repair whatever it's not doing properly now. This may shed light on whether the breakage is related to the quota change.

Dan comments too, that CARTO has on prior occassions changed the API key for the account, when making account changes. The issue here may be the API key, and not the query itself.

gregallensworth commented 5 years ago

Methodology

The issue was not the API key.

The script is configured at several levels, to look back 2 months for new records not yet seen. That is September 14 2018 as the horizon backward. If the API glitch happened before then, we could be missing data for a few days.

If this is the case (see #16) I could do a one-time run of the ETL script, with a window of 3 months set, to catch stragglers.

Item One: 2 months is outside expected window

Found the programming issue which is causing it to abort: if not 'rows' in alreadydata or not len(alreadydata['rows']):

We are in fact seeing 0 records dated after 2018-09-14 The check above was meant to check for a failure, and did not anticipate that we would ever go 2 months without adding a single row.

Modifying this as follows, allowed the run to continue: if not 'rows' in alreadydata:

Item Two: 34,954 rows is too many

Now that it proceeds, we are seeing 34,954 new records missed in 2 months of letting this slide.

When the script tries to update, though: You are over platform's limits: SQL query timeout error. happens after 16 seconds. Simply too many records to insert in a single run using the existing technique.

I made an ad-hoc modification, to do only a few thousand at a time. I then re-ran this 12 times, catching 3000 old records each time.

    logger.info('Found {0} new rows to insert into CARTO'.format(len(vals)))

    vals = vals[:3000]
    logger.info('Sliced off only 3000 this time')

During this time, I also commented out the calls to some functions, which would calculate on every one of those 12 runs, and waste a lot of time:

Final Cleanup

After running the script 12 times as described above, I did a final run to catch stragglers, perform updates, and do post-insert recalculations. That is to say, I re-ran the ETL script:

This did indeed catch a few thousand more records, likely indicating that the account change took place in mod-August and not mid-September.

gregallensworth commented 5 years ago