Missing Feb 2020 crashes, exceeding API limits

danrademacher commented 4 years ago

We have a reasonable number for January -- 13545 as a result of SELECT count(*) FROM crashes_all_prod where date_val >= '2020-01-01' and date_val <= '2020-01-31'

But in February we have only 358 as a result of SELECT count(*) FROM crashes_all_prod where date_val >= '2020-02-01' and date_val <= '2020-02-29'

And the map looks it:

We made changes to the ETL script in early January, but then it ran fine for 20+ days.

Based on https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95/data, there should be about 13,300 crashes in Feb.

Aha, damn, here it is in Heroku logs:

2020-03-03T16:30:21.225483+00:00 app[scheduler.6563]: 04:30:21 PM - INFO - Found 10233 new rows to insert into CARTO
2020-03-03T16:30:21.225538+00:00 app[scheduler.6563]: 04:30:21 PM - INFO - Creating CARTO SQL insert for 10233 new rows
2020-03-03T16:30:27.693163+00:00 app[scheduler.6563]: 04:30:27 PM - INFO - {"error":["You are over platform's limits: SQL query timeout error. Refactor your query before running again or contact CARTO support for more details."],"context":"limit","detail":"datasource"}
2020-03-03T16:30:28.020869+00:00 app[scheduler.6563]: 04:30:28 PM - INFO - {"error":["Quota exceeded by 565604KB"]}

We refactored the ETL to use their batch API specifically to avoid this, but

@fahadkirmani you set up emails so you should get alerts about things like this. Is that system not working? The goal of the notifications to you is that we discover these issues before we're 12,000 crashes behind.

@gregallensworth maybe we need to break the batches into even smaller payloads?

danrademacher commented 4 years ago

Hmm, I wonder if this is not an API rate limit at all but CARTO.com again lowering Christine's overall storage quota.

Seeing this in the GUI:

Heroku is saying we exceeded our quota by 565,604KB -- that's half a gig! So that's not our ETL trying to shove that much over the wire.

And based on that readout, I see (1051.8-500)*1000 = about 552,000KB. Different but close.

I'll report this to Christine.

danrademacher commented 4 years ago

Hmm, OK, they have restored our larger quota:

Looks like we see the same kind of errors in the last Heroku run. We'll see if they get picked up tomorrow.

I did try to run main.py locally, which I have done successfully in the past, but now it fails because sendgrid doesn't run locally after pip install. Will file a separate issue for @fahadkirmani on that.

danrademacher commented 4 years ago

Still no crashes, but the error as changed:

2020-03-06T16:31:08.652080+00:00 app[scheduler.8270]: 04:31:08 PM - INFO - Got 26764 SODA entries OK
2020-03-06T16:31:08.652281+00:00 app[scheduler.8270]: 04:31:08 PM - INFO - Getting socrata_id list from CARTO as of 2020-01-06
2020-03-06T16:31:09.338613+00:00 app[scheduler.8270]: 04:31:09 PM - INFO - Got 12527 socrata_id entries for existing CARTO records
2020-03-06T16:31:17.440741+00:00 app[scheduler.8270]: 04:31:17 PM - INFO - Found 14237 new rows to insert into CARTO
2020-03-06T16:31:17.440852+00:00 app[scheduler.8270]: 04:31:17 PM - INFO - Creating CARTO SQL insert for 14237 new rows
2020-03-06T16:31:45.199978+00:00 app[scheduler.8270]: 04:31:45 PM - INFO - {"error":["You are over platform's limits: SQL query timeout error. Refactor your query before running again or contact CARTO support for more details."],"context":"limit","detail":"datasource"}

danrademacher commented 4 years ago

So this part of the code creates with SQL insert: https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/blob/master/main.py#L300-L375

In this commit, @gregallensworth refactored other parts of the import script to use CARTO's batch API, https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/commit/54e4915582303e0455ccf02b2f24461fcc343c85

Perhaps we can do the same for the insert?

danrademacher commented 4 years ago

Forgot to @gregallensworth on this issue at noon! Here it is with history of what happened, latest error messages and suggestion to refactor the insert query to use Batch API.

danrademacher commented 4 years ago

For the record, as a stopgap, I tried dropping the max records from Socrata from 50000 all the way down to 250 and the insert still fails with same “over query limit” error

Christine confirmed with CARTO that we have enterprise level rate limits, so they must have clamped down on those.

Should be moot if we use batch, but since we used to regularly add 200+ crashes a day, seems like something did change in the existing api limits

danrademacher commented 4 years ago

Setting it to 100 succeeds. And that seems about the limit with the standard API.

So with a missing set of about 13000 records in February, the existing approach would have to run 130 times to catch up. And that is not consistently true. Experimenting with incrementing the SODA limit by 100, so there are always 100 new crashes, even that insert will fail sometimes.

gregallensworth commented 4 years ago

Update 1: Batch API is no-go

The batch API is not suitable here, because the batch API has a limit of 16 KB of payload (SQL string). Today's insert of 15,000 records is about 4.7 MB of content, and a single day's crash content would be right at that 16 KB limit and over that on some particularly busy data-entry days.

{u'error': [u'Your payload is too large: 4998923 bytes. Max size allowed is 16384 bytes (16kb). Are you trying to import data?. Please, check out import api http://docs.cartodb.com/cartodb-platform/import-api/']}

gregallensworth commented 4 years ago

Update 2: Socrata limits work, not scalable

Dan's idea of limiting the Socrata records seems to work for the short-term, but is not suitable for day-to-day usage. For one, making 500 queries to Socrata each day for the top 100 records, the top 200 records, the top 300 records, the top 400 records, ... would mean 500 Socrata hits and 500 CARTO inserts, and a running time around 300 minutes each night.

But... having hit Socrata for the usual 50,000 records and formatted a set of rows to be inserted into CARTO... what if instead we broke up that CARTO insert into chunks?

gregallensworth commented 4 years ago

Update 3: A successful way forward

Experimentally, we determined that CARTO's new API limits and current performance means that the CARTO API can only handle 100 new crash records at a time, and that takes about 18 seconds. I have reworked the part of the script which does the big INSERT to instead break up into smaller INSERT statements, of 100 crashes per block.

That means that today, to catch up, will be about 150 such blocks @ 20 seconds apiece = about 1 hour. It means that future days will typically run in 4-1 batches and thus 1-3 minutes, which seems acceptable.

The "backlog" data load has completed, and in crashmapper.org for Feb 2010, I see 3,030 crashes with any sort of injury and fatality, plus 10,559 crashes with no injury/fatality. That sounds like 13,000.

gregallensworth commented 4 years ago

A secondary question, is what happened in January to cause this failure.

CARTO's query limit was 30 seconds back then, and remains so today. So that is not a variable.

The number of records in the database has grown by a few thousand records per day, so it's possible that the time taken to perform the INSERT query was slowly creeping upward this whole time: 25 seconds, 29 seconds, and finally the fatal 30.1 seconds. But since I don't have detailed metrics on the running times over the past 3 years, this is conjecture. If this really were the case, it's possible that the chunk size may need to be reduced further in a few years.

It is also possible that CARTO has simply slowed down a bit, due to increased demands and/or due to de-prioritizing some clients on free/grant programs. Again, without metrics on the insert's running time, I have no evidence of this. If true, CARTO staff are unlikely to have sufficiently detailed metrics to confirm that these queries are specifically slowed, and there would be nothing we could do if it were true.

danrademacher commented 4 years ago

The key known cause was a sudden and "inadvertent" 75% reduction in our overall storage limit at CARTO, so when this quota got cut: https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/issues/25#issuecomment-594302056

ANY insert would fail.

gregallensworth commented 4 years ago

Replying to https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/issues/25#issuecomment-598857797

Indeed, that storage reduction would have caused an acute failure. If that had been the only factor, then restoring the storage limit would not have resulted in query timeout errors as a second problem.

The API query timeout happens when I try to insert more than 125-or-so records into the DB at once. Thus, the need for this 100-at-a-time loop. Contrast with prior behavior where an entire day's crashes (500 not being atypical) would go in a single insert query.

This indicates that something at CARTO has slowed down, such that queries that previously inserted 500 records in <30 seconds, now take 20 seconds to insert 100 records.

GreenInfo-Network / nyc-crash-mapper-etl-script

Missing Feb 2020 crashes, exceeding API limits #25