My first hunch turned out to be right: the generated SQL query is simply too long. The lookup was never supposed to go this far back in time and find this many records.
02:06:04 PM - INFO - Got 1694 SODA entries updated since 2018-08-06
Many of these have 7-digit `socrata_id` values. At roughly 8 characters per ID (7 digits plus a comma), 1694 of them come to well over 10 kB in just the listing. Many browsers and web server APIs will cancel any request over 4 kB in length, on the grounds that it couldn't possibly be legitimate.
A second cause of this issue: it looks like New York went through and did a major update to many records within a few days. Stepping through one 10-day period at a time, I find 50-80 records is a typical number of hits for a 10-day period. But in mid-to-late August, SODA shows an unusually large number of record updates in a short time:
Find SODA records updated since 2018-08-22 until 2018-08-25
Got 1222 SODA entries updated between 2018-08-22 and 2018-08-25
This single block of 1222 updates, from this fluke of a three-day period alone, would comprise a payload of roughly 10 kB, beyond what can be requested of CARTO through the API. How can we best handle such an overwhelming number of records, limiting it to roughly 200 at a time when we ask CARTO?
Commit 93ece3a fixes this and brings some new touches:
`find_updated_killcounts()` now fetches CARTO IDs in chunks, so even if there are 1400 Socrata IDs to look up, they will be collected in chunks of 200 so as not to overload CARTO with a giant request.
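For reference, here is a minimal sketch of that chunking approach. The helper names, table/endpoint parameters, and the use of the CARTO SQL API via `requests` are assumptions for illustration, not the actual code in the commit:

```python
import requests

CHUNK_SIZE = 200  # ask CARTO for at most this many socrata_id values per query


def chunked(ids, size=CHUNK_SIZE):
    """Yield successive lists of at most `size` IDs."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]


def fetch_carto_rows(socrata_ids, carto_sql_url, table):
    """Look up existing CARTO rows one chunk at a time, so no single
    request carries a multi-kilobyte IN (...) list."""
    rows = []
    for chunk in chunked(socrata_ids):
        id_list = ",".join(str(i) for i in chunk)
        sql = f"SELECT * FROM {table} WHERE socrata_id IN ({id_list})"
        resp = requests.get(carto_sql_url, params={"q": sql})
        resp.raise_for_status()
        rows.extend(resp.json().get("rows", []))
    return rows
```

Each request now carries at most 200 IDs (about 1.6 kB of listing), well under the ~4 kB request-length limit, regardless of how many Socrata IDs need to be looked up in total.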
I have defined various time periods as constants for easier modification and reading, e.g. `FETCH_HOWMANY_MONTHS = 2  # when looking for new records in SODA, look back how many months?`
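Roughly, such a constant would be used along these lines; the date math shown here is illustrative and not necessarily how the commit computes it:

```python
from datetime import datetime, timedelta

FETCH_HOWMANY_MONTHS = 2  # when looking for new records in SODA, look back how many months?

# turn the month count into a "since" date for the SODA query
since = datetime.now() - timedelta(days=30 * FETCH_HOWMANY_MONTHS)
print(f"Find SODA records updated since {since:%Y-%m-%d}")
```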
Discovered while debugging for #15
The `find_updated_killcounts()` function looks back through existing records to find those which are already in CARTO but whose kill/injure counts may have changed since then (typically an injured person who later died after the crash was logged). During the Fetching CARTO entries step: