GreenInfo-Network / nyc-crash-mapper-etl-script

Extract, Transform, and Load script for fetching new data from the NYC Open Data Portal's vehicle collision data and loading into the NYC Crash Mapper table on CARTO.
3 stars 0 forks source link

find_updated_killcounts() bailing with JSON error #17

Closed gregallensworth closed 5 years ago

gregallensworth commented 5 years ago

Discovered while debugging for #15

The find_updated_killcounts() function looks back through existing records, so as to find those which are in CARTO but their kill/injure counts may have changed since then (typically an injury who later died after the crash was logged).

During the Fetching CARTO entries step:

  File "main.py", line 566, in find_updated_killcounts
    'q': "SELECT * FROM {0} WHERE socrata_id IN ({1})".format(CARTO_CRASHES_TABLE, crashidlist),

ValueError: No JSON object could be decoded
gregallensworth commented 5 years ago

My first hunch turned out to be right: the generated SQL query is just too long, as this wasn't supposed to go back so far in time, finding so many records.

 02:06:04 PM - INFO - Got 1694 SODA entries updated since 2018-08-06

Many of these have 7-digit socrata_id values. Including commas and all, that's 10kB+ in just the listing. Many browsers and web server APIs will cancel any request over 4K in length, on account that it couldn't possibly be legitimate.

A second cause of this issue, looks like New York went through and did a major update to many records in a few days. Running through one 10-day period at a time, I find 50-80 records is a typical number of hits for a 10-day period. But in mid-late August SODA shows an unusually large number of record updates in a short time:

Find SODA records updated since 2018-08-22 until 2018-08-25
Got 1222 SODA entries updated between 2018-08-22 and 2018-08-25

This single block of 1222 updates alone for this fluke of a three-day period would comprise a payload of 10KB, beyond what could be requested of CARTO using the API. How can we best handle such an overwhelming number of records, limiting it to 200ish records at a time when we ask CARTO?

gregallensworth commented 5 years ago

Commit 93ece3a fixes this and brings some new touches: