cenpy-devs / cenpy

Explore and download data from Census APIs

Failure on large TIGER requests #126

Open · ronnie-llamado opened this issue 3 years ago

ronnie-llamado commented 3 years ago

Failure

In test_functional_products.py:

aus = dectest.from_msa("Austin, TX", level="block", variables=["^P003", "P001001"])

Fails and returns:

KeyError: 'Response from API is malformed. You may have submitted too many queries,
formatted the request incorrectly, or experienced significant network connectivity issues. 
Check to make sure that your inputs, like placenames, are spelled correctly, and that 
your geographies match the level at which you intend to query. The original error from 
the Census is:\\n(API ERROR 500:Error performing query operation([]))'

The last recorded pass of this test appears to be on 21 Jan 2021 (see build: #1463.1).

Diagnosis

Error performing query operation

According to Esri's support, when a map service returns "Error performing query operation", it is because an extremely large response has failed (source). The article states that the default maximum response size is 64 MB.

We're able to adjust the number of returned results using the MapServer's resultRecordCount parameter, so I increased it until the request failed.

| Metropolitan Statistical Area (MSA) | Total Features | Features Before Failure | Size Before Failure (MB) |
| --- | --- | --- | --- |
| Austin, TX | 42159 | 22000 | 31.3 |
| Los Angeles-Long Beach-Anaheim, CA | 169626 | 36000 | 31.8 |
| Carson City, NV | 2354 | N/A | N/A |

My limited tests point to a 32MB limit rather than Esri's stated 64MB default, so there may have been a server-side update. If that's confirmed, this failure will probably take a bit more rework to address.
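
For reference, the probing amounted to bumping resultRecordCount until the query errored out. A rough, untested sketch of that loop (the query endpoint, filter, and step size here are placeholders, not the exact call cenpy makes):

```python
import requests

def probe(query_url, record_count):
    """Request up to `record_count` features; report how many came back and the payload size."""
    resp = requests.get(query_url, params={
        "where": "1=1",          # placeholder filter; the real test pulls blocks inside the MSA envelope
        "outFields": "*",
        "f": "json",
        "resultRecordCount": record_count,
    })
    body = resp.json()
    if "error" in body:          # e.g. "Error performing query operation"
        return None, None
    return len(body.get("features", [])), len(resp.content) / 1e6  # feature count, size in MB

# Step the record count up until the service fails:
# for n in range(2000, 48000, 2000):
#     features, size_mb = probe("<layer /query endpoint>", n)
#     print(n, features, size_mb)
#     if features is None:
#         break
```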

ljwolf commented 3 years ago

Yeah, this is definitely an update server-side. I think we'll have to figure out a chunked way to get the features now, to power this kind of thing :/

I'll merge the fix in #125 anyway and start thinking about a large-query fix. What we do in the "data" API is to split the query at 50 columns, then just make repeated requests (with a small delay); a rough sketch of that is below. In this case, we'd need to (1) grab the records within the envelope, (2) split those into chunks based on an estimated request size (which... not sure what the heuristic should be), and then (3) request them serially.
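
Purely as an illustration of that column-splitting pattern (this isn't the actual cenpy internals; `query_fn`, the chunk size, and the delay are stand-ins):

```python
import time

def chunked(seq, size):
    """Yield successive chunks of at most `size` items."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def query_in_column_chunks(query_fn, variables, chunk_size=50, delay=0.5):
    # Split the variable list at `chunk_size` columns, issue repeated requests
    # with a small pause between them, and let the caller stitch the results
    # back together. `query_fn` stands in for whatever hits the Census endpoint.
    results = []
    for chunk in chunked(list(variables), chunk_size):
        results.append(query_fn(chunk))
        time.sleep(delay)
    return results
```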

ronnie-llamado commented 3 years ago

Esri's map service query exposes some parameters that would apply here: returnCountOnly, resultOffset and resultRecordCount (source: Esri Documentation).

A rough version to pull large queries:

  1. Query number of records with returnCountOnly
  2. Query n records at a time until complete with resultOffset and resultRecordCount

That still doesn't address estimating the request size (I'm still unsure about that part), but it simplifies the logic compared to splitting envelopes. A rough sketch is below.
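
Just as a sketch of the two-step approach, assuming a plain requests call against a layer's /query endpoint (only returnCountOnly, resultOffset, and resultRecordCount come from the Esri docs; the URL, page size, and helper name are placeholders):

```python
import requests

def get_all_features(query_url, where="1=1", page_size=5000):
    """Page through an Esri MapServer query using resultOffset/resultRecordCount."""
    base = {"where": where, "outFields": "*", "f": "json"}

    # Step 1: ask the server how many records match.
    count = requests.get(query_url, params={**base, "returnCountOnly": "true"}).json()["count"]

    # Step 2: pull `page_size` records at a time until we have them all.
    features, offset = [], 0
    while offset < count:
        page = requests.get(query_url, params={
            **base,
            "resultOffset": offset,
            "resultRecordCount": page_size,
        }).json()
        features.extend(page.get("features", []))
        offset += page_size
    return features
```

The page size here is a guess; picking it (or deriving it from an estimated per-feature size) is the open question above.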