cenpy-devs / cenpy

Explore and download data from Census APIs

Really high latency for many cenpy operations #157

Open brnt opened 1 year ago

brnt commented 1 year ago

I pulled out a project I hadn't used in a month or two and found that cenpy now introduces a huge amount of latency. Importing the module does it. Instantiating a cenpy.products.ACS object does it again.

To narrow in on just the import statement:

```
tinker-toys/live_map ∴ time echo 'import cenpy'|python
echo 'import cenpy'  0.00s user 0.00s system 36% cpu 0.003 total
python  1.83s user 0.77s system 2% cpu 1:42.25 total
```

Note that the import alone takes 1 min 42 sec (!). I've tried it both with and without an API key (including deleting SITEKEY.txt). Also note that this was the second run of the same command, just in case there was some module compilation happening the first time.
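If it helps narrow things down, the import can also be profiled with standard CPython tooling (nothing cenpy-specific) to see which call is actually blocking:

```python
# Generic diagnostic: profile the top-level import to see which calls
# dominate the wall time, then print the 20 largest cumulative entries.
import cProfile
import pstats

cProfile.run("import cenpy", "cenpy_import.prof")
pstats.Stats("cenpy_import.prof").sort_stats("cumulative").print_stats(20)
```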

I've also double-checked that it's not a network problem. Pings look normal (sample of output during import test above):

```
64 bytes from 172.217.12.14: icmp_seq=266 ttl=57 time=14.131 ms
64 bytes from 172.217.12.14: icmp_seq=267 ttl=57 time=12.664 ms
64 bytes from 172.217.12.14: icmp_seq=268 ttl=57 time=12.661 ms
64 bytes from 172.217.12.14: icmp_seq=269 ttl=57 time=16.833 ms
64 bytes from 172.217.12.14: icmp_seq=270 ttl=57 time=17.587 ms
64 bytes from 172.217.12.14: icmp_seq=271 ttl=57 time=18.663 ms
64 bytes from 172.217.12.14: icmp_seq=272 ttl=57 time=17.251 ms
64 bytes from 172.217.12.14: icmp_seq=273 ttl=57 time=12.557 ms
64 bytes from 172.217.12.14: icmp_seq=274 ttl=57 time=16.647 ms
```

I realize that this may be an issue with the census.gov servers rather than cenpy. It may also be an issue with macOS (see upgrade note below).

Potentially relevant info:

brnt commented 1 year ago

Follow-up question: could this high latency be some sort of grey-list throttling on the part of api.census.gov? The reason I ask is that my IP address appears to have been blacklisted. I don't get any kind of definitive message from the server, but connecting through a VPN works at full speed, while the remote endpoint abruptly drops the connection when I connect directly.

For the record, I haven't been pounding the API at all. I've rarely touched it since reporting this latency two weeks ago. Not sure how this IP address might have gotten blacklisted. And my API key still works from other IP addresses.

ljwolf commented 1 year ago

Hi, sorry for the delay in replying.

I haven't been able to replicate this myself, but I have heard of other users hitting really high latency with the service. I've reached out to the USCB folks to see whether there's been a policy change that @cenpy-devs missed, but I haven't seen anything or gotten a response.

It's entirely possible that the rate limiting is IP specific. Are you accessing it from a shared endpoint?

I've got a few ideas on how to make any greylisting less likely, and we're working to spec a Google Summer of Code project around this. Mainly, we hope to start using requests.Session() objects with a cenpy-specific user agent string, rather than making ad hoc requests directly.
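Roughly, the idea is something like the sketch below (illustrative only; the actual user agent string and request wiring are still to be decided):

```python
# Sketch of the planned approach, not current cenpy behaviour: route all
# API calls through one Session that identifies the client explicitly.
import requests

session = requests.Session()
# The user agent value here is made up for illustration.
session.headers.update({"User-Agent": "cenpy (https://github.com/cenpy-devs/cenpy)"})

# Example ACS 5-year request; the endpoint and variables are arbitrary.
# Reusing the session pools connections and gives the API a consistent,
# identifiable client to rate-limit against.
resp = session.get(
    "https://api.census.gov/data/2019/acs/acs5",
    params={"get": "NAME,B01001_001E", "for": "state:*"},
)
resp.raise_for_status()
```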

brnt commented 1 year ago

> Hi, sorry for the delay in replying.

No problem at all. Thanks for taking a minute to respond.

> I haven't been able to replicate this myself, but I have heard of other users hitting really high latency with the service. I've reached out to the USCB folks to see whether there's been a policy change that @cenpy-devs missed, but I haven't seen anything or gotten a response.

> It's entirely possible that the rate limiting is IP specific. Are you accessing it from a shared endpoint?

Not a shared endpoint. If the issue is indeed rate limiting on the USCB side (also my current best guess), then either (1) the query threshold must be extremely low; or (2) cenpy could be making dozens of requests per conceptual query. I haven't dived into the cenpy code to verify, but it could be doing repeated queries for tract-level stats across a city or something. You'll know better than I would.

> I've got a few ideas on how to make any greylisting less likely, and we're working to spec a Google Summer of Code project around this. Mainly, we hope to start using requests.Session() objects with a cenpy-specific user agent string, rather than making ad hoc requests directly.

It might also be worth communicating back to the USCB folks that an email warning when traffic exceeds some threshold would be extremely helpful; they have the email addresses of anyone who has signed up for and verified an API key.

Thanks for your help! I'll keep using the VPN for now, and I'll watch here for any new info.

ljwolf commented 1 year ago

> it could be doing repeated queries for one conceptual query

Indeed, that's exactly what happens. When we wrote the package, there was a 50-column limit on individual queries, so queries for large numbers of columns get split into columnar chunks and stitched back together at the end. The number of requests therefore scales linearly with the number of columns, though this rarely caused issues before.
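For the shape of it (illustrative only, not the actual cenpy implementation):

```python
import pandas as pd

def fetch_in_chunks(variables, fetch_chunk, chunk_size=50):
    """Split a long variable list into chunks of at most chunk_size columns and merge.

    `fetch_chunk` is a stand-in for whatever function actually hits the
    API and returns a DataFrame indexed by geography.
    """
    pieces = []
    for start in range(0, len(variables), chunk_size):
        # One API round trip per chunk, so a 500-variable query at the
        # 50-column limit means ten separate requests.
        pieces.append(fetch_chunk(variables[start:start + chunk_size]))
    return pd.concat(pieces, axis=1)
```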

I'll update here with any changes from USCB.