EOL / tramea

A lightweight server for denormalized EOL data

API performance problems #229

Closed jhpoelen closed 8 years ago

jhpoelen commented 8 years ago

Hi! When looking up common names for humans (Homo sapiens, EOL page id 327955), I see 503 "Service unavailable" responses or very long response times (>10s). This behavior appeared over the last week; things were working reasonably before that.

Please let me know if this is expected behavior.

(screenshot: screen shot 2016-03-23 at 10:35 am)

jhammock commented 8 years ago

@holmesjtg is also reporting similar behavior, as well as unexpected results from the Pages API:

When it works, http://eol.org/api/pages/1.0/1053894.json?images=50&videos=0&sounds=0&maps=0&text=0&iucn=false&subjects=&licenses=all&details=true&common_names=true&synonyms=false&references=false&vetted=0&cache_ttl=120 returns only one image.

jhammock commented 8 years ago

Other APIs are also shaky, but running. (Results after several tries.)

JRice commented 8 years ago

Doubled the number of workers, restarted the containers.

I'll restart Solr tomorrow morning.

...Things seem moderately stable on my end, now. Slow, but...

FWIW, in the last 14 hours or so, we've had 34,407 API hits from 195.221.175.168 and 28,851 from 23.97.187.30 (on ONE of two servers, so probably double that)... The rest of the IPs show moderate tallies.

195.221.175.168 is from Montpellier, France; 23.97.187.30 is from Amsterdam, Netherlands. Anyone know who those might be? Just curious.

jhpoelen commented 8 years ago

Thanks for responding so quickly. I tried to load http://eol.org/api/pages/1.0/327955.json?common_names=true in my Firefox browser and am still getting intermittent 503s. The documentation link http://eol.org/api also returns a 503. Perhaps you have some nodes in your cluster that are not hosting what they're supposed to?

JRice commented 8 years ago

I just blocked an IP (195.221.175.168) that was using search too aggressively. Things might be better now. :S

The problem is almost certainly that, though: too many API searches arriving too quickly. They overwhelm Solr, which chokes up the API servers, and eventually the gateway decides the machine is non-responsive and starts throwing 503s. :S

This should improve when we improve Solr...

jhpoelen commented 8 years ago

Thanks for looking into this; I can imagine that dealing with unwieldy traffic can be a bit of a challenge. I noticed an nginx module that might be able to enforce some kind of rate-limiting policy (http://nginx.org/en/docs/http/ngx_http_limit_req_module.html). You are probably already aware of such modules.
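
For reference, a minimal sketch of how that module is typically wired up; the zone name, memory size, rate, and location here are placeholders, not a recommendation for EOL's actual settings:

    http {
        # track clients by IP; allow roughly 20 requests/minute per address
        limit_req_zone $binary_remote_addr zone=api_limit:10m rate=20r/m;

        server {
            location /api/ {
                # queue a small burst, reject the excess with 429 instead of letting it pile up
                limit_req zone=api_limit burst=5 nodelay;
                limit_req_status 429;
            }
        }
    }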

I've increased the GloBI timeouts from 5 seconds to 5 minutes and the automated tests are passing now. Closing issue.
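
(For illustration only: the change boils down to a larger client-side connect/read timeout. A rough sketch with plain java.net.HttpURLConnection, not GloBI's actual code, would look something like this.)

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class SlowApiClient {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://eol.org/api/pages/1.0/327955.json?common_names=true");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            // previously ~5 * 1000 ms (5 seconds); now 5 minutes to tolerate the slow API
            conn.setConnectTimeout(5 * 60 * 1000);
            conn.setReadTimeout(5 * 60 * 1000);
            System.out.println("HTTP " + conn.getResponseCode());
        }
    }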

JRice commented 8 years ago

I did not actually know about that (as I say often at work, I Am Not A Sysadmin™), so thanks!

I've just implemented that (at a rate of 20 r/m, which is a little higher than I'd love, but slowing it down more than that seems draconian).

Let's see if that stabilizes things a bit.

Thanks!

UPDATE: that didn't do what I wanted it to, so I disabled it. I'll look into similar solutions.

JRice commented 8 years ago

Reopening this, as it continues to be a problem. I want to keep it open until we're sure it's resolved.

JRice commented 8 years ago

Clearly not adequate. Still getting 503s.

JRice commented 8 years ago

I really do think this is fixed, now... but will check back later.

...That said, I think someone is abusing us again. :S Looking into that.

JRice commented 8 years ago

The API seems to have stayed up pretty solidly since the fix, and response times seem pretty good given the super-high load. I'm resolving this now.