Closed: giner closed this issue 5 years ago.
We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/148335845
The labels on this github issue will be updated when the story is started.
After the load test stopped, the application continued to receive requests from gorouter for quite a long time (a few minutes).
Just to clarify: this means that gorouter seems to keep retrying even if the client has gone away. This is fairly surprising behavior: if the client has gone away, retries (and more generally, any upstream connection for that client connection) should be aborted immediately.
The problem is that retries are not limited in time, and depending on the behavior of the application they can take tens of seconds or even minutes.
Gorouter will retry only if it cannot initiate a TCP connection with a selected backend. Gorouter has a dial timeout of 5 seconds. This means that if Gorouter cannot establish a TCP connection to any registered backends, it will return a 502 in a maximum of 15 seconds. On its own, do you find this retry logic to be a problem?
If Gorouter can establish a TCP connection, it will wait for a response with a timeout defined by request_timeout_in_seconds.
When you are running your load test, are you canceling connections? I don't understand how your test could finish and Gorouter is still retrying requests. This would indicate the component downstream to Gorouter still has many connections open and is waiting for responses.
We are aware that Gorouter has an issue where file descriptors are not cleaned up immediately when a client cancels a connection (e.g. killall curl). We are investigating this.
We have added a timeout that closes idle client connections after 5 seconds, but this doesn't address connections with an active request (Gorouter waiting for a response, or retrying backends).
In https://www.pivotaltracker.com/story/show/147977289 we are adding support for a max-connections-per-backend property that should prevent an unresponsive app, or a DoS attack against a single app, from exhausting Gorouter's file descriptors.
Hi @shalako,
I have realized that I'm not always able to reproduce the issue. I'll spend some more time trying to figure out the conditions and isolate it.
I think I have a bigger picture now.
Let's consider the request flow within a simplified scheme (request: client -> gorouter -> backend; response: backend -> gorouter -> client)
and some hypothetical situations:
All above can happen when:
The situations above cause a large number of connections (including half-open connections) and other undesired behaviors, which in turn put load on gorouter (and we see these problems from time to time).
The ideal story would be:
Here is what we can do:
Here are some pictures.
Backend is slow to reply case
Data is still coming from the backend but the client has gone away (note that endpoint_timeout on the gorouter of router_dev_z3 is lowered from 900 to 75)
@giner :
Going through the list of things we can do as suggested
Configure a reasonably short read / write timeout to the backend. Note: I found "endpoint_timeout" in gorouter, which defaults to 900s (too long), but it is not possible to change it now without affecting "drain_timeout". Here is a PR to expose drain_timeout: cloudfoundry-incubator/routing-release#82.
Waiting for @shalako to prioritize that issue.
Implement a read / write timeout to the client (I didn't manage to find one in gorouter's configuration).
Currently the gorouter config does not expose read and write timeouts for the client. Looking through the source code, there are no hard-coded limits either, which means there is no limit on these I/O operations.
Implement TCP keepalive to the backend (in order not to wait for endpoint_timeout if the backend is actually gone) #185
Waiting for @shalako to prioritize that issue.
Add an option to limit the number of connections per backend (as already mentioned above): https://www.pivotaltracker.com/story/show/147977289
Already finished and in the final release of routing.
So the only thing left in this issue would be to consider connection read and write timeouts. Is it reasonable to wait until the recommendations above are addressed and then check whether the issue still persists?
Regards, Shash
Max Connections per backend is supported. cloudfoundry/routing-release#82 was closed by @giner. We've closed #185 in favor of support for the same feature in Envoy.
As that addresses the suggested solutions, we'll close this one, too.
Thanks for submitting an issue to gorouter. We are always trying to improve! To help us, please fill out the following template.

Issue
Single misbehaving or overloaded application can exhaust gorouter
Context
cf-release 250
Steps to Reproduce
Expected result
Current result
Possible Fix
Right now gorouter retries 3 times in case a connection is not accepted by an application instance. The problem is that retries are not limited in time, and depending on the behavior of the application they can take tens of seconds or even minutes. One possible fix would be to implement MaxRetriesTimeout in addition to MaxRetries. Please read the comments.